


A Double Residual Compression Algorithm for Efficient Distributed Learning

Xiaorui Liu, Yao Li, Jiliang Tang and Ming Yan
{xiaorui, liyao6, tangjili, myan}@msu.edu

    Michigan State University

    Abstract

Large-scale machine learning models are often trained by parallel stochastic gradient descent algorithms. However, the communication cost of gradient aggregation and model synchronization between the master and worker nodes becomes the major obstacle for efficient learning as the number of workers and the dimension of the model increase. In this paper, we propose DORE, a DOuble REsidual compression stochastic gradient descent algorithm, to reduce over 95% of the overall communication such that the obstacle can be immensely mitigated. Our theoretical analyses demonstrate that the proposed strategy has superior convergence properties for both strongly convex and nonconvex objective functions. The experimental results validate that DORE achieves the best communication efficiency while maintaining similar model accuracy and convergence speed in comparison with state-of-the-art baselines.

    1 Introduction

Stochastic gradient algorithms (Bottou, 2010) are efficient at minimizing the objective function f : R^d → R, which is usually defined as f(x) := E_{ξ∼D}[ℓ(x, ξ)], where ℓ(x, ξ) is the objective function defined on data sample ξ and model parameter x. A basic stochastic gradient descent (SGD) repeats the gradient "descent" step x^{k+1} = x^k − γ g(x^k), where x^k is the current iterate and γ is the step size. The stochastic gradient g(x^k) is computed on an i.i.d. sampled mini-batch from the distribution of the training data D and serves as the estimator of the full gradient ∇f(x^k). In the context of large-scale machine learning, the number of data samples and the model size are usually very large. Distributed learning utilizes a large number of computers/cores to perform the stochastic algorithms, aiming at reducing the training time. It has attracted extensive attention due to the demand for highly efficient model training (Abadi et al., 2016; Chen et al., 2015; Li et al., 2014; You et al., 2018).

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

In this paper, we focus on the data-parallel SGD (Dean et al., 2012; Lian et al., 2015; Zinkevich et al., 2010), which provides a scalable solution to speed up the training process by distributing the whole data to multiple computing nodes. The objective can be written as:

minimize_{x∈R^d}  f(x) + R(x) = (1/n) Σ_{i=1}^n E_{ξ∼D_i}[ℓ(x, ξ)] + R(x),   with f_i(x) := E_{ξ∼D_i}[ℓ(x, ξ)],

where each f_i(x) is a local objective function of worker node i defined based on the allocated data under distribution D_i, and R : R^d → R is usually a closed convex regularizer.

In the well-known parameter server framework (Li et al., 2014; Zinkevich et al., 2010), during each iteration each worker node evaluates its own stochastic gradient ∇̃f_i(x^k) and sends it to the master node, which collects all gradients and calculates their average (1/n) Σ_{i=1}^n ∇̃f_i(x^k). Then the master node takes the gradient descent step with the averaged gradient and broadcasts the new model parameter x^{k+1} to all worker nodes. This framework makes use of the computational resources from all nodes. In reality, however, the network bandwidth is often limited. Thus, the communication cost for the gradient transmission and model synchronization becomes the dominating bottleneck as the number of nodes and the model size increase, which hinders the scalability and efficiency of SGD.

One common way to reduce the communication cost is to compress the gradient information by either gradient sparsification or quantization (Alistarh et al., 2017; Seide et al., 2014; Stich et al., 2018; Strom, 2015; Wang et al., 2017; Wangni et al., 2018; Wen et al., 2017; Wu et al., 2018) such that many fewer bits of information need to be transmitted. However, little attention has been paid to how to reduce the communication cost for model synchronization and to the corresponding theoretical guarantees. Obviously, the model shares the same size as the gradient, and so does the communication cost. Thus, merely compressing the gradient can reduce at most 50% of the communication cost, which suggests the importance of model compression. Notably, the compression of model parameters is much more challenging than gradient compression. One key obstacle is that its compression error cannot be well controlled by the step size γ, and thus it cannot diminish like that in the gradient compression (Tang et al., 2018). In this paper, we aim to bridge this gap by investigating algorithms to compress the full communication in the optimization process and understanding their theoretical properties. Our contributions can be summarized as:

• We propose DORE, which compresses both the gradient and the model information such that more than 95% of the communication cost can be reduced.

• We provide theoretical analyses that guarantee the convergence of DORE under strongly convex and nonconvex assumptions without the bounded gradient assumption.

• Our experiments demonstrate the superior efficiency of DORE compared with state-of-the-art baselines without degrading the convergence speed or the model accuracy.

    2 Background

Recently, many works try to reduce the communication cost to speed up distributed learning, especially for deep learning applications, where the size of the model is typically very large (so is the size of the gradient) while the network bandwidth is relatively limited. Below we briefly review relevant papers.

Gradient quantization and sparsification. Recent works (Alistarh et al., 2017; Seide et al., 2014; Wen et al., 2017; Mishchenko et al., 2019; Bernstein et al., 2018) have shown that the information of the gradient can be quantized into a lower-precision vector such that fewer bits are needed in communication without loss of accuracy. Seide et al. (2014) proposed 1Bit SGD, which keeps only the sign of each element in the gradient. It empirically works well, and Bernstein et al. (2018) provided a systematic theoretical analysis. QSGD (Alistarh et al., 2017) utilizes an unbiased multi-level random quantization to compress the gradient, while Terngrad (Wen et al., 2017) quantizes the gradient into ternary numbers {0, ±1}. In DIANA (Mishchenko et al., 2019), the gradient difference is compressed and communicated, contributing to the estimator of the gradient on the master node.

Another effective strategy to reduce the communication cost is sparsification. Wangni et al. (2018) proposed a convex optimization formulation to minimize the coding length of stochastic gradients. A more aggressive sparsification method is to keep only the elements with relatively larger magnitude in the gradients, such as top-k sparsification (Stich et al., 2018; Strom, 2015; Aji and Heafield, 2017).

Model synchronization. The typical way for model synchronization is to broadcast the model parameters to all worker nodes. Some works (Wang et al., 2017; Jordan et al., 2019) have been proposed to reduce the model size by enforcing sparsity, but this cannot be applied to general optimization problems. Some alternatives, including QSGD (Alistarh et al., 2017) and ECQ-SGD (Wu et al., 2018), choose to broadcast all quantized gradients to all other workers such that every worker can perform the model update independently. However, all-to-all communication is not efficient since the number of transmitted bits increases dramatically in large-scale networks. DoubleSqueeze (Tang et al., 2019) applies compression on the averaged gradient with error compensation to speed up model synchronization.

Error compensation. Seide et al. (2014) applied error compensation to 1Bit-SGD and achieved negligible loss of accuracy empirically. Recently, error compensation was further studied (Wu et al., 2018; Stich et al., 2018; Karimireddy et al., 2019) to mitigate the error caused by compression. The general idea is to add the compression error to the next compression step:

    ĝ = Q(g + e), e = (g + e)− ĝ.
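As a concrete illustration of this compensation loop, here is a minimal sketch (the quantizer Q and the stream of gradients are placeholders, not a specific method from the papers above):

```python
import numpy as np

def compensated_stream(grads, Q):
    """Error compensation: g_hat = Q(g + e), e = (g + e) - g_hat,
    carrying the compression error e over to the next step."""
    e = np.zeros_like(grads[0])
    for g in grads:
        corrected = g + e         # add the error left over from the previous step
        g_hat = Q(corrected)      # compress the corrected gradient
        e = corrected - g_hat     # remember what this compression step lost
        yield g_hat
```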

However, to the best of our knowledge, most of the algorithms with error compensation (Wu et al., 2018; Stich et al., 2018; Karimireddy et al., 2019; Tang et al., 2019) need to assume bounded gradient, i.e., E‖g‖² ≤ B, and the convergence rate depends on this bound.

Contributions of DORE. The papers most related to DORE are DIANA (Mishchenko et al., 2019) and DoubleSqueeze (Tang et al., 2019). Similar to DORE, DIANA compresses the gradient difference on the worker side and achieves a good convergence rate. However, it does not consider compression in model synchronization, so at most 50% of the communication cost can be saved. DoubleSqueeze applies compression with error compensation on both the worker and server sides, but it only considers nonconvex objective functions. Moreover, its analysis relies on a bounded gradient assumption, i.e., E‖g‖² ≤ B, and the convergence error has a dependency on the gradient bound, like most existing error compensation works.

In general, the uniform bound on the norm of the stochastic gradient is a strong assumption which might not hold in some cases. For example, it is violated in the strongly convex case (Nguyen et al., 2018; Gower et al., 2019). In this paper, we design DORE, the first algorithm which utilizes gradient and model compression with error compensation without assuming bounded gradients. Unlike existing error compensation works, we provide a linear convergence rate to the O(σ) neighborhood of the optimal solution for strongly convex functions and a sublinear rate to the stationary point for nonconvex functions with linear speedup. In Table 1, we compare the asymptotic convergence rates of different quantized SGDs with DORE.

    3 Double Residual Compression SGD

In this section, we introduce the proposed DOuble REsidual compression SGD (DORE) algorithm. Before that, we introduce a common assumption for the compression operator.

In this work, we adopt an assumption from (Alistarh et al., 2017; Wen et al., 2017; Mishchenko et al., 2019) that the compression variance is linearly proportional to the magnitude.

Assumption 1. The stochastic compression operator Q : R^d → R^d is unbiased, i.e., EQ(x) = x, and satisfies

    E‖Q(x)− x‖2 ≤ C‖x‖2, (1)

for a nonnegative constant C that is independent of x. We use x̂ to denote the compressed x, i.e., x̂ ∼ Q(x).

Many feasible compression operators can be applied in our algorithm since our theoretical analyses are built on this common assumption. Some examples of feasible stochastic compression operators include:

• No Compression: C = 0 when there is no compression.

• Stochastic Quantization: A real number x ∈ [a, b] (a < b) is set to a with probability (b − x)/(b − a) and to b with probability (x − a)/(b − a), where a and b are predefined quantization levels (Alistarh et al., 2017). It satisfies Assumption 1 when ab > 0 and a < b.

• Stochastic Sparsification: A real number x is set to 0 with probability 1 − p and to x/p with probability p (Wen et al., 2017). It satisfies Assumption 1 with C = (1/p) − 1.

Figure 1: An illustration of DORE. Each worker sends its compressed gradient residual to the master, and the master broadcasts the compressed model residual back to the workers.

• p-norm Quantization: A vector x is quantized element-wise by Q_p(x) = ‖x‖_p sign(x) ∘ ξ, where ∘ is the Hadamard product and ξ is a Bernoulli random vector with ξ_i ∼ Bernoulli(|x_i|/‖x‖_p). It satisfies Assumption 1 with C = max_{x∈R^d} ‖x‖_1‖x‖_p/‖x‖_2² − 1 (Mishchenko et al., 2019). To decrease the constant C for a higher accuracy, a vector x ∈ R^d can be further decomposed into blocks, i.e., x = (x(1)^T, x(2)^T, · · · , x(m)^T)^T with x(l) ∈ R^{d_l} and Σ_{l=1}^m d_l = d, and the blocks can be compressed independently (a sketch of this quantizer is given below).
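A minimal sketch of the p-norm quantizer described above (a single block; the function name is our own):

```python
import numpy as np

def pnorm_quantize(x, p=np.inf):
    """Bernoulli p-norm quantization Q_p(x) = ||x||_p * sign(x) * xi with
    xi_i ~ Bernoulli(|x_i| / ||x||_p); unbiased, so it satisfies Assumption 1."""
    norm = np.linalg.norm(x, ord=p)
    if norm == 0.0:
        return np.zeros_like(x)
    prob = np.abs(x) / norm              # selection probability of each coordinate
    xi = np.random.binomial(1, prob)     # Bernoulli mask
    return norm * np.sign(x) * xi
```

Since E[ξ_i] = |x_i|/‖x‖_p, each coordinate has expectation ‖x‖_p sign(x_i) |x_i|/‖x‖_p = x_i, which is exactly the unbiasedness required by Assumption 1.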

    3.1 The Proposed DORE

Many previous works (Alistarh et al., 2017; Seide et al., 2014; Wen et al., 2017) reduce the communication cost of P-SGD by quantizing the stochastic gradient before sending it to the master node, but there are several intrinsic issues.

First, these algorithms incur extra optimization error intrinsically. Consider the case when the algorithm converges to the optimal point x*, where we have (1/n) Σ_{i=1}^n ∇f_i(x*) = 0. However, the data distributions may be different for different worker nodes in general, and thus we may have ∇f_i(x*) ≠ ∇f_j(x*) for all i, j ∈ {1, . . . , n} with i ≠ j. In other words, each individual ∇f_i(x*) may be far away from zero. This causes a large compression variance according to Assumption 1, which indicates that the upper bound of the compression variance E‖Q(x) − x‖² is linearly proportional to the magnitude of x.

Second, most existing algorithms (Seide et al., 2014; Alistarh et al., 2017; Wen et al., 2017; Bernstein et al., 2018; Wu et al., 2018; Mishchenko et al., 2019) need to broadcast the model or gradient to all worker nodes in each iteration. This is a considerable bottleneck for efficient optimization since the amount of bits to transmit is the same as for the uncompressed gradient. DoubleSqueeze (Tang et al., 2019) is able to apply compression on both the worker and server sides. However, its analysis depends on a strong bounded gradient assumption. Meanwhile, no theoretical guarantees are provided for convex problems.

We propose DORE to address all the aforementioned issues. Our motivation is that the gradient should change smoothly for smooth functions, so each worker node can keep a state variable h_i^k to track its previous gradient information. As a result, the residual between the new gradient and the state h_i^k should decrease, and the compression variance of the residual can be well bounded. On the other hand, as the algorithm converges, the model only changes slightly. Therefore, we propose to compress the model residual such that the compression variance can be minimized and also well bounded. We also compensate the model residual compression error in the next iteration to achieve better convergence. Due to the advantages of the proposed double residual compression scheme, we can derive the fastest convergence rate through analyses without the bounded gradient assumption. Below are the key steps of our algorithm, as shown in Algorithm 1 and Figure 1:

[Lines 4-9]: each worker node sends the compressed gradient residual Δ̂_i^k to the master node and updates its state h_i^k with Δ̂_i^k;

[Lines 13-15]: the master node gathers the compressed gradient residuals {Δ̂_i^k} from all worker nodes and recovers the averaged gradient ĝ^k based on its state h^k;

[Line 16]: the master node applies the gradient descent step (possibly with the proximal operator);

[Lines 18-22]: the master node broadcasts the compressed model residual with error compensation, q̂^k, to all worker nodes and updates the model;

[Lines 10-11]: each worker node receives the compressed model residual q̂^k and updates its model x̂_i^k.

In the algorithm, the state h_i^k serves as an exponential moving average of the local gradient in expectation, i.e., E_Q h_i^{k+1} = (1 − α)h_i^k + αg_i^k, as proved in Lemma 1. Therefore, as the iterates approach the optimum, h_i^k also approaches the local gradient ∇f_i(x*) rapidly, which contributes to a small gradient residual and consequently a small compression variance. Similar difference compression techniques are also proposed in DIANA and its variance-reduced variant (Mishchenko et al., 2019; Horváth et al., 2019).

¹Equations in the curly brackets are just notations for the proof and do not need to be actually computed.

    3.2 Discussion

In this subsection, we provide more detailed discussions about DORE, including model initialization, model update, the special smooth case, as well as the compression rate of communication.

Initialization. It is important to take the identical initialization x̂^0 for all worker and master nodes. This is easily ensured by either setting the same random seed or broadcasting the model once at the beginning. In this way, although we do not need to broadcast the model parameters directly, every worker node updates the model x̂^k in the same way, so their model parameters stay identical. Otherwise, the model inconsistency would need to be considered.
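A minimal sketch of the seed-based variant of this initialization (the dimension, seed value, and function name are arbitrary choices for illustration):

```python
import numpy as np

def init_model(d, seed=42):
    """Identical initialization without a broadcast: every worker and the master
    call this with the same seed, so all copies of x_hat^0 coincide."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(d)
```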

Model update. It is worth noting that although we could choose the accurate model x^{k+1} as the next iterate on the master node, we use x̂^{k+1} instead. In this way, we ensure that the gradient descent step is applied based on the exact stochastic gradient, which is evaluated at x̂_i^k on each worker node. This dispels the intricacy of dealing with an inexact gradient evaluated at x^k and thus simplifies the convergence analysis.

Smooth case. In the smooth case, i.e., R = 0, Algorithm 1 can be simplified. The master node quantizes the recovered averaged gradient with error compensation and broadcasts it to all worker nodes. The details of this simplified case can be found in Appendix A.4.

Compression rate. The compression of the gradient information can reduce at most 50% of the communication cost since it only considers compression during gradient aggregation while ignoring the model synchronization. However, DORE can further cut down the remaining 50% of the communication.

Take the blockwise p-norm quantization as an example: every element of x can be represented by about 1.5 bits using the simple ternary coding {0, ±1}, along with one 32-bit magnitude for each block. If we consider a uniform block size b, the number of bits needed to represent a d-dimensional vector of 32-bit floating-point numbers is reduced from 32d bits to (1.5d + 32d/b) bits. As long as the block size b is relatively large with respect to the constant 32, the cost 32d/b for storing the floating-point magnitudes is relatively small, so the compression rate is close to 32d/(1.5d) ≈ 21.3 times (for example, 19.7 times when b = 256).

Applying this quantization, QSGD, Terngrad, MEM-SGD, and DIANA need to transmit (32d + 1.5d + 32d/b) bits per iteration, and thus they are able to cut down about 47% of the overall 2 × 32d bits per iteration through gradient compression when b = 256. With DORE, we only need to transmit 2(1.5d + 32d/b) bits per iteration. Thus DORE can reduce over 95% of the total communication by compressing both the gradient and the model transmission. More efficient coding techniques such as Elias coding (Elias, 1975) can be applied to further reduce the number of bits per iteration.
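The arithmetic behind these percentages can be checked directly with a tiny script (the model dimension d is a hypothetical value; the ~1.5 bits/element ternary cost is the figure implied by the ratios quoted above):

```python
d, b = 25_000_000, 256              # model dimension (hypothetical) and block size
full = 32 * d                        # uncompressed: 32-bit floats
tern = 1.5 * d + 32 * d / b          # ~1.5 bits/element ternary code + one 32-bit scale per block

print(full / tern)                       # ~19.7x compression per message when b = 256
print(1 - (32 * d + tern) / (2 * full))  # ~0.47: gradient-only compression (QSGD/DIANA style)
print(1 - 2 * tern / (2 * full))         # ~0.95: DORE compresses both directions
```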


Algorithm 1 The Proposed DORE.¹

1: Input: Stepsizes α, β, γ, η; initialize h^0 = h_i^0 = 0^d, x̂_i^0 = x̂^0, ∀i ∈ {1, . . . , n}.
2: for k = 1, 2, · · · , K − 1 do
3:   For each worker i ∈ {1, 2, · · · , n}:
4:     Sample g_i^k such that E[g_i^k | x̂_i^k] = ∇f_i(x̂_i^k)
5:     Gradient residual: Δ_i^k = g_i^k − h_i^k
6:     Compression: Δ̂_i^k = Q(Δ_i^k)
7:     h_i^{k+1} = h_i^k + αΔ̂_i^k
8:     { ĝ_i^k = h_i^k + Δ̂_i^k }
9:     Send Δ̂_i^k to the master
10:    Receive q̂^k from the master
11:    x̂_i^{k+1} = x̂_i^k + βq̂^k
12:  For the master:
13:    Receive {Δ̂_i^k} from the workers
14:    Δ̂^k = (1/n) Σ_{i=1}^n Δ̂_i^k
15:    ĝ^k = h^k + Δ̂^k   { = (1/n) Σ_{i=1}^n ĝ_i^k }
16:    x^{k+1} = prox_{γR}(x̂^k − γĝ^k)
17:    h^{k+1} = h^k + αΔ̂^k
18:    Model residual: q^k = x^{k+1} − x̂^k + ηe^k
19:    Compression: q̂^k = Q(q^k)
20:    e^{k+1} = q^k − q̂^k
21:    x̂^{k+1} = x̂^k + βq̂^k
22:    Broadcast q̂^k to the workers
23: end for
24: Output: x̂^K or any x̂_i^K
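For concreteness, a minimal single-process NumPy sketch of one iteration of Algorithm 1 (the dictionaries standing in for per-node state and the in-memory "send"/"broadcast" are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def dore_step(workers, master, grads, Q, alpha, beta, gamma, eta, prox=lambda z: z):
    """One iteration of Algorithm 1. `workers` is a list of dicts with keys
    'h' (gradient state) and 'x_hat' (local model copy); `master` is a dict with
    keys 'h', 'x_hat', and 'e' (model-residual compression error); `grads` holds
    the stochastic gradients g_i^k evaluated at the workers' x_hat."""
    # Workers: compress and send gradient residuals (lines 4-9).
    deltas_hat = []
    for w, g in zip(workers, grads):
        delta_hat = Q(g - w['h'])              # compressed gradient residual
        w['h'] = w['h'] + alpha * delta_hat    # local state update
        deltas_hat.append(delta_hat)

    # Master: recover the averaged gradient and take a proximal gradient step (lines 13-17).
    delta_hat = np.mean(deltas_hat, axis=0)
    g_hat = master['h'] + delta_hat
    x_next = prox(master['x_hat'] - gamma * g_hat)
    master['h'] = master['h'] + alpha * delta_hat

    # Master: compress the model residual with error compensation and broadcast (lines 18-22).
    q = x_next - master['x_hat'] + eta * master['e']
    q_hat = Q(q)
    master['e'] = q - q_hat
    master['x_hat'] = master['x_hat'] + beta * q_hat

    # Workers: receive q_hat and update their model copies (lines 10-11).
    for w in workers:
        w['x_hat'] = w['x_hat'] + beta * q_hat
```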


    4 Convergence Analysis

To show the convergence of DORE, we will make the following commonly used assumptions when needed.

Assumption 2. Each worker node samples an unbiased estimator of the gradient stochastically with bounded variance, i.e., for i = 1, 2, · · · , n and ∀x ∈ R^d,

E[g_i | x] = ∇f_i(x),   E‖g_i − ∇f_i(x)‖² ≤ σ_i²,   (2)

where g_i is the estimator of ∇f_i at x. In addition, we define σ² = (1/n) Σ_{i=1}^n σ_i².

Assumption 3. Each f_i is L-Lipschitz differentiable, i.e., for i = 1, 2, · · · , n and ∀x, y ∈ R^d,

f_i(x) ≤ f_i(y) + 〈∇f_i(y), x − y〉 + (L/2)‖x − y‖².   (3)

Assumption 4. Each f_i is µ-strongly convex (µ ≥ 0), i.e., for i = 1, 2, · · · , n and ∀x, y ∈ R^d,

f_i(x) ≥ f_i(y) + 〈∇f_i(y), x − y〉 + (µ/2)‖x − y‖².   (4)

For simplicity, we use the same compression operator for all worker nodes, and the master node can apply a different compression operator. We denote the constants in Assumption 1 as C_q and C_q^m for the worker and master nodes, respectively. Then we set α and β in both algorithms to satisfy

(1 − sqrt(1 − 4C_q(C_q+1)/(nc))) / (2(C_q+1)) ≤ α ≤ (1 + sqrt(1 − 4C_q(C_q+1)/(nc))) / (2(C_q+1)),
0 < β ≤ 1/(C_q^m + 1),   (5)

with c ≥ 4C_q(C_q+1)/n. We consider two scenarios in the following two subsections: f is strongly convex with a convex regularizer R, and f is nonconvex with R = 0.

    4.1 The strongly convex case

Theorem 1. Under Assumptions 1-4, if α and β in Algorithm 1 satisfy (5), and η and γ satisfy

η < min( (−C_q^m + sqrt((C_q^m)² + 4(1 − (C_q^m+1)β))) / (2C_q^m),   4µL / ((µ+L)²(1+cα) − 4µL) ),   (6)

η(µ+L) / (2(1+η)µL) ≤ γ ≤ 2 / ((1+cα)(µ+L)),   (7)

then we have

V^{k+1} ≤ ρ^k V^1 + ((1+η)(1+ncα) / (n(1−ρ))) βγ²σ²,   (8)

with

V^k = β(1 − (C_q^m+1)β) E‖q^{k−1}‖² + E‖x̂^k − x*‖² + ((1+η)cβγ²/n) Σ_{i=1}^n E‖h_i^k − ∇f_i(x*)‖²,

ρ = max( (η²+η)C_q^m / (1 − (C_q^m+1)β),   1 + ηβ − 2(1+η)βγµL/(µ+L),   1 − α ) < 1.

Corollary 1. When there is no error compensation and we set η = 0, then ρ = max(1 − 2βγµL/(µ+L), 1 − α). If we further set

α = 1/(2(C_q+1)),   β = 1/(C_q^m+1),   c = 4C_q(C_q+1)/n,   (9)

and choose the largest step size γ = 2/((µ+L)(1+2C_q/n)), the convergence factor is

(1−ρ)^{-1} = max( 2(C_q+1),   (C_q^m+1) ((µ+L)²/(2µL)) (1/2 + C_q/n) ).   (10)
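As a sanity check (our own substitution, not an additional result from the paper), (10) follows from (9) and the step size choice:

```latex
% With \eta = 0 we have 1-\rho = \min\big(\alpha,\ \tfrac{2\beta\gamma\mu L}{\mu+L}\big).
% Plugging in \alpha = \tfrac{1}{2(C_q+1)}, \beta = \tfrac{1}{C_q^m+1},
% and \gamma = \tfrac{2}{(\mu+L)(1+2C_q/n)}:
\[
(1-\rho)^{-1}
  = \max\!\Big(\tfrac{1}{\alpha},\ \tfrac{\mu+L}{2\beta\gamma\mu L}\Big)
  = \max\!\Big(2(C_q+1),\ (C_q^m+1)\,\tfrac{(\mu+L)^2}{4\mu L}\big(1+\tfrac{2C_q}{n}\big)\Big)
  = \max\!\Big(2(C_q+1),\ (C_q^m+1)\,\tfrac{(\mu+L)^2}{2\mu L}\big(\tfrac12+\tfrac{C_q}{n}\big)\Big).
\]
```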

Remark 1. In particular, suppose {Δ_i}_{i=1}^n are compressed using the Bernoulli p-norm quantization with the largest block size d_max; then C_q = 1/α_w − 1, with α_w = min_{0≠x∈R^{d_max}} ‖x‖_2² / (‖x‖_1‖x‖_p) ≤ 1. Similarly, q is compressed using the Bernoulli p-norm quantization with C_q^m = 1/α_m − 1. Then the linear convergence factor is

(1−ρ)^{-1} = max{ 2/α_w,   (1/α_m) ((µ+L)²/(4µL)) (1 − 2/n + 2/(nα_w)) }.   (11)


While the corresponding result of DIANA in (Mishchenko et al., 2019) is max{ 2/α_w, ((µ+L)/µ) (1/2 − 1/n + 1/(nα_w)) }, which is better than (11) with α_m = 1 (no compression for the model). When there is no compression for Δ_i, i.e., α_w = 1, the algorithm reduces to gradient descent, and the linear convergence factor is the same as that of gradient descent for strongly convex functions.

Remark 2. Although error compensation often improves the convergence empirically, in theory no compensation, i.e., η = 0, provides the best convergence rate. This is because we do not have much information about the error being compensated. Filling this gap will be an interesting future direction.

    4.2 The nonconvex case with R = 0

Theorem 2. Under Assumptions 1-3 and the additional assumption that each worker samples the gradient from the full dataset, we set α and β according to (5). By choosing

γ ≤ min{ (−1 + sqrt(1 + 48L²β²(C_q^m+1)²/C_q^m)) / (12Lβ(C_q^m+1)),   1/(6Lβ(1+cα)(C_q^m+1)) },

we have

(β/2 − 3(1+cα)(C_q^m+1)Lβ²γ) (1/K) Σ_{k=1}^K E‖∇f(x̂^k)‖² ≤ (Λ^1 − Λ^{K+1})/(γK) + 3(C_q^m+1)(1+ncα)Lβ²σ²γ/n,   (12)

where

Λ^k = (C_q^m+1)Lβ²‖q^{k−1}‖² + f(x̂^k) − f* + 3c(C_q^m+1)Lβ²γ² (1/n) Σ_{i=1}^n E‖h_i^k‖².   (13)

Corollary 2. Let α = 1/(2(C_q+1)), β = 1/(C_q^m+1), and c = 4C_q(C_q+1)/n; then 1 + ncα is a fixed constant. If γ = 1/(12L(1+cα)(1+√(K/n))), then when K is relatively large, we have

(1/K) Σ_{k=1}^K E‖∇f(x̂^k)‖² ≲ 1/K + 1/√(Kn).   (14)

Remark 3. The dominant term in (14) is O(1/√(Kn)), which implies that the average sample complexity of each worker node is O(1/(nε²)) to achieve an ε-accurate solution. It shows that, the same as DoubleSqueeze in Tang et al. (2019), DORE is able to achieve linear speedup. Furthermore, this convergence result is the same as that of P-SGD without compression. Note that DoubleSqueeze has an extra term (1/K)^{2/3}, and its convergence requires bounded variance of the compression operator.
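A one-line rearrangement of the dominant term (our own check) gives the stated per-worker sample complexity:

```latex
% Requiring the dominant term of (14) to be at most \epsilon:
\[
\frac{1}{\sqrt{Kn}} \le \epsilon
\quad\Longleftrightarrow\quad
K \ge \frac{1}{n\epsilon^{2}},
\]
% i.e. O(1/(n\epsilon^2)) iterations (and stochastic gradient samples) per worker.
```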

    5 Experiment

In this section, we validate the theoretical results and demonstrate the superior performance of DORE. Our experimental results demonstrate that (1) DORE achieves a convergence speed similar to full-precision SGD and state-of-the-art quantized SGD baselines, and (2) its iteration time is much smaller than that of most existing algorithms, supporting the superior communication efficiency of DORE.

To make a fair comparison, we choose the same Bernoulli ∞-norm quantization as described in Section 3, and the quantization block size is 256 for all experiments unless explicitly stated otherwise, because the ∞-norm quantization is unbiased and commonly used. The parameters α, β, and η for DORE are chosen to be 0.1, 1, and 1, respectively.
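For reference, a minimal sketch of how such blockwise quantization could be assembled (the splitting detail and function names are our own; `quantize` would be the ∞-norm quantizer sketched in Section 3):

```python
import numpy as np

def blockwise(quantize, x, block=256):
    """Apply an unbiased quantizer independently to consecutive blocks of x,
    with block size 256 as in the experiments."""
    n_blocks = max(1, int(np.ceil(len(x) / block)))
    return np.concatenate([quantize(chunk) for chunk in np.array_split(x, n_blocks)])
```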

The baselines we choose to compare with include SGD, QSGD (Alistarh et al., 2017), MEM-SGD (Stich et al., 2018), DIANA (Mishchenko et al., 2019), and DoubleSqueeze and DoubleSqueeze (topk) (Tang et al., 2019). SGD is the vanilla SGD without any compression, and QSGD quantizes the gradient directly. MEM-SGD is QSGD with error compensation. DIANA, which only compresses and transmits the gradient difference, is a special case of the proposed DORE. DoubleSqueeze quantizes both the gradient on the workers and the averaged gradient on the server with error compensation. Although DoubleSqueeze is claimed to work well with both biased and unbiased compression, in our experiments it converges much slower and suffers a loss of accuracy with unbiased compression. Thus, we also compare with DoubleSqueeze using the top-k compression as presented in Tang et al. (2019).

    5.1 Strongly convex

To verify the convergence for strongly convex and smooth objective functions, we conduct an experiment on a linear regression problem: f(x) = ‖Ax − b‖² + λ‖x‖². The data matrix A ∈ R^{1200×500} and the optimal solution x* ∈ R^{500} are randomly synthesized. Then we generate the prediction b by sampling from a Gaussian distribution whose mean is Ax*. The rows of the data matrix A are allocated evenly to 20 worker nodes. To better verify the linear convergence to the O(σ) neighborhood around the optimal solution, we take the full gradient on each node for all algorithms to exclude the effect of the gradient variance (σ = 0).
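A sketch of this synthetic setup under the stated dimensions (the seed, noise scale, regularization weight λ, and the way the regularizer is split across workers are assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1200, 500))          # data matrix A in R^{1200 x 500}
x_star = rng.standard_normal(500)             # randomly synthesized optimal solution
b = A @ x_star + rng.standard_normal(1200)    # b drawn from a Gaussian with mean Ax*
lam = 0.1                                     # regularization weight (value assumed)

# Rows of A allocated evenly to 20 worker nodes; each worker uses its full local gradient.
A_loc, b_loc = np.array_split(A, 20), np.array_split(b, 20)

def local_grad(i, x):
    """Full gradient of ||A_i x - b_i||^2 + (lam/20)||x||^2 on worker i."""
    return 2 * A_loc[i].T @ (A_loc[i] @ x - b_loc[i]) + 2 * (lam / 20) * x
```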

As shown in Figure 3, with the full gradient and a constant learning rate, DORE converges linearly, the same as SGD and DIANA, while QSGD, MEM-SGD, DoubleSqueeze, and DoubleSqueeze (topk) converge only to a neighborhood of the optimal point. This is because these algorithms assume the bounded gradient and their convergence errors depend on that bound. Although they converge to the optimal solution using a diminishing step size, their convergence rates will be much slower.


Algorithm      | Compression | Compression Assumption | Linear Rate | Nonconvex Rate
QSGD           | Grad        | 2-norm quantization    | N/A         | 1/K + B
DIANA          | Grad        | p-norm quantization    | ✓           | 1/√(Kn) + 1/K
DoubleSqueeze  | Grad+Model  | Bounded variance       | N/A         | 1/√(Kn) + 1/K^{2/3} + 1/K
DORE           | Grad+Model  | Assumption 1           | ✓           | 1/√(Kn) + 1/K

Table 1: A comparison between related algorithms. DORE is able to converge linearly to the O(σ) neighborhood of the optimal point, like full-precision SGD and DIANA, in the strongly convex case while achieving much better communication efficiency. DORE also admits linear speedup in the nonconvex case, like DoubleSqueeze, but DORE does not require the assumptions of bounded compression error or bounded gradient.

[Figure 2 plot: per-iteration time in seconds versus bandwidth (1s/Mbit).]

Figure 2: Per-iteration time cost on Resnet18 for SGD, QSGD, and DORE. It is tested in a shared cluster environment connected by a Gigabit Ethernet interface. DORE speeds up the training process significantly by mitigating the communication bottleneck.


In addition, we also validate that the norms of the gradient and model residuals decrease exponentially, which explains the linear convergence behavior of DORE. For more details, please refer to Appendix A.1.

    5.2 Nonconvex

To verify the convergence in the nonconvex case, we test the proposed DORE with two classical deep neural networks on two representative datasets, i.e., LeNet (Lecun et al., 1998) on MNIST and Resnet18 (He et al., 2016) on CIFAR10. In the experiment, we use 1 parameter server and 10 workers, each of which is equipped with an NVIDIA Tesla K80 GPU. The batch size for each worker node is 256. We use 0.1 and 0.01 as the initial learning rates for LeNet and Resnet18, and decrease them by a factor of 0.1 after every 25 and 100 epochs, respectively. All parameter settings are the same for all algorithms.

Figures 4 and 5 show the training loss and test loss per epoch during the training of LeNet on the MNIST dataset and Resnet18 on the CIFAR10 dataset. The results indicate that in the nonconvex case, even with both compressed gradient and model information, DORE can still achieve a convergence speed similar to full-precision SGD and other quantized SGD variants. DORE achieves much better convergence speed than DoubleSqueeze using the same compression method, and converges similarly to DoubleSqueeze with top-k compression as presented in (Tang et al., 2019). We also validate via a parameter sensitivity study in Appendix A.3 that DORE performs consistently well under different parameter settings such as compression block size, α, β, and η.

    5.3 Communication efficiency

In terms of communication cost, DORE enjoys the benefit of extremely efficient communication. As one example, under the same setting as the Resnet18 experiment described in the previous section, we test the time cost per iteration for SGD, QSGD, and DORE under varied network bandwidth. We did not test MEM-SGD, DIANA, and DoubleSqueeze because MEM-SGD and DIANA have a similar time cost as QSGD, while DoubleSqueeze has a similar time cost as DORE. The result shown in Figure 2 indicates that as the bandwidth becomes worse, with both gradient and model compression, the advantage of DORE becomes more remarkable compared to the baselines that do not apply compression for model synchronization. In Appendix A.2, we also demonstrate the communication efficiency in terms of communication bits and running time, which clearly suggests the benefit of the proposed algorithm.

    6 Conclusion

Communication cost is a severe bottleneck for distributed training of modern large-scale machine learning models. Extensive works have compressed the gradient information to be transferred during the training process, but model compression is rather limited due to its intrinsic difficulty. In this paper, we proposed the Double Residual Compression SGD, named DORE, to compress both gradient and model communication, which mitigates this bottleneck prominently. The theoretical analyses suggest a good convergence rate for DORE under weak assumptions. Furthermore, DORE is able to reduce 95% of the communication cost while maintaining a similar convergence rate and model accuracy compared with full-precision SGD.


[Figure 3 plots: ‖x^k − x*‖² versus iteration; (a) learning rate = 0.05, (b) learning rate = 0.025.]

Figure 3: Linear regression on synthetic data. When the learning rate is 0.05, DoubleSqueeze diverges. In both cases, DORE, SGD, and DIANA converge linearly to the optimal point, while QSGD, MEM-SGD, DoubleSqueeze, and DoubleSqueeze (topk) only converge to the neighborhood even when the full gradient is available.

[Figure 4 plots: training loss and test loss versus epoch; (a) training loss, (b) test loss.]

Figure 4: LeNet trained on MNIST. DORE converges similarly to most baselines. It outperforms DoubleSqueeze using the same compression method and has performance similar to DoubleSqueeze (topk).

[Figure 5 plots: training loss and test loss versus epoch; (a) training loss, (b) test loss.]

Figure 5: Resnet18 trained on CIFAR10. DORE achieves convergence and accuracy similar to most baselines. DoubleSqueeze converges more slowly and suffers from a higher loss, but it works well with topk compression.



    7 Acknowledgement

Xiaorui Liu and Jiliang Tang are supported by the National Science Foundation (NSF) under grant numbers IIS-1714741, IIS-1715940, IIS-1845081, IIS-1907704, IIS-1928278 and CNS-1815636, and a grant from the Criteo Faculty Research Award. Yao Li and Ming Yan are supported by the National Science Foundation (NSF) under grant number DMS-1621798.

    References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI'16, pages 265–283.

Aji, A. F. and Heafield, K. (2017). Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440–445, Copenhagen, Denmark. Association for Computational Linguistics.

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. (2017). QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720.

Bernstein, J., Wang, Y., Azizzadenesheli, K., and Anandkumar, A. (2018). SIGNSGD: Compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 559–568.

Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Lechevallier, Y. and Saporta, G., editors, Proceedings of COMPSTAT'2010, pages 177–186, Heidelberg. Physica-Verlag HD.

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. (2015). MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274.

Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. (2012). Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pages 1223–1231, USA.

Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203.

Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., and Richtárik, P. (2019). SGD: General analysis and improved rates. arXiv preprint arXiv:1901.09401.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Horváth, S., Kovalev, D., Mishchenko, K., Stich, S., and Richtárik, P. (2019). Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115.

Jordan, M. I., Lee, J. D., and Yang, Y. (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668–681.

Karimireddy, S. P., Rebjock, Q., Stich, S. U., and Jaggi, M. (2019). Error feedback fixes SignSGD and other gradient compression schemes. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3252–3261. PMLR.

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. (2014). Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI'14, pages 583–598, Berkeley, CA, USA. USENIX Association.

Lian, X., Huang, Y., Li, Y., and Liu, J. (2015). Asynchronous parallel stochastic gradient for nonconvex optimization. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 2737–2745. Curran Associates, Inc.

Mishchenko, K., Gorbunov, E., Takáč, M., and Richtárik, P. (2019). Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269.

Nguyen, L., Nguyen, P. H., van Dijk, M., Richtárik, P., Scheinberg, K., and Takáč, M. (2018). SGD and Hogwild! Convergence without the bounded gradients assumption. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3750–3758, Stockholmsmässan, Stockholm, Sweden. PMLR.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. (2014). 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014.

Stich, S. U., Cordonnier, J.-B., and Jaggi, M. (2018). Sparsified SGD with memory. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 4452–4463, USA. Curran Associates Inc.

Strom, N. (2015). Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH, pages 1488–1492.

Tang, H., Gan, S., Zhang, C., Zhang, T., and Liu, J. (2018). Communication compression for decentralized training. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 7663–7673.

Tang, H., Yu, C., Lian, X., Zhang, T., and Liu, J. (2019). DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 6155–6165.

Wang, J., Kolar, M., Srebro, N., and Zhang, T. (2017). Efficient distributed learning with sparsity. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3636–3645, International Convention Centre, Sydney, Australia. PMLR.

Wangni, J., Wang, J., Liu, J., and Zhang, T. (2018). Gradient sparsification for communication-efficient distributed optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 1306–1316, USA. Curran Associates Inc.

Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. (2017). TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519.

Wu, J., Huang, W., Huang, J., and Zhang, T. (2018). Error compensated quantized SGD and its applications to large-scale distributed optimization. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5325–5333.

You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J., and Keutzer, K. (2018). ImageNet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, pages 1:1–1:10, New York, NY, USA. ACM.

Zinkevich, M., Weimer, M., Li, L., and Smola, A. J. (2010). Parallelized stochastic gradient descent. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A., editors, Advances in Neural Information Processing Systems 23, pages 2595–2603. Curran Associates, Inc.