
FedSel: Federated SGD under Local Differential Privacy with Top-k Dimension Selection

Ruixuan Liu1, Yang Cao2, Masatoshi Yoshikawa2, Hong Chen1(B)

1 Renmin University of China, Beijing, China. {ruixuan.liu, chong}@ruc.edu.cn

2 Kyoto University, Kyoto, Japan. {yang, yoshikawa}@i.kyoto-u.ac.jp

Abstract. As massive data are produced from small gadgets, federated learning on mobile devices has become an emerging trend. In the federated setting, Stochastic Gradient Descent (SGD) has been widely used in federated learning for various machine learning models. To prevent privacy leakages from gradients that are calculated on users' sensitive data, local differential privacy (LDP) has been considered as a privacy guarantee in federated SGD recently. However, the existing solutions have a dimension dependency problem: the injected noise is substantially proportional to the dimension d. In this work, we propose a two-stage framework FedSel for federated SGD under LDP to relieve this problem. Our key idea is that not all dimensions are equally important so that we privately select Top-k dimensions according to their contributions in each iteration of federated SGD. Specifically, we propose three private dimension selection mechanisms and adapt the gradient accumulation technique to stabilize the learning process with noisy updates. We also theoretically analyze privacy, accuracy and time complexity of FedSel, which outperforms the state-of-the-art solutions. Experiments on real-world and synthetic datasets verify the effectiveness and efficiency of our framework.

Keywords: Local Differential Privacy · Federated Learning.

1 Introduction

Nowadays, massive peripheral gadgets, such as mobile phones and wearable devices, produce an enormous volume of personal data, which boosts the process of Federated Learning (FL). In the cardinal FL setting, Stochastic Gradient Descent (SGD) has been widely used [1–4] for various machine learning models, such as Logistic Regression, Support Vector Machine and Neural Networks. The server iteratively updates a global model for E epochs based on gradients of the objective function, which are collected from batches of m clients' d-dimensional local updates. Nevertheless, users' data are still under threats of privacy attacks [5–8] if the raw gradients are transmitted to an untrusted server.


Table 1. Overview Comparison (the number of epochs E, the dimension of a gradient vector d, the compressed dimension q, the privacy budget for training ε, the number of participants in one aggregation m. See detailed analyses in Sects. 3.1 and 4.2).

LDP-SGD solution          | upper bound of noise       | lower bound of batch size
flat solution [12–14]     | O(E√(d log d) / (ε√m))     | Ω(E² d log d / ε²)
compressed solution [9]   | O(E√(q log q) / (ε√m))     | Ω(E² q log q / ε²)
our solution              | O(E√(log d) / (ε√(md)))    | Ω(E² log d / (d ε²))

Local differential privacy (LDP) provides a rigorous guarantee for perturbing users' sensitive data before sending them to an untrusted server. Many LDP mechanisms have been proposed for different computational tasks or data types, such as matrix factorization [9], key-value data [10,11] and multidimensional data [12–16].

However, applying LDP to federated SGD faces a nontrivial challenge when the dimension d is large. First, although some LDP mechanisms [12–14] proposed for multidimensional data are shown to be applicable to SGD (i.e., the flat solution in Table 1), the injected noise is substantially proportional to the dimension d. Besides, in order to obtain an acceptable accuracy, the required batch size of clients (i.e., m) depends linearly on d. As the clients in federated learning have full autonomy over their local data and can decide when and how to join the training process, a large required batch size impedes the model's applicability in practice. Second, a recent work [9] (i.e., the compressed solution in Table 1) attempts to solve this problem by reducing the dimension from d to q with random projection [17], a well-studied dimension reduction technique. However, as this method randomly discards some dimensions, it introduces a large recovery error which may damage the learning performance.

Our idea is that not all dimensions are equally "important", so we privately select Top-k dimensions according to their contributions (i.e., the absolute values of their gradients) in each iteration of SGD. A simple method for private Top-1 selection is to employ the exponential mechanism [18], which returns a private dimension with probability proportional to its absolute gradient value. However, the challenge is how to design Top-k selection mechanisms under LDP for federated SGD with better selection strategies. Besides the absence of such private Top-k selection mechanisms, another challenge is that discarding delayed gradients causes convergence issues, especially in our private setting where extra noise is injected. Although we can accumulate delayed gradients with momentum [19], it still requires additional design to stabilize the learning process with noisy updates.

Contributions: In this work, we take the first attempt to mitigate the dimension dependency problem in federated learning with LDP. We design, implement and evaluate a two-stage ε-LDP framework for federated SGD. As shown in Table 1, compared to the state-of-the-art techniques, our framework relieves the effect of the number of dimensions d on the injected noise and the required batch size (i.e., the number of clients in each iteration; the lower, the better). Our contributions are summarized below.


– First, we propose a two-stage framework FedSel with Dimension Selection and Value Perturbation. With our privacy analysis in Sect. 3.2, this framework satisfies ε-LDP for any client's local vector. Our theoretical utility analysis in Sect. 3.3 shows that it significantly reduces the dimensional dependency of the LDP estimation error from O(√d) to O(1/√d). We also enhance the framework to avoid loss of accuracy by accumulating delayed gradients. Intuitively, delayed gradients can improve the empirical performance and fix the convergence issues. In order to further stabilize the learning in the private setting with injected noise, we modify an existing accumulation technique [19]. Our analysis and experiments validate that this modification reduces the variance of noisy updates. (Section 3)

– Second, we instantiate the Dimension Selection stage with three mechanisms, which are general and independent of the second stage of value perturbation. The privacy guarantee for the selection is provided. Besides, we show the gains in utility and computation cost of extending from the Top-1 to the Top-k case with analysis and experiments. (Section 4)

– Finally, we perform extensive experiments on synthetic and real-world datasets to evaluate the proposed framework and the private Top-k dimension selection mechanisms. We also implement a hyper-parameters-free strategy that automatically allocates the privacy budget between the two stages with better utility. Significant improvements in test accuracy are shown compared with the existing solutions [9, 12–14]. (Section 5)

The remainder of this paper is organized as follows. Section 2 presents the technical background. Section 3 illustrates the two-stage privatized framework with analyses. Section 4 proposes the Exponential Mechanism (EXP) for Top-1 selection and extends to the Top-k case with the Perturbed Encoding Mechanism (PE) and the Perturbed Sampling Mechanism (PS). Section 5 provides results on both synthetic and real-world datasets and a hyper-parameters-free strategy. Section 6 gives an overview of related works. Section 7 concludes the paper.

2 Preliminaries

2.1 Federated SGD

Suppose a learning task defines the objective loss function L(w; x, y) on example (x, y) with parameters w ∈ R^d. The goal of learning is to find the empirical risk minimizer w* = arg min_w (1/N) Σ_{i=1}^N L(w; x_i, y_i) over N clients' data. For a single iteration, a batch of m clients updates local models in parallel with the distributed global parameters. Then they transmit their local model updates to the server, which averages the gradients and updates the global model as: w_t ← w_{t−1} − α · (1/m) · Σ_{i=1}^m ∇L(w_{t−1}; x_i, y_i). Without loss of generality, we describe our framework in the classic setting with one local update per round.
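For concreteness, the following is a minimal non-private sketch of one such round in Python/NumPy. The logistic-loss gradient and the toy client data are placeholder choices for illustration only, not the models or datasets evaluated in this paper.

```python
import numpy as np

def local_gradient(w, x, y):
    # Placeholder: gradient of the logistic loss log(1 + exp(-y * w.x)) on one example,
    # with label y in {-1, +1}. Any differentiable loss L(w; x, y) could be used here.
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

def federated_sgd_round(w, clients, alpha):
    """One non-private round: each of the m clients computes a local gradient
    on its own example and the server averages them to update w."""
    grads = [local_gradient(w, x, y) for (x, y) in clients]
    return w - alpha * np.mean(grads, axis=0)

# Toy usage: d = 5 dimensions, m = 3 clients with random data.
rng = np.random.default_rng(0)
d, m = 5, 3
w = np.zeros(d)
clients = [(rng.normal(size=d), rng.choice([-1.0, 1.0])) for _ in range(m)]
w = federated_sgd_round(w, clients, alpha=0.1)
```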


2.2 Local Differential Privacy

Local differential privacy (LDP) is proposed for collecting sensitive data through local perturbations without any assumption of a trusted server. M is a randomized algorithm that takes a vector v as input and outputs a perturbed vector v*. ε-LDP is defined on M with a privacy budget ε as follows.

Definition 1 (Local Differential Privacy [20]). A randomized algorithm M satisfies ε-LDP if and only if the following is true for any two possible inputs v, v′ ∈ V and any output v*: Pr[M(v) = v*] ≤ e^ε · Pr[M(v′) = v*].
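As a concrete instance of Definition 1, the sketch below implements the classic randomized response mechanism [31] on a single bit; reporting the true bit with probability e^ε/(e^ε + 1) bounds the ratio of output probabilities for any two inputs by e^ε.

```python
import math
import random

def randomized_response(bit, eps):
    """Perturb one bit under eps-LDP: report the true bit with probability
    e^eps / (e^eps + 1), otherwise report the flipped bit."""
    p_keep = math.exp(eps) / (math.exp(eps) + 1.0)
    return bit if random.random() < p_keep else 1 - bit

# For any inputs v, v' and output v*:
# Pr[M(v) = v*] / Pr[M(v') = v*] <= p_keep / (1 - p_keep) = e^eps.
noisy_bit = randomized_response(1, eps=1.0)
```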

2.3 Problem Definition

This paper studies the problem of federated SGD with LDP. Note that in the practical non-private setting, the original gradient g_t ← ∇L(w_{t−1}; x, y) can be sparsified [21] or quantized [22] before transmission. For generality, we do not limit the form of the local gradient and use v_i to denote the local gradient calculated from client u_i's record (x_i, y_i) and the global parameters w_{t−1}.

Suppose the global model iterates for E epochs with a learning rate α and a total privacy budget ε for each client. For a single epoch, clients are partitioned into batches of size m. The privatization mechanism M then perturbs the m local updates before they are aggregated by the untrusted server for one iteration. The global model is updated as: w_t ← w_{t−1} − α · (1/m) · Σ_{i=1}^m M(v_i). We aim to propose an ε-LDP framework with a specialized mechanism M for private federated training against the untrusted server. Moreover, we attempt to mitigate the dimension dependency problem for higher accuracy.

3 Two-stage LDP Framework: FedSel

In this section, we propose a two-stage framework FedSel with dimension selection and value perturbation, as shown in Fig. 1. The framework and its differences from existing works are presented in Sect. 3.1. We prove the privacy guarantee in Sect. 3.2 and analyze the stability improvement in Sect. 3.3.

3.1 Overview

We now illustrate the proposed framework in Algorithm 1 and compare it with the flat solutions [12–14] and the compressed solution [9]. In our framework, the server first initializes the ratio µ ∈ [0, 1] for privacy budget allocation and starts the iteration. The procedure on a local device is: (i) the current gradient g_t is accumulated with previously delayed gradients r_{t−1} (line 15); (ii) an important dimension index is privately selected by Dimension Selection (line 16); (iii) the value of the selected dimension plus its momentum (line 17) is perturbed by Value Perturbation (line 18). The accumulation in step (i) and the momentum in step (iii) derive from the work of Sun et al. [19], which compresses local gradients with little loss of accuracy and low memory cost.


[Fig. 1. Two-stage LDP framework. The user pulls the global parameters, calculates gradients with local data, updates the local accumulated vector, selects Top-k dimensions privately (Dimension Selection), perturbs the selected value (Value Perturbation), and pushes the noisy vector; the server averages the gradients and updates the global parameters, with a hyper-parameters-free budget allocation.]

Algorithm 1 Two-stage LDP framework of federated learning
Input: E, N, m, µ, α, ε for the server; ε1, ε2, η, k for the N clients

GlobalUpdate:
1:  initialize t = 1, w_0, r_{0,i} ← 0^d for i ∈ [N]
2:  µ ← HyperParametersFree(m, ε, d)
3:  initialize ε′ = ε/E, ε1 = µ · ε′, ε2 = ε′ − ε1
4:  for each epoch 1, …, E do
5:    for each sampled batch of size m do
6:      initialize G as an empty set
7:      for each client do
8:        s* ← LocalUpdate(w_{t−1}, ε1, ε2, η, k)
9:        add s* to G
10:     s̄ ← (1/m) · Σ_{s*∈G} s*   # aggregation
11:     w_t ← w_{t−1} − α · s̄, t = t + 1
12: return the final global model w

LocalUpdate(w_{t−1}, ε1, ε2, η, k):
13: initialize s* ← 0^d
14: g_t ← ∇L(w_{t−1}; x, y)
15: r_t ← r_{t−1} + g_t   # adapted accumulation
16: j ← SelectOracle(r_t, k, ε1)
17: s_j ← r_{t,j} + η · r_{t−1,j}
18: s*_j ← ValuePerturbation(s_j, ε2), r_{t,j} ← 0
19: return s*

We adapt the accumulation in our framework to stabilize the iterations under noise. Step (ii) is designed to alleviate the dimensional bottleneck; we analyze the accuracy improvement in Sect. 4.2.
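The client-side procedure (lines 13–19 of Algorithm 1) can be sketched as follows. SelectOracle and ValuePerturbation stand for any ε1-LDP selection mechanism of Sect. 4 and any ε2-LDP mean-estimation mechanism; the bodies shown here (a uniformly random index and Laplace noise on the clipped value) are placeholders for illustration, not the mechanisms proposed in this paper.

```python
import numpy as np

def select_oracle(r, k, eps1, rng):
    # Placeholder for an eps1-LDP Top-k dimension selection (EXP/PE/PS in Sect. 4).
    return rng.integers(len(r))

def value_perturbation(v, eps2, rng):
    # Placeholder for an eps2-LDP value perturbation on the clipped value in [-1, 1]
    # (Laplace noise with sensitivity 2, for illustration only).
    v = np.clip(v, -1.0, 1.0)
    return v + rng.laplace(scale=2.0 / eps2)

def local_update(w, x, y, r, r_prev, grad_fn, k, eps1, eps2, eta, rng):
    """Lines 13-19 of Algorithm 1: accumulate the gradient, privately select one
    dimension, perturb its (momentum-augmented) value, and reset that residual."""
    s_star = np.zeros_like(w)
    g = grad_fn(w, x, y)                               # line 14
    r_new = r + g                                      # line 15: accumulate g, not alpha*g
    j = select_oracle(r_new, k, eps1, rng)             # line 16
    s_j = r_new[j] + eta * r_prev[j]                   # line 17
    s_star[j] = value_perturbation(s_j, eps2, rng)     # line 18
    r_new[j] = 0.0
    return s_star, r_new                               # line 19
```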


Comparison with existing works. The most significant difference from existing works is the way of deciding which dimension to upload. In the flat solution, each client randomly samples and perturbs c out of the d dimensions. The perturbed value is enlarged by d/c for an unbiased mean estimation, so the injected noise is also amplified. In the compressed solution, the d dimensions of the gradient g are reduced to q dimensions by multiplying g with a public random matrix Φ_{d×q} drawn from the Gaussian distribution with mean 0 and variance 1/q. Then an index is randomly sampled from the q dimensions and its value is perturbed. The estimated mean of the compressed vectors is then approximately recovered with the pseudo-inverse of Φ. Even if the random projection has the strict isometry property, it ignores the meaning of the gradient magnitudes and introduces recovery error. Our framework differs from the compressed solution because we utilize the magnitude information of the gradient values.

3.2 Privacy Guarantee

In Algorithm 1, both the local accumulated vector r and the vector with momentum s are true local vectors that are directly calculated from the private data, so we abuse notation and use v ∈ R^d to denote either of them in the following statement. As previously defined by Shokri et al. [23], there are two sources of information that we intend to protect for the local vector: (i) how a dimension is selected and (ii) the value of the selected dimension. Let z ∈ {0, 1}^d indicate the ground-truth Top-k status. We decompose the protection goal into the following privacy definitions for dimension selection and value perturbation. When combining the two stages, we provide an ε-LDP guarantee with Theorem 1.

Definition 2 (LDP Dimension Selection). A randomized dimension selection algorithm M1 satisfies ε1-LDP if and only if for any two status vectors z, z′ ∈ {0, 1}^d and any output j ∈ [d]: Pr[M1(z) = j] ≤ e^ε1 · Pr[M1(z′) = j].

Definition 3 (LDP Value Perturbation). A randomized value perturbation algorithm M2 satisfies ε2-LDP if and only if for any two numeric values v_j, v′_j and any output v*_j: Pr[M2(v_j) = v*_j] ≤ e^ε2 · Pr[M2(v′_j) = v*_j].

Theorem 1. For a true local vector v, if the two-stage mechanism M first selects a dimension index j with M1 and then perturbs v_j with M2, under ε1-LDP and ε2-LDP respectively, then M satisfies (ε1 + ε2)-LDP.

Proof. Theorem 1 holds if, for any two possible local vectors v, v′, the conditional probabilities for M to give the same output v* satisfy Pr[v*|v] ≤ e^(ε1+ε2) · Pr[v*|v′]. Let z, z′ denote the selection status vectors of v, v′; since the status vector is derived deterministically from the local vector, Pr[z|v] = Pr[z′|v′] = 1. Considering the case where v, v′ give the same output v*, we end the proof with:

Pr[v*|v] / Pr[v*|v′] = (Pr[z|v] · Pr[j|z] · Pr[v*_j|v_j]) / (Pr[z′|v′] · Pr[j|z′] · Pr[v*_j|v′_j]) ≤ (Pr[z|v] / Pr[z′|v′]) · e^ε1 · e^ε2 = e^(ε1+ε2).


3.3 Variance Analysis for Accumulation

In the existing non-private gradient accumulation [19,21,24,25], local vectors are accumulated as r_t = r_{t−1} + α·g_t. We adapt it to r_t = r_{t−1} + g_t and scale by the learning rate at the server (line 11). This aims to reduce the variance of the noisy local updates and stabilize the iterations. Suppose the j-th dimension of a gradient is selected after T rounds of delay, and s_{i,j} denotes the accumulated gradient value. In each iteration of Algorithm 1, the update to the global parameter w_j from user u_i is Δ^T w_{i,j} = (α/m) · s*_{i,j}. If α·g is accumulated instead [19], the update is Δ′^T w_{i,j} = (1/m) · (α·s_{i,j})*.

For LDP mechanisms of mean estimation, the input range is [−1, 1], so all inputs to a perturbation mechanism must be clipped to this range. When the value of the selected dimension lies inside the input range, we easily have Var[Δ^T w_{i,j}] = Var[Δ′^T w_{i,j}]. It should be noted, however, that the selected gradient value has a significantly larger magnitude. When the initial values in both cases are clipped to the input domain, we define the clipped input as ξ. Then we have Var[Δ^T w_{i,j}] = (α²/m²) · Var[ξ*] and Var[Δ′^T w_{i,j}] = (1/m²) · Var[ξ*]. As the learning rate α is typically smaller than 1 (e.g., 0.1), we obtain Theorem 2. Therefore, this slight adaptation of the scaling procedure reduces the variance of a single client's noisy update. The smaller variance leads to more accurate learning performance, as validated in the experiments (Fig. 3(a)).

Theorem 2. Accumulating g instead of α·g for dimension j of user u_i yields an update on the global model's parameter with smaller variance: Var[Δ^T w_{i,j}] ≤ Var[Δ′^T w_{i,j}], where α ∈ (0, 1).
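A quick simulation illustrates Theorem 2. The sketch below assumes, for illustration, the one-bit unbiased mechanism of Duchi et al. [14] as the value perturbation on [−1, 1]; any unbiased bounded-output mechanism behaves analogously.

```python
import numpy as np

def duchi_perturb(x, eps, rng):
    """One-bit unbiased mechanism in the style of Duchi et al. [14] for x in [-1, 1]."""
    c = (np.exp(eps) + 1.0) / (np.exp(eps) - 1.0)
    p = 0.5 + x * (np.exp(eps) - 1.0) / (2.0 * (np.exp(eps) + 1.0))
    return c if rng.random() < p else -c

rng = np.random.default_rng(0)
alpha, m, eps2 = 0.1, 100, 1.0
s = 30.0          # accumulated value whose magnitude exceeds the input range
n = 100_000

# Ours: accumulate g, clip, perturb, then scale by alpha/m at the server.
ours = np.array([(alpha / m) * duchi_perturb(np.clip(s, -1, 1), eps2, rng) for _ in range(n)])
# Variant: accumulate alpha*g, clip, perturb, then scale by 1/m only.
variant = np.array([(1.0 / m) * duchi_perturb(np.clip(alpha * s, -1, 1), eps2, rng) for _ in range(n)])

# Both cases clip to 1 here, so Var[ours] is roughly alpha^2 * Var[variant] (Theorem 2).
print(ours.var(), variant.var())
```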

4 Private Dimension Selection

To instantiate the ε1-LDP dimension selection in line 16 of Algorithm 1, we design LDP mechanisms from the Top-1 to the Top-k case. Then, we analyze the accuracy and time complexity of the proposed LDP mechanisms.

4.1 Selection Mechanisms

Exponential Mechanism (EXP): The exponential mechanism [18] is a natural building block for selecting private non-numeric data in centralized differential privacy. We modify this classic method to meet our selection requirement. A client's accumulated vector r is first sorted in ascending order of absolute value. As a special case for Top-1 selection, the status vector in EXP is defined as z ∈ {1, …, d}^d instead of a binary vector. Intuitively, the dimension with the largest absolute value should be output with the highest probability. For the j-th dimension, we assign its selection status as its rank z_j. Thus, the index j ∈ [d] is sampled unevenly with probability exp(ε1·z_j/(d−1)) / Σ_{i=1}^d exp(ε1·z_i/(d−1)).

The privacy guarantee is shown in Lemma 1.

Lemma 1. EXP selection is ε1-locally differentially private.


Proof. Let z, z′ ∈ {1, …, d}^d be any two possible ranking vectors and let j denote any output index of EXP. Since z and z′ are both permutations of {1, …, d}, their normalizing sums are equal, and the following conditional probability ratio ends the proof:

Pr[j|z] / Pr[j|z′] = [exp(ε1·z_j/(d−1)) / Σ_{i=1}^d exp(ε1·z_i/(d−1))] / [exp(ε1·z′_j/(d−1)) / Σ_{i=1}^d exp(ε1·z′_i/(d−1))] ≤ exp(ε1·d/(d−1)) / exp(ε1·1/(d−1)) = e^(ε1(d−1)/(d−1)) = e^ε1.
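A NumPy sketch of EXP selection (see also Algorithm 2 in the appendix); the rank assignment and sampling follow the probability defined above, and the rng argument is just an explicit random generator.

```python
import numpy as np

def exp_select(r, eps1, rng):
    """Sample one dimension index with probability proportional to
    exp(eps1 * rank_j / (d - 1)), where ranks are ascending in |r_j|."""
    d = len(r)
    order = np.argsort(np.abs(r))          # ascending by absolute value
    ranks = np.empty(d)
    ranks[order] = np.arange(1, d + 1)     # largest |r_j| gets rank d
    scores = np.exp(eps1 * ranks / (d - 1))
    probs = scores / scores.sum()
    return rng.choice(d, p=probs)

rng = np.random.default_rng(0)
j = exp_select(np.array([0.1, -2.0, 0.5, 0.05]), eps1=1.0, rng=rng)
```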

In order to fit various learning tasks, k should be tunable. Thus, we propose two private Top-k methods to better control the selection.

Perturbed Encoding Mechanism (PE): The sorting step on the vector of absolute values |r| in PE is the same as in EXP. In addition, a binary Top-k status vector z is derived. We then perturb z with randomized response: each status bit retains its value with a large probability p and flips with a small probability 1 − p, where p = e^ε1 / (e^ε1 + 1) for the privacy guarantee. Let ẑ denote the privatized status vector. Since indices of non-zero elements in ẑ are more likely to be Top-k dimensions, we gather these elements as the sample set S. If S is empty, the client uploads ⊥ and the server regards it as receiving a zero vector s* = 0^d. Otherwise, the client randomly samples one dimension index from S. The privacy guarantee is shown in Lemma 2.

Lemma 2. PE selection is ε1-locally differentially private.

Proof. Let ẑ denote the perturbed status vector. The expected sparsity of ẑ is:

l = E[||ẑ||_0] = Σ_{j=1}^d Pr[ẑ_j = 1 | z_j] = Σ_{j=1}^k Pr[ẑ_j = 1 | z_j = 1] + Σ_{j=k+1}^d Pr[ẑ_j = 1 | z_j = 0] = k·p + (d − k)·(1 − p), where p = e^ε1 / (e^ε1 + 1).

Given any two possible selection status vectors z, z′ ∈ {0, 1}^d with k non-zero elements, there are two cases for the output j: (i) if the sample set S is not empty, j ∈ S; (ii) if S is empty, j = ⊥. For the first case, we have:

Pr[j|z] / Pr[j|z′] = ((1/l) · Pr[ẑ_j = 1|z]) / ((1/l) · Pr[ẑ_j = 1|z′]) ≤ Pr[ẑ_j = 1|z_j = 1] / Pr[ẑ_j = 1|z′_j = 0] = (e^ε1/(e^ε1 + 1)) / (1/(e^ε1 + 1)) = e^ε1.

For the second case, we end the proof with the conditional probability:

Pr[j = ⊥|z] / Pr[j = ⊥|z′] = ((1 − p)^k · p^(d−k)) / ((1 − p)^k · p^(d−k)) = 1 ≤ e^ε1 (for ε1 ≥ 0).
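A NumPy sketch of PE selection (cf. Algorithm 3): each Top-k status bit is kept with probability p = e^ε1/(e^ε1 + 1) and flipped otherwise, and one index is drawn uniformly from the perturbed index set.

```python
import numpy as np

def pe_select(r, k, eps1, rng):
    """Perturbed Encoding: flip each Top-k status bit with randomized response,
    then sample uniformly from the indices whose perturbed bit is 1."""
    d = len(r)
    p = np.exp(eps1) / (np.exp(eps1) + 1.0)
    z = np.zeros(d, dtype=bool)
    z[np.argsort(np.abs(r))[-k:]] = True          # true Top-k status
    keep = rng.random(d) < p
    z_hat = np.where(keep, z, ~z)                 # keep bit w.p. p, flip w.p. 1-p
    candidates = np.flatnonzero(z_hat)
    if candidates.size == 0:
        return None                               # upload "bottom"; server uses a zero vector
    return rng.choice(candidates)

rng = np.random.default_rng(0)
j = pe_select(np.array([0.1, -2.0, 0.5, 0.05, 1.3]), k=2, eps1=1.0, rng=rng)
```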

Perturbed Sampling Mechanism (PS): PS selection has the same criterion as PE, regarding the Top-k dimensions as important. Intuitively, with a higher probability p we sample an index j uniformly from the Top-k index set {j ∈ [d] | z_j = 1}, and otherwise, with the smaller probability 1 − p, we sample an index j uniformly from the non-Top-k dimensions {j ∈ [d] | z_j = 0}. To satisfy the privacy guarantee in Lemma 3, we define p = e^ε1·k / (d − k + e^ε1·k).

Lemma 3. PS selection is ε1-locally differentially private.


Proof. Given any two possible Top-k status vectors z, z′ and any output index j ∈ {1, …, d}, the following conditional probability ratio ends the proof:

Pr[j|z] / Pr[j|z′] ≤ Pr[j|z_j = 1] / Pr[j|z′_j = 0] = (p · (1/k)) / ((1 − p) · (1/(d − k))) = e^ε1, where p = e^ε1·k / (d − k + e^ε1·k).
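A NumPy sketch of PS selection (cf. Algorithm 4): a single biased coin with p = e^ε1·k/(d − k + e^ε1·k) decides whether to sample uniformly from the Top-k set or from its complement.

```python
import numpy as np

def ps_select(r, k, eps1, rng):
    """Perturbed Sampling: with probability p draw uniformly from the Top-k
    indices, otherwise draw uniformly from the non-Top-k indices."""
    d = len(r)
    p = np.exp(eps1) * k / (d - k + np.exp(eps1) * k)
    order = np.argsort(np.abs(r))
    topk, rest = order[-k:], order[:-k]
    return rng.choice(topk) if rng.random() < p else rng.choice(rest)

rng = np.random.default_rng(0)
j = ps_select(np.array([0.1, -2.0, 0.5, 0.05, 1.3]), k=2, eps1=1.0, rng=rng)
```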

4.2 Analyses of Accuracy and Time Complexity

We analyze the accuracy improvement of the proposed two-stage framework by evaluating the error bound in Theorem 3, which holds independently of the value perturbation algorithm in the second stage. The amount of noise in the average vector is O(√(log d) / (ε2√(md))) and the acceptable batch size is m = Ω(log d / (d·ε2²)), which does not increase linearly with d. Since ε2 = ε′(1 − µ) and ε′ = ε/E, it is evident that Ω(E² log d / (d·ε²)) < Ω(E² d log d / ε²) for µ < 0.5. Hence, we can improve the accuracy while keeping the same privacy guarantee. It also suggests allocating only a small portion of the privacy budget to the dimension selection.

Theorem 3. For any j ∈ [d], let s̄ = (1/m) · Σ_{s*∈G} s*, and let X = (1/m) · Σ_{s*∈G} s denote the mean of the true sparse vectors without perturbation. With probability at least 1 − β,

max_{j∈[1,d]} |s̄_j − X_j| = O(√(log(d/β)) / (ε2√(md))).

Compared with the non-private setting, LDP brings extra computation costs for local devices. For EXP, the different utility scores and their summation can be initialized offline. Sorting a d-dimensional vector consumes O(d log d). Mapping all d dimensions to their corresponding utility values consumes O(d²) and sampling requires O(d). Thus, each local device has an extra time cost of O(d log d + d² + d) = O(d²) for EXP. With a similar analysis, the extra time cost for PE is O(d log d + d + l) = O(d log d), which is less than the O(d²) time complexity of EXP. Since PS avoids perturbing each dimension, it has a slightly lower computation cost than PE, at the same magnitude of O(d log d). We validate this conclusion in the experiments.

5 Experiments

In this section, we assess the performance of our proposed framework on real-world and synthetic datasets. We first evaluate our selection methods without the second stage of value perturbation. To evaluate the improvement from reducing the injected noise, we compare the learning performance with the state-of-the-art works [9, 13, 14] and validate the theoretical conclusions in Sects. 3.3 and 4.2. Moreover, we implement a hyper-parameters-free strategy that automatically initializes the budget allocation ratio µ to fit scenarios with dynamic population sizes.

5.1 Experimental Setup

Datasets and Benchmarks. For convenience in controlling data sparsity, the synthetic data is generated with an existing procedure [26] with two parameters C1 ∈ {0.01, 0.1}, C2 ∈ {0.6, 0.9} and dimensions 100 (syn-L), 300 (syn-H) over 60,000 and 100,000 records.


Table 2. Frameworks and Variants for Comparisons.

solution         | abbreviation   | sparsification     | perturbation  | budget
non-private      | NP             | full/random/topk   | –             | ∞
flat [13,14]     | PM/HM/Duchi    | random sampling    | ε′            | ε′
compressed [9]   | -RP            | random projection  | ε′            | ε′
two-stage (ours) | EXP/PE/PS-     | ε1 = µ·ε′          | ε2 = ε′ − ε1  | ε′

The real-world benchmark datasets include BANK, ADULT and KDD99, which have 32, 123 and 114 dimensions over 45,211, 48,842 and 70,000 records, respectively. We follow a typical machine learning pre-processing procedure, one-hot-encoding every categorical attribute. We test on l2-regularized Logistic Regression and Support Vector Machine.

Choices of Parameters. Since we observe that models on the above datasets can converge within 100 rounds of iterations, we set the batch size for one global model iteration to m = 0.01·N. We report the average accuracy or misclassification rate of 10 runs of 5-fold cross-validation for one epoch unless otherwise stated. The discounting factor η and learning rate α are the same in each case for a fair comparison. We set k = 0.1d, µ = 0.1, λ = 0.0001 by default.

Comparisons with Competitors. The proposed FedSel framework is prefixed with the selection mechanisms EXP/PE/PS. We compare it with non-private baselines (NP) of three different transmission methods (-/RS/K): full gradient, randomly sampled dimension, and Top-k (k = 1) selection. We also compare with the flat solution [13,14] and the compressed solution [9], which use random sampling and random projection respectively before perturbing the value. Due to limited space and the variety of baselines, we mainly demonstrate comparisons with the optimal competitor PM and show comparisons with other competitors in Fig. 2(i) and Fig. 2(h). Abbreviations of the different variants are listed in Table 2.

5.2 Evaluation of the Dimension Selection

Convergence and Accuracy. We compare EXP/PE/PS with the non-private baselines NP, NP-RS and NP-K by visualizing the misclassification rate and test accuracy in Fig. 2(a) to Fig. 2(d). Note that this comparison only focuses on selection, without the second stage of value perturbation. For NP-RS, we enlarge the value randomly sampled by each user for an unbiased estimation. This follows the same principle as in the flat and compressed solutions.

The advantage of NP-K over NP-RS supports our essential motivation that Top-k is a more effective and accurate way to reduce the transmitted dimension. On the 100-dimensional dataset, NP-K even approaches the full-gradient-uploading baseline NP in Fig. 2(a). Besides, EXP/PE/PS converge more stably and faster than NP-RS in Fig. 2(a) and Fig. 2(b) with ε1 = 4. With a larger privacy budget, EXP/PE/PS tend to approach the same accuracy as NP-K (k = 1) in Fig. 2(c) and Fig. 2(d). Moreover, even a small budget in dimension selection helps to increase the learning accuracy.


[Fig. 2. Improvements of the two-stage framework and dimension selection. Panels (a)–(d): convergence and test accuracy of NP, NP_RS, NP_K, EXP, PE and PS; (e): per-client time (s) vs. dimensions; (f)–(g): test accuracy vs. ε′, with (g) on logistic regression, KDD, E = 2; (h): misclassification rate vs. ε′; (i): misclassification rate vs. ε′ on logistic regression, syn-H (c1 = 0.01, c2 = 0.9), comparing PM, PM-RP, EXP-PM, PE-PM and PS-PM.]

We can also observe that the PE and PS methods, which intuitively intend to select from the Top-k list, perform better than EXP. Thus, our extension from Top-1 to Top-k is necessary.

Validation of Complexity Analysis. Here we analyze the time consumption for each client per transmission in Fig. 2(e). The time is measured by iterating over synthetic datasets with dimensions varying from 10 to 10,000. We observe that the selection stage indeed incurs extra computation cost. Consistent with the analysis in Sect. 4.2, PS has the lowest computation cost.

5.3 Effectiveness of the Two-stage Framework

Comparison with Existing Solutions. We compare the learning performance of EXP/PE/PS-PM with the flat solutions PM/HM/Duchi [13,14] and the compressed solution [9]. Remark that we set control groups with the postfix "-C" in Fig. 2(f) and Fig. 2(g) by allocating ε1 for dimension selection and ε′ for value perturbation.


Table 3. Gains and Losses on Accuracy for Private Selection with ε = 2 (%).

dataset        | model    | EXP-gain | EXP-loss | PE-gain | PE-loss | PS-gain | PS-loss
syn-L-0.01-0.9 | logistic | 8.6074   | 0.3517   | 5.410   | 1.192   | 5.975   | 0.4970
syn-L-0.01-0.9 | SVM      | 7.1950   | 2.1593   | 3.7704  | 0.8533  | 5.065   | 2.0816
BANK           | logistic | 2.4197   | -0.157   | 3.2338  | 0.0464  | 2.5525  | 0.1463
BANK           | SVM      | 4.3823   | 0.4436   | 3.4369  | 0.2530  | 4.0244  | 0.0164
KDD            | logistic | 2.0471   | 0.5091   | 2.5148  | 0.2322  | 2.0171  | 0.3428
KDD            | SVM      | 1.85629  | -0.1625  | 2.2168  | 0.2288  | 1.8291  | 0.4465
ADULT          | logistic | 5.5745   | 0.2935   | 5.6445  | 1.3096  | 6.0535  | 0.8091
ADULT          | SVM      | 5.5361   | 0.1949   | 5.6057  | 0.9550  | 5.1442  | 0.3852

To further elucidate the trade-off between what we gain and what we lose, we quantify the benefit of private selection by the following accuracy gaps, reported in Table 3:

gain = acc(EXP/PE/PS-PM-C) − acc(PM),
loss = acc(EXP/PE/PS-PM-C) − acc(EXP/PE/PS-PM).

It is evident that what we gain is much larger than what we lose. This is because, when there is already enough privacy budget for value perturbation, adding more budget to value perturbation is not as beneficial as allocating the surplus budget to private selection.

From Fig. 2(f) and Fig. 2(g), we observe that the proposed two-stage solutions have higher test accuracy than the optimal private baseline PM on both models and all datasets. Given enough privacy budget on relatively low-dimensional datasets, the proposed solutions even outperform the non-private baseline in Fig. 2(f). The key to this success is the inherent randomness in SGD: the slight stochasticity introduced for privacy preservation mitigates overfitting. In Fig. 2(g), with results over two epochs, our adapted local accumulation with momentum helps to reduce the impact of noisy gradients compared with the private baseline PM, especially when ε′ is small.

In Fig. 2(h), we show a comparison with flat solutions using other perturbation methods that have the same optimal error bound as PM. Note that the value perturbation algorithm for each compared pair in Fig. 2(h) is the same; the pairs differ only in the selection stage. It is evident that our framework has a lower misclassification rate and standard deviation over 50 runs. Therefore, we conclude that this improvement is independent of the value perturbation method. We omit the comparison of EXP and PE, which leads to the same conclusion.

In Fig. 2(i), we compare with the compressed solution [9] using the same compression ratio of 0.1. The original work applies random projection to the gradients of Matrix Factorization and uses another value perturbation method [12]. Since the dimension reduction idea is independent of value perturbation, for a fair comparison we adopt the random projection idea and use the same method PM [13] to perturb values when implementing the competitor PM-RP. Our result shows that, even though the error bound is reduced by random projection, the recovery error ruins the accuracy, while our dimension selection still works.


[Fig. 3. Improvement of the adapted accumulation and impacts of µ. Panels (a)–(b): misclassification rate vs. ε′; (c): misclassification rate vs. batch size for SVM on KDD (hp-free, E = 2, ε′ = 1.5), comparing NP, PM, EXP-PM (hp-free), PE-PM (hp-free) and PS-PM (hp-free).]

Effectiveness of the Adapted Accumulation. We validate the stability improvement, consistent with our analysis in Sect. 3.3, in Fig. 3(a). We set the variant that accumulates α·g as the competitor (EXP/PE/PS-PM-O). The result shows that accumulating α·g directly does not improve the learning performance, because what we gain by selection is offset by the larger fluctuation. Thus, our adaptation is necessary for better compatibility with the private context.

Impacts of µ. As µ ∈ [0, 1] controls the privacy budget allocation in our two-stage framework, we evaluate µ in Fig. 3(b) on the ADULT dataset; the same trends are found on other datasets. From Fig. 3(b), we observe that our framework with a small µ works no worse than the flat competitor, even when the total budget is small. In addition, µ = 0.1 gives optimal learning accuracy, while µ = 0.8 leads to worse performance with a higher misclassification rate and a significant standard deviation, as expected. Thus, µ is an essential parameter that controls the trade-off between what we gain and what we lose.

The divergence at a large µ reminds us that a safe maximum threshold θ is required to guarantee that this trade-off always benefits the model's accuracy. It is much easier to tune θ than µ, since θ can be tested on synthetic datasets independently of the total privacy budget and batch size. At the beginning of training, given the privacy budget per epoch ε′, the model's dimension d and the available batch size m, our principle is to first allocate at least ε2 = Ω(√(d log d / m)) to the second stage. The extra privacy budget can then be allocated to dimension selection to improve the accuracy as a bonus.
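One possible implementation of this allocation principle is sketched below. The constant c, the exact form of the minimum ε2 and the way θ caps the selection share are assumptions made for illustration; they are not the tuned procedure used in our experiments.

```python
import math

def allocate_budget(eps_per_epoch, d, m, theta=0.2, c=1.0):
    """Choose mu so that the value-perturbation stage keeps at least
    eps2_min = c * sqrt(d * log(d) / m), capping the selection share at theta.
    (c and eps2_min are illustrative assumptions.)"""
    eps2_min = c * math.sqrt(d * math.log(d) / m)
    surplus = max(0.0, eps_per_epoch - eps2_min)
    mu = min(theta, surplus / eps_per_epoch)   # fraction of the budget for selection
    eps1 = mu * eps_per_epoch
    eps2 = eps_per_epoch - eps1
    return mu, eps1, eps2

mu, eps1, eps2 = allocate_budget(eps_per_epoch=1.5, d=114, m=500)
```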

In Fig. 3(c), we empirically set a safe threshold θ = 0.2 and validate the effectiveness of the proposed hyper-parameters-free strategy. The different batch sizes on the x-axis simulate the dynamic number of participants when initiating a distributed learning task in practice. To compare test accuracy fairly across different batch sizes, we stop the learning process after the same number of iterations. We observe that the proposed solution with the three dimension selection methods under this strategy significantly improves the model's accuracy. Besides, the proposed hyper-parameters-free strategy works steadily for dynamic batch sizes, as the deviations across all 50 runs are smaller than those of the private baseline.


6 Related Works

How we select dimensions and accumulate gradients is based on the well-studied gradient sparsification in the non-private setting. Strom [27] proposes to upload only dimensions whose absolute values are larger than a threshold. Instead of a fixed threshold, Aji et al. [25] introduce Gradient Dropping (GD), which sorts the absolute values first and drops a fixed portion of the gradients. Wangni et al. [26] drop gradients by trading off between sparsity and variance. Alistarh et al. [21] show the theoretical convergence of Top-k selection. However, even if the gradient update is compressed, there still exist privacy risks because it is calculated directly from local data.

If local gradients are transmitted in the clear, the untrusted server threatens clients' privacy. Nasr et al. [6] present membership inference attacks that only observe uploads or control the view of each participant. Wang et al. [8] propose a reconstruction attack in which the server can recover a specific user's local data with Generative Adversarial Networks (GANs). Security attacks [28] in FL are also an important topic, but we focus on privacy issues in this paper.

Cryptographic technologies face a bottleneck of heavy communication and computation costs. Bonawitz et al. [29] present an implementation of Secure Aggregation, which entails four rounds of interaction per iteration, and several of its costs grow quadratically with the number of users. As for differential privacy (DP) [18] in distributed SGD, Shokri et al. [23] propose asynchronous cooperative learning with privately selective SGD via the sparse vector technique. It may lose accuracy as it drops delayed gradients instead of accumulating them as our work does. Agarwal et al. [30] combine gradient quantization and differentially private mechanisms in the synchronous setting, but this requires a higher communication cost for a d-dimensional vector. It should be noted that the privacy definition in the above works differs from LDP: it provides plausible deniability for only a single gradient value, while LDP guarantees that the whole gradient vector is indistinguishable.

Many LDP techniques have been proposed for categorical or numeric values. The randomized response (RR) method [31] is the classic method to perturb binary variables. Kairouz et al. [32] introduce a family of extremal privatization mechanisms, k-RR, for categorical attributes. With LDP mechanisms for mean estimation, Duchi et al. [14] suggest that LDP leads to an effective degradation of the batch size. Wang et al. [13] show that, compared with Duchi et al.'s work, their mean estimation mechanisms with lower worst-case variance lead to a lower misclassification rate when applied in SGD. Considering that the required batch size depends linearly on the dimension, Bhowmick et al. [33] design LDP mechanisms against reconstruction attacks with a large privacy budget to get rid of the utility limitation of normal locally differentially private learning.

7 Conclusions

This paper proposes a two-stage LDP privatization framework FedSel for federated SGD.


As the first attempt to mitigate the dimension dependency problem of the injected noise, the key idea is to delay unimportant gradients. We further stabilize the global iterations by modifying the accumulation so that the noisy update has a smaller variance. The improvements of the proposed methods are theoretically analyzed and validated in experiments. The framework with the hyper-parameters-free strategy also outperforms the baselines over varying batch sizes. In future work, we plan to formalize the optimal trade-off between utility and accuracy and extend FedSel to a more general case.

Acknowledgements. This work is supported by the National Key Research and Development Program of China (No. 2018YFB1004401), National Natural Science Foundation of China (No. 61532021, 61772537, 61772536, 61702522), JSPS KAKENHI Grant No. 17H06099, 18H04093, 19K20269, and Microsoft Research Asia (CORE16).

References

1. McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B. A.: Communication-Efficient Learning of Deep Networks from Decentralized Data. In: Artificial Intelligence and Statistics, pp. 1273-1282. (2017)
2. Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., Van Overveldt, T.: Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046. (2019)
3. Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2), 1-19. (2019)
4. McMahan, H. B., Moore, E., Ramage, D., y Arcas, B. A.: Federated learning of deep networks using model averaging. arXiv preprint arXiv:1602.05629. (2016)
5. Zhu, L., Liu, Z., Han, S.: Deep leakage from gradients. In: NeurIPS, pp. 14747-14756. (2019)
6. Nasr, M., Shokri, R., Houmansadr, A.: Comprehensive Privacy Analysis of Deep Learning. In: IEEE SP. (2019)
7. Fredrikson, M., Jha, S., Ristenpart, T.: Model inversion attacks that exploit confidence information and basic countermeasures. In: SIGSAC CCS, pp. 1322-1333. (2015)
8. Wang, Z., Song, M., Zhang, Z., Song, Y., Wang, Q., Qi, H.: Beyond inferring class representatives: User-level privacy leakage from federated learning. In: IEEE INFOCOM, pp. 2512-2520. (2019)
9. Shin, H., Kim, S., Shin, J., Xiao, X.: Privacy enhanced matrix factorization for recommendation with local differential privacy. IEEE TKDE, 30(9), 1770-1782. (2018)
10. Gu, X., Li, M., Cheng, Y., Xiong, L., Cao, Y.: PCKV: Locally Differentially Private Correlated Key-Value Data Collection with Optimized Utility. In: USENIX Security Symposium. (2020)
11. Ye, Q., Hu, H., Meng, X., Zheng, H.: PrivKV: Key-value data collection with local differential privacy. In: IEEE SP, pp. 317-331. (2019)
12. Nguyên, T. T., Xiao, X., Yang, Y., Hui, S. C., Shin, H., Shin, J.: Collecting and analyzing data from smart device users with local differential privacy. arXiv preprint arXiv:1606.05053. (2016)

13. Wang, N., Xiao, X., Yang, Y., Zhao, J., Hui, S. C., Shin, H., Yu, G.: Collecting and analyzing multidimensional data with local differential privacy. In: IEEE ICDE, pp. 638-649. (2019)
14. Duchi, J. C., Jordan, M. I., Wainwright, M. J.: Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 113(521), 182-201. (2018)
15. Gu, X., Li, M., Cao, Y., Xiong, L.: Supporting both range queries and frequency estimation with local differential privacy. In: IEEE Conference on Communications and Network Security (CNS), pp. 124-132. (2019)
16. Gu, X., Li, M., Xiong, L., Cao, Y.: Providing input-discriminative protection for local differential privacy. In: IEEE ICDE. (2020)
17. Johnson, W. B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189-206), 1. (1984)
18. Dwork, C., Roth, A.: The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4), 211-407. (2014)
19. Sun, H., Shao, Y., Jiang, J., Cui, B., Lei, K., Xu, Y., Wang, J.: Sparse gradient compression for distributed SGD. In: DASFAA, pp. 139-155. Springer. (2019)
20. Duchi, J. C., Jordan, M. I., Wainwright, M. J.: Local privacy and statistical minimax rates. In: Annual Symposium on Foundations of Computer Science, pp. 429-438, IEEE. (2013)
21. Alistarh, D., Hoefler, T., Johansson, M., Konstantinov, N., Khirirat, S., Renggli, C.: The convergence of sparsified gradient methods. In: NeurIPS, pp. 5973-5983. (2018)
22. Alistarh, D., Hoefler, T., Johansson, M., Konstantinov, N., Khirirat, S., Renggli, C.: The convergence of sparsified gradient methods. In: NeurIPS, pp. 5973-5983. (2018)
23. Shokri, R., Shmatikov, V.: Privacy-preserving deep learning. In: SIGSAC CCS, pp. 1310-1321, ACM. (2015)
24. Lin, Y., Han, S., Mao, H., Wang, Y., Dally, W. J.: Deep gradient compression: Reducing the communication bandwidth for distributed training. In: ICLR. (2018)
25. Aji, A. F., Heafield, K.: Sparse communication for distributed gradient descent. In: EMNLP, pp. 440-445. (2017)
26. Wangni, J., Wang, J., Liu, J., Zhang, T.: Gradient sparsification for communication-efficient distributed optimization. In: NeurIPS, pp. 1299-1309. (2018)
27. Strom, N.: Scalable distributed DNN training using commodity GPU cloud computing. In: INTERSPEECH. (2015)
28. Fang, M., Cao, X., Jia, J., Gong, N. Z.: Local model poisoning attacks to Byzantine-robust federated learning. In: USENIX Security Symposium. (2020)
29. Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Seth, K.: Practical secure aggregation for privacy-preserving machine learning. In: SIGSAC CCS, pp. 1175-1191, ACM. (2017)
30. Agarwal, N., Suresh, A. T., Yu, F. X. X., Kumar, S., McMahan, B.: cpSGD: Communication-efficient and differentially-private distributed SGD. In: NeurIPS, pp. 7564-7575. (2018)
31. Warner, S. L.: Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309), 63-69. (1965)
32. Kairouz, P., Oh, S., Viswanath, P.: Extremal mechanisms for local differential privacy. In: NeurIPS, pp. 2879-2887. (2014)
33. Bhowmick, A., Duchi, J., Freudiger, J., Kapoor, G., Rogers, R.: Protection against reconstruction and its applications in private federated learning. arXiv preprint arXiv:1812.00984. (2018)


1.A Analysis of Accuracy

Proof (of Theorem 3). For any j ∈ [1, d], the estimated mean of the perturbed sparse vectors is s̄ = (1/m) · Σ_{s*∈G} s*. The mean of the sparse vectors without value perturbation is denoted as X = (1/m) · Σ_{s*∈G} s. For any i ∈ [1, m], the perturbation method is unbiased, so s*_{i,j} − s_{i,j} has zero mean. Given privacy budget ε2 for a value perturbation mechanism with the asymptotically optimal error bound, we have the upper bound |s*_{i,j} − s_{i,j}| = O(1/ε2). Then, by the Bernstein inequality, we have:

Pr[|s̄_j − X_j| ≥ λ] = Pr[|Σ_{i=1}^m (s*_{i,j} − s_{i,j})| ≥ mλ] ≤ 2 · exp(− (mλ)² / (2 Σ_{i=1}^m Var[s*_{i,j}] + (2/3) mλ · O(1/ε2))).   (1)

Let p_{i,j} denote the probability that dimension j of user u_i is selected by a selection mechanism. Let ζ denote the output value of the value perturbation algorithm, with E[ζ²] = O(1/ε2²). Thus, the variance is:

Var[s*_{i,j}] = E[(s*_{i,j})²] − E[s*_{i,j}]² = p_{i,j} · E[ζ²] − (s_{i,j})².   (2)

The probability p_{i,j} depends on whether j is in the Top-k list of u_i's local vector. For the selection mechanisms EXP, PE and PS, the upper bound of p_{i,j} is dominated by O(1/d). Applying (2) to (1), we have:

Pr[|s̄_j − X_j| ≥ λ] ≤ 2 · exp(− mλ² / (O(1/(dε2²)) + λ · O(1/ε2))).

By the union bound, there exists λ = O(√(log(d/β)) / (ε2√(md))) such that max_{j∈[1,d]} |s̄_j − X_j| < λ holds with probability at least 1 − β.

1.B Pseudocode

Algorithm 2 Exponential Private SelectOracle (EXP)
Require: r_t, ε1
1: initialize sum ← Σ_{j=1}^d exp(ε1·j/(d−1))
2: sort |r_t| in ascending order
3: for j ∈ [d] do
4:   z_j ← the rank of dimension j
5:   p_j ← exp(ε1·z_j/(d−1)) / sum
6: sample an index j ∈ [d] with probability p_j
7: return j


Algorithm 3 SelectOracle-Perturbed Encoding (PE)
Require: r_t, k, ε1
1: initialize z ← 0^d, p = e^ε1 / (e^ε1 + 1)
2: sort |r_t| and for j ∈ [d], set z_j ← 1 if |r_{t,j}| ∈ Top-k
3: for j ∈ [d] do
4:   if z_j = 1 then
5:     z_j ← 1 w.p. p; 0 w.p. 1 − p
6:   else
7:     z_j ← 1 w.p. 1 − p; 0 w.p. p
8: perturbed Top-k index list S = {j ∈ [d] | z_j = 1}
9: if S is empty then
10:   return ⊥
11: else
12:   sample a dimension index j uniformly from S
13:   return j

Algorithm 4 SelectOracle-Perturbed Sampling (PS)
Require: r_t, k, ε1
1: initialize z ← 0^d
2: sort |r_t| and for j ∈ [d], set z_j ← 1 if |r_{t,j}| ∈ Top-k
3: sample x uniformly at random from [0, 1]
4: if x < e^ε1·k / (d − k + e^ε1·k) then
5:   randomly sample an index j ∈ {j ∈ [d] | z_j = 1}
6: else
7:   randomly sample an index j ∈ {j ∈ [d] | z_j = 0}
8: return j