

Free-rider Attacks on Model Aggregation in Federated Learning

Yann Fraboni 1,2, Richard Vidal 2, Marco Lorenzi 1

1 Université Côte d’Azur, Inria Sophia Antipolis, Epione Research Group, France and 2 Accenture Labs, Sophia Antipolis, France

Abstract

Free-rider attacks against federated learning consist in dissimulating participation in the federated learning process with the goal of obtaining the final aggregated model without actually contributing any data. This kind of attack is critical in sensitive applications of federated learning, where data is scarce and the model has high commercial value. We introduce here the first theoretical and experimental analysis of free-rider attacks on federated learning schemes based on iterative parameter aggregation, such as FedAvg or FedProx, and provide formal guarantees for these attacks to converge to the aggregated model of the fair participants. We first show that a straightforward implementation of this attack can be achieved simply by not updating the local parameters during the iterative federated optimization. As this attack can be detected by adopting simple countermeasures at the server level, we subsequently study more complex disguising schemes based on stochastic updates of the free-rider parameters. We demonstrate the proposed strategies on a number of experimental scenarios, in both iid and non-iid settings. We conclude by providing recommendations to avoid free-rider attacks in real-world applications of federated learning, especially in sensitive domains where the security of data and models is critical.

1 Introduction

Federated learning is a training paradigm that has gained popularity in recent years, as it enables different clients to jointly learn a global model without sharing their respective data. It is particularly suited for machine learning applications in domains where data security is critical, such as

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).

healthcare [Brisimi et al., 2018, Silva et al., 2019]. The relevance of this approach is witnessed by current large-scale federated learning initiatives under development in the medical domain, for instance for learning predictive models of breast cancer^1, or for drug discovery and development^2.

Participation in this kind of research initiative is usually exclusive and typical of applications where data is scarce and unique in its kind. In these settings, aggregation results entail critical information beyond the data itself, since a model trained on exclusive datasets may have very high commercial or intellectual value. For this reason, providers may not be interested in sharing the model: the commercialization of machine learning products would rather imply the availability of the model as a service through web- or cloud-based APIs. This is due to the need to preserve the intellectual property of the model components, as well as to avoid potential information leakage, for example by limiting the maximum number of queries allowed to users [Carlini et al., 2019, Fredrikson et al., 2015, Ateniese et al., 2015].

This critical aspect can lead to the emergence of opportunistic behaviors in federated learning, where ill-intentioned clients may participate with the aim of obtaining the federated model without actually contributing any data during the training process. In particular, the attacker, or free-rider, aims at disguising its participation in federated learning while ensuring that the iterative training process ultimately converges to the wished target: the aggregated model of the fair participants. Free-riding attacks performed by ill-intentioned participants ultimately open federated learning initiatives to intellectual property loss and data privacy breaches, taking place for example in the form of model inversion [Fredrikson et al., 2014, Fredrikson et al., 2015].

^1 blogs.nvidia.com/blog/2020/04/15/federated-learning-mammogram-assessment/

^2 www.imi.europa.eu/projects-results/project-factsheets/melloddy

arXiv:2006.11901v5 [cs.LG] 22 Feb 2021

The study of the security and safety of federated learning is an active research domain, and several kinds of attacks are the matter of ongoing studies. For example, an attacker may interfere during the iterative federated learning procedure to degrade/modify model performances [Bhagoji et al., 2019, Li et al., 2016, Yin et al., 2018, Xie et al., 2019, Shen et al., 2016], or to retrieve information about other clients' data [Wang et al., 2019, Hitaj et al., 2017]. Since currently available defence methods such as [Fung et al., 2020, Bhagoji et al., 2019] are generally based on outlier detection mechanisms, they are not suitable to prevent free-riding, as this kind of attack is explicitly conceived to stay undetected while not perturbing the FL process. Free-riding may become a critical aspect of future machine learning applications, as federated learning is rapidly emerging as the standard training scheme in current cooperative learning initiatives. To the best of our knowledge, the only investigation is in a preliminary work [Lin et al., 2019] focusing on attack strategies operated on federated learning based on gradient aggregation. However, no theoretical guarantees are provided for the effectiveness of this kind of attack. Furthermore, this setup is impractical in many real-world applications, where federated training schemes based on model averaging are instead more common, due to the reduced data exchange across the network. FedAvg [McMahan et al., 2017] is the most representative framework of this kind, as it is based on the iterative averaging of the clients' model parameters, after updating each client model for a given number of training epochs at the local level. To improve the robustness of FedAvg in non-iid and heterogeneous learning scenarios, FedProx [Li et al., 2018] extends FedAvg by including a regularization term penalizing local departures of the clients' parameters from the global model.

The contribution of this work consists in the development of a theoretical framework for the study of free-rider attacks in federated learning schemes based on model averaging, such as in FedAvg and FedProx. The problem is here formalized via the reformulation of federated learning as a stochastic process describing the evolution of the aggregated parameters across iterations. To this end, we build upon previous works characterizing the evolution of model parameters in Stochastic Gradient Descent (SGD) as a continuous-time process [Mandt et al., 2017, Orvieto and Lucchi, 2018, Li et al., 2017, He et al., 2018]. A critical requirement for opportunistic free-rider attacks is to ensure the convergence of the training process to the wished target represented by the aggregated model of the fair clients. We show that the proposed framework allows us to derive explicit conditions guaranteeing the success of the attack. This is an important theoretical feature, as it is of primary interest for the attacker not to interfere with the learning process.

We first derive in Section 2.4 a basic free-riding strategy that guarantees the convergence of federated learning to the model of the fair participants. This strategy simply consists in returning at each iteration the received global parameters. As this behavior could easily be detected by the server, we build more complex strategies to disguise the free-rider contribution to the optimization process, based on opportune stochastic perturbations of the parameters. We demonstrate in Section 2.5 that this strategy does not alter the global model convergence, and in Section 3 we experimentally demonstrate our theory on a number of learning scenarios in both iid and non-iid settings. All proofs and additional material are provided in the Appendix.

2 Methods

Before introducing in Section 2.2 the core idea of free-rider attacks, we first recapitulate in Section 2.1 the general context of parameter aggregation in federated learning.

2.1 Federated learning through model aggregation: FedAvg and FedProx

In federated learning, we consider a set $I$ of participating clients respectively owning datasets $\mathcal{D}_i$ composed of $M_i$ samples. During optimization, it is generally assumed that the $D$ elements of the clients' parameter vector $\theta_i^t = (\theta_{i,0}^t, \theta_{i,1}^t, \dots, \theta_{i,D}^t)$ and the global parameters $\theta^t = (\theta_0^t, \theta_1^t, \dots, \theta_D^t)$ are aggregated independently at each iteration round $t$. Following this assumption, and for simplicity of notation, in what follows we restrict our analysis to a single parameter entry, generally denoted by $\theta_i^t$ and $\theta^t$ for clients and server respectively.

In this setting, to estimate a global model across clients, FedAvg [McMahan et al., 2017] is an iterative training strategy based on the aggregation of the local model parameters $\theta_i^t$. At each iteration step $t$, the server sends the current global model parameters $\theta^t$ to the clients. Each client updates the model by minimizing over $E$ epochs the local cost function $L(\theta_i^{t+1}, \mathcal{D}_i)$ initialized with $\theta^t$, and subsequently returns the updated local parameters $\theta_i^{t+1}$ to the server. The global model parameters $\theta^{t+1}$ at iteration step $t+1$ are then estimated as a weighted average:

$$\theta^{t+1} = \sum_{i \in I} \frac{M_i}{N} \theta_i^{t+1}, \qquad (1)$$

where $N = \sum_{i \in I} M_i$ represents the total number of samples across the distributed datasets. FedProx [Li et al., 2018] builds upon FedAvg by adding to the cost function an L2 regularization term penalizing the deviation of the local parameters $\theta_i^{t+1}$ from the global parameters $\theta^t$. The new cost function is $L_{Prox}(\theta_i^{t+1}, \mathcal{D}_i, \theta^t) = L(\theta_i^{t+1}, \mathcal{D}_i) + \frac{\mu}{2}\left\|\theta_i^{t+1} - \theta^t\right\|^2$, where $\mu$ is the hyperparameter monitoring the regularization by enforcing proximity between the local update $\theta_i^{t+1}$ and the reference model $\theta^t$.


Algorithm 1: Free-riding in federated learning

Input: learning rate $\lambda$, epochs $E$, initial model $\theta^0$, batch size $S$
$\tilde\theta^0 = \theta^0$;
for each round $t = 0, \dots, T-1$ do
    Send the global model $\tilde\theta^t$ to all the clients;
    for each fair client $j \in J$ do
        $\tilde\theta_j^{t+1} = \mathrm{ClientUpdate}(\tilde\theta^t, E, \lambda)$;
        Send $\tilde\theta_j^{t+1}$ to the server;
    for each free-rider $k \in K$ do
        if disguised free-rider then
            $\tilde\theta_k^{t+1} = \tilde\theta^t + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma_k^2)$;
        else
            $\tilde\theta_k^{t+1} = \tilde\theta^t$;
        Send $\tilde\theta_k^{t+1}$ to the server;
    $\tilde\theta^{t+1} = \sum_{j \in J} \frac{M_j}{N} \tilde\theta_j^{t+1} + \sum_{k \in K} \frac{M_k}{N} \tilde\theta_k^{t+1}$;
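The round structure of Algorithm 1 can be sketched as follows for the single-parameter view used in the paper. Here ClientUpdate is replaced by a hypothetical one-step gradient move toward a client-specific optimum `theta_star`; the learning rate, sample counts and noise level are illustrative assumptions, not values from the paper's experiments.

```python
import random

# Sketch of one round of Algorithm 1: fair clients send real updates,
# free-riders send the received global model, optionally disguised with
# Gaussian noise of standard deviation sigma_k.
def free_riding_round(theta, fair, riders, lr=0.5, disguised=True):
    """fair: list of (M_j, theta_star_j); riders: list of (M_k, sigma_k)."""
    N = sum(M for M, _ in fair) + sum(M for M, _ in riders)
    new_theta = 0.0
    for M_j, theta_star in fair:                  # fair clients train locally
        theta_j = theta - lr * (theta - theta_star)
        new_theta += M_j / N * theta_j
    for M_k, sigma_k in riders:                   # free-riders counterfeit updates
        eps = random.gauss(0.0, sigma_k) if disguised else 0.0
        new_theta += M_k / N * (theta + eps)
    return new_theta                              # server-side aggregation

random.seed(0)
theta = 0.0
for _ in range(200):  # the aggregate still drifts toward the fair optimum 1.0
    theta = free_riding_round(theta, fair=[(100, 1.0)], riders=[(20, 0.01)])
```

Despite the free-rider holding one sixth of the declared samples, the iteration converges to the fair clients' optimum, anticipating the convergence results of Sections 2.4 and 2.5.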

2.2 Formalizing Free-rider attacks

Aiming at obtaining the aggregated model of the fair clients, the strategy of a free-rider consists in participating in federated learning while dissimulating local updating through the sharing of opportune counterfeited parameters. The free-riding attacks investigated in this work are illustrated in Algorithm 1, and analysed in the following sections from both theoretical and experimental standpoints.

We denote by $J$ the set of fair clients, i.e. clients following the federated learning strategy of Section 2.1, and by $K$ the set of free-riders, i.e. malicious clients pretending to participate in the learning process, such that $I = J \cup K$ and $J \neq \emptyset$. We denote by $M_K$ the number of samples declared by the free-riders.

2.3 SGD perturbation of the fair clients' local model

To describe the clients' parameters observed during federated learning, we rely on the modeling of Stochastic Gradient Descent (SGD) as a continuous-time stochastic process [Mandt et al., 2017, Orvieto and Lucchi, 2018, Li et al., 2017, He et al., 2018].

For a client $j$, let us consider the following form for the loss function:

$$L_j(\theta_j) = \frac{1}{M_j} \sum_{n=1}^{M_j} l_{n,j}(\theta_j), \qquad (2)$$

where $M_j$ is the number of samples owned by the client, and $l_{n,j}$ is the contribution to the overall loss from a single observation $\{x_{n,j}; y_{n,j}\}$. The gradient of the loss function is defined as $g_j(\theta_j) \equiv \nabla L_j(\theta_j)$.

We represent SGD by considering a minibatch $S_{j,k}$, composed of a set of $S$ different indices drawn uniformly at random from the set $\{1, \dots, M_j\}$, each of them indexing a function $l_{n,j}(\theta_j)$, and where $k$ is the index of the minibatch. Based on $S_{j,k}$, we form a stochastic estimate of the loss,

$$L_{S_{j,k}}(\theta_j) = \frac{1}{S} \sum_{n \in S_{j,k}} l_{n,j}(\theta_j), \qquad (3)$$

where the corresponding stochastic gradient is defined as $g_{S_{j,k}}(\theta_j) \equiv \nabla L_{S_{j,k}}(\theta_j)$.

By observing that the stochastic gradient is a sum of $S$ independent and uniformly distributed samples, thanks to the central limit theorem, gradients at the client level can be modeled by a Gaussian distribution

$$g_{S_{j,k}}(\theta_j) \sim \mathcal{N}\!\left(g_j(\theta_j), \frac{1}{S}\sigma_j^2(\theta_j)\right), \qquad (4)$$

where $g_j(\theta_j) = \mathbb{E}_S\left[g_{S_{j,k}}(\theta_j)\right]$ is the full gradient of the loss function in equation (2) and $\sigma_j^2(\theta_j)$ is the variance associated with the loss function in equation (3).

SGD updates are expressed as:

$$\theta_j(u_j + 1) = \theta_j(u_j) - \lambda g_{S_{j,k}}(\theta_j(u_j)), \qquad (5)$$

where $u_j$ is the SGD iteration index and $\lambda$ is the learning rate set by the server.

By defining $\Delta\theta_j(u_j) = \theta_j(u_j+1) - \theta_j(u_j)$, we can rewrite the update process as

$$\Delta\theta_j(u_j) = -\lambda g_j(\theta_j(u_j)) + \frac{\lambda}{\sqrt{S}} \sigma_j(\theta_j) \Delta W_j, \qquad (6)$$

where $\Delta W_j \sim \mathcal{N}(0, 1)$. The resulting continuous-time model [Mandt et al., 2017, Orvieto and Lucchi, 2018, Li et al., 2017, He et al., 2018] is

$$d\theta_j = -\lambda g_j(\theta_j)\, du_j + \frac{\lambda}{\sqrt{S}} \sigma_j(\theta_j)\, dW_j, \qquad (7)$$

where $W_j$ is a continuous-time Wiener process.
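The discrete update (6) can be simulated directly as an Euler-Maruyama step of the SDE (7). The following sketch assumes the linearized gradient introduced below, $g_j(\theta_j) = r_j(\theta_j - \theta_j^*)$; all constants are illustrative assumptions.

```python
import math
import random

# Sketch of the discrete SGD update in equation (6), i.e. an Euler-Maruyama
# step of the SDE (7), under a linearized gradient g_j = r_j (theta - theta*).
def sgd_as_sde(theta0, theta_star, r_j=1.0, sigma_j=0.2, lr=0.05, S=16, steps=500):
    theta = theta0
    for _ in range(steps):
        dW = random.gauss(0.0, 1.0)       # Delta W_j ~ N(0, 1)
        theta += -lr * r_j * (theta - theta_star) \
                 + lr / math.sqrt(S) * sigma_j * dW
    return theta

random.seed(1)
final = sgd_as_sde(theta0=5.0, theta_star=0.0)  # decays toward theta_star
```

With these settings the deterministic part contracts by a factor $(1 - \lambda r_j)$ per step, so the trajectory settles into a small stationary band around `theta_star`, consistent with the Ornstein-Uhlenbeck behavior discussed next.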

Similarly as in [Mandt et al., 2017], we assume that $\sigma_j(\theta_j)$ is approximately constant with respect to $\theta_j$ for the client's stochastic gradient updates between $t$ and $t+1$, and we will therefore denote $\sigma_j(\theta_j) = \sigma_j^t$. Following [Mandt et al., 2017], we consider a local quadratic approximation for the client's loss, leading to a linear form for the gradient, $g_j(\theta_j) \simeq r_j[\theta_j - \theta_j^*]$, where $r_j \in \mathbb{R}^+$ depends on the approximation of the cost function around the local minimum $\theta_j^*$. This assumption enables rewriting equation (7) as an Ornstein-Uhlenbeck process [Uhlenbeck and Ornstein, 1930]. Starting from the initial condition represented by $\theta^t$, the global model received at iteration $t$, we characterize the local updating of the parameters through equation (7), and we follow the evolution up to the time $\frac{E M_j}{S}$, where $E$ is the number of epochs and $M_j$ is the number of samples owned by the client. Assuming that $M_j$ is a multiple of $S$, the number of samples per minibatch,


the quantity $\frac{E M_j}{S}$ represents the total number of SGD steps run by the client. The updated model $\theta_j^{t+1}$ uploaded to the server therefore takes the form:

$$\theta_j^{t+1} = \underbrace{e^{-\lambda r_j \frac{E M_j}{S}} [\theta^t - \theta_j^*] + \theta_j^*}_{\hat\theta_j^{t+1}} + \frac{\lambda}{\sqrt{S}} \int_{u=0}^{\frac{E M_j}{S}} e^{-\lambda r_j \left(\frac{E M_j}{S} - u\right)} \sigma_j^t\, dW_u. \qquad (8)$$

We note that the relative number of SGD updates for the fair clients, $\frac{E M_j}{S}$, influences the parameter $\eta_j = e^{-\lambda r_j \frac{E M_j}{S}}$, which becomes negligible for large values of $E$.

The variance introduced by SGD can be rewritten as

$$\mathrm{Var}\left[\theta_j^{t+1} \mid \theta^t\right] = \underbrace{\frac{\lambda}{S} {\sigma_j^t}^2 \frac{1}{2 r_j} \left[1 - e^{-2\lambda r_j \frac{E M_j}{S}}\right]}_{{\rho_j^t}^2}, \qquad (9)$$

where we can see that the higher $\frac{E M_j}{S}$, the lower the overall SGD noise. The noise depends on the local loss function through $r_j$, on the server parameters (number of epochs $E$, learning rate $\lambda$, and number of samples per minibatch $S$), and on the clients' data-specific parameters (SGD variance ${\sigma_j^t}^2$).

Equation (8) shows that the clients' parameters observed during federated learning can be expressed as $\theta_j^t = \hat\theta_j^t + \rho_j^t \zeta_{j,t}$, where, given $\theta^t$, $\hat\theta_j^t$ is a deterministic component corresponding to the model obtained with $\frac{E M_j}{S}$ steps of gradient descent, and $\zeta_{j,t}$ is a delta-correlated Gaussian white noise. We consider in what follows a constant local noise variance $\sigma_j^2$ (this assumption will be relaxed in Section 2.5.3 to consider instead time-varying noise functions $\rho_j^t$).

Based on this formalism, in the next section we study a basic free-rider strategy simply consisting in returning at each iteration the received global parameters. We call this type of attack plain free-riding.

2.4 Plain free-riding

We denote by $\tilde\theta$ and $\tilde\theta_j$ respectively the global and local model parameters obtained in the presence of free-riders. The plain free-rider returns the same model parameters as the received ones, i.e. $\forall k \in K$, $\tilde\theta_k^{t+1} = \tilde\theta^t$. In this setting, the server aggregation process (1) can be rewritten as:

$$\tilde\theta^{t+1} = \sum_{j \in J} \frac{M_j}{N} \tilde\theta_j^{t+1} + \frac{M_K}{N} \tilde\theta^t, \qquad (10)$$

where $\tilde\theta^t$ is the global model and the $\tilde\theta_j^t$ are the fair clients' local models uploaded to the server under free-riding.

2.4.1 Free-riders' perturbation of the fair clients' local model

In this section, we investigate the effect of the free-riders on the local optimization performed by the fair clients at every server iteration. The participation of the free-riders in federated learning implies that the processes of the fair clients are perturbed by the attacks throughout training. In particular, the initial conditions of the local optimization problems are modified according to the perturbed aggregation of equation (10).

Going back to the assumptions of Section 2.3, the initial condition $\tilde\theta^t$ of the local optimization now includes the aggregated model of the fair clients and a perturbation coming from the free-riders. Thus, equation (8) in the presence of free-riding can be written as

$$\tilde\theta_j^{t+1} = \eta_j [\tilde\theta^t - \theta_j^*] + \theta_j^* + \frac{\lambda}{\sqrt{S}} \int_{u=0}^{\frac{E M_j}{S}} e^{-\lambda r_j \left(\frac{E M_j}{S} - u\right)} \tilde\sigma_j^t\, dW_u, \qquad (11)$$

where $\tilde\sigma_j^t = \sigma_j^t(\tilde\theta_j)$ is the SGD variance under free-riding. We consider that $\tilde\sigma_j^t = \sigma_j^t = \sigma_j$. This assumption will be relaxed in Section 2.5.3 to consider instead time-varying noise functions. With analogous considerations to those made in Section 2.3, the updated parameters take the form:

$$\tilde\theta_j^{t+1} = \eta_j [\tilde\theta^t - \theta_j^*] + \theta_j^* + \rho_j \tilde\zeta_{j,t}, \qquad (12)$$

where $\tilde\zeta_{j,t}$ is a delta-correlated Gaussian white noise. Similarly as for federated learning, $\mathbb{E}\left[\tilde\theta_j^{t+1} \mid \tilde\theta^t\right] = \eta_j [\tilde\theta^t - \theta_j^*] + \theta_j^*$, and $\mathrm{Var}\left[\tilde\theta_j^{t+1} \mid \tilde\theta^t\right] = \rho_j^2$.

We want to express the global optimization process $\tilde\theta^t$ under free-riding as a perturbation of the equivalent stochastic process $\theta^t$ obtained with fair clients only. Theorem 1 provides a recurrent form for the difference between these two processes.

Theorem 1. Under the assumptions of Sections 2.3 and 2.4 for the local optimization processes resulting from federated learning with respectively only fair clients and with free-riders, the difference between the aggregation processes of formulas (1) and (10) takes the following recurrent form:

$$\tilde\theta^t - \theta^t = \sum_{i=0}^{t-1} \left(\epsilon + \frac{M_K}{N}\right)^{t-i-1} f(\theta^i) + \sum_{i=0}^{t-1} \left(\epsilon + \frac{M_K}{N}\right)^{t-i-1} (\tilde\nu_i - \nu_i), \qquad (13)$$

with $f(\theta^t) = \frac{M_K}{N} \left[\theta^t - \sum_{j \in J} \frac{M_j}{N - M_K}\left[\eta_j(\theta^t - \theta_j^*) + \theta_j^*\right]\right]$, $\epsilon = \sum_{j \in J} \frac{M_j}{N} \eta_j$, $\nu_t = \sum_{j \in J} \frac{M_j}{N - M_K} \rho_j \zeta_{j,t}$, and $\tilde\nu_t = \sum_{j \in J} \frac{M_j}{N} \rho_j \tilde\zeta_{j,t}$.


We note that in the special case with no free-riders (i.e. $M_K = 0$), the quantity $\tilde\theta^t - \theta^t$ depends on the second term of equation (13) only, and represents the comparison between two different realizations of the stochastic process associated with the federated global model. Theorem 1 shows that in this case the variance across optimization results is non-zero, and depends on the intrinsic variability of the local optimization processes quantified by the variable $\nu_t$. We also note that in the presence of free-riders the convergence to the model obtained with fair clients depends on the relative sample size declared by the free-riders, $\frac{M_K}{N}$.

2.4.2 Convergence analysis of plain free-riding

Based on the relationship between the learning processes established in Theorem 1, we are now able to prove that federated learning with plain free-riders, defined in equation (10), converges in expectation to the aggregated model of the fair clients of equation (1).

Theorem 2 (Plain free-riding). Assuming FedAvg converges in expectation, and under the assumptions of Theorem 1, the following asymptotic properties hold:

$$\mathbb{E}\left[\tilde\theta^t - \theta^t\right] \xrightarrow{t \to +\infty} 0, \qquad (14)$$

$$\mathrm{Var}\left[\tilde\theta^t - \theta^t\right] \xrightarrow{t \to +\infty} \frac{\left[\frac{1}{N^2} + \frac{1}{(N - M_K)^2}\right] \sum_{j \in J} (M_j \rho_j)^2}{1 - \left(\epsilon + \frac{M_K}{N}\right)^2}. \qquad (15)$$

As a corollary of Theorem 2, in Proof A.2 it is shown that the asymptotic variance is strictly increasing with the sample size $M_K$ declared by the free-riders. In practice, the smaller the total number of data points declared by the free-riders, the closer the final aggregation result approaches the model obtained with fair clients only. On the contrary, when the sample size of the fair clients is negligible with respect to the one declared by the free-riders, i.e. $N \simeq M_K$, the variance tends to infinity. This is due to the ratio approaching 1 in the geometric sum of the second term of equation (13). In the limit case when only free-riders participate in federated learning ($J = \emptyset$), we obtain instead the trivial result $\tilde\theta^t = \theta^0$ and $\mathrm{Var}\left[\tilde\theta^t\right] = 0$. In this case there is no learning throughout the training process. Finally, with no free-riders ($M_K = 0$), we obtain $\mathrm{Var}\left[\tilde\theta_1^t - \theta_2^t\right] \xrightarrow{t \to +\infty} \frac{2}{N^2} \frac{1}{1 - \epsilon^2} \sum_{j \in J} (M_j \rho_j)^2$, reflecting the variability of the fair aggregation process due to the stochasticity of the local optimization processes.

2.5 Disguised free-riding

Plain free-riders can be easily detected by the server, since at each iteration the condition $[\tilde\theta_k^{t+1} - \tilde\theta^t = 0]$ holds. In what follows, we study improved attack strategies based on the sharing of opportunely disguised parameters, and investigate sufficient conditions on the disguising models to obtain the desired convergence behavior of free-rider attacks.

2.5.1 Additive noise to mimic SGD updates

A disguised free-rider with additive noise generalizes the plain one, and uploads parameters $\tilde\theta_k^{t+1} = \tilde\theta^t + \varphi_k(t) \varepsilon_t$. Here, the perturbation $\varepsilon_t$ is assumed to be Gaussian white noise, and $\varphi_k(t) > 0$ is a suitable time-varying perturbation compatible with the free-rider attack. As shown in equation (8), the parameters uploaded by the fair clients take the general form of an expected model corrupted by a stochastic perturbation due to SGD. Free-riders can mimic this update form by adopting a noise structure similar to the one of the fair clients:

$$\varphi_k^2(t) = \frac{\lambda}{S} {\sigma_k^t}^2 \frac{1}{2 r_k} \left[1 - e^{-2\lambda r_k \frac{E M_k}{S}}\right], \qquad (16)$$

where $r_k$ and $\sigma_k^t$ would ideally depend on the (non-existing) free-rider data distribution and thus need to be determined, while $M_k$ is the declared number of samples. Compatibly with the assumption of constant SGD variance $\sigma_j^2$ for the fair clients, we here assume that the free-rider noise is constant and compatible with the SGD form:

$$\varphi_k^2 = \frac{\lambda}{S} \sigma_k^2 \frac{1}{2 r_k} \left[1 - e^{-2\lambda r_k \frac{E M_k}{S}}\right]. \qquad (17)$$

The parameters $r_k$ and $\sigma_k$ affect the noise level and decay of the update, and thus the ability of the free-rider to mimic a realistic client. These parameters can ideally be estimated by computing a plausible quadratic approximation of the local loss function (Section 2.3). While the estimation may require the availability of some form of data for the free-rider, in Section 2.5.2 we prove that, for any combination of $r_k$ and $\sigma_k$, federated learning still converges to the desired aggregated target. Analogously as for the fair clients, this assumption will be relaxed in Section 2.5.3.
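The disguising level of equation (17) is straightforward to compute. The sketch below does so with attacker-chosen stand-ins for $r_k$ and $\sigma_k$; all numeric values are illustrative assumptions.

```python
import math

# Sketch of the disguising noise standard deviation phi_k from equation
# (17). r_k and sigma_k are attacker-chosen stand-ins for the curvature
# and SGD variance of a (non-existing) local loss; M_k is the declared
# sample count.
def disguise_std(lr, S, sigma_k, r_k, E, M_k):
    phi2 = lr / S * sigma_k ** 2 / (2 * r_k) \
        * (1 - math.exp(-2 * lr * r_k * E * M_k / S))
    return math.sqrt(phi2)

phi = disguise_std(lr=0.01, S=32, sigma_k=0.3, r_k=1.0, E=20, M_k=320)
# At each round the free-rider would then upload theta_tilde^t + phi * eps,
# with eps drawn from a standard Gaussian.
```

Note that equation (17) mirrors the fair clients' variance (9), so the counterfeited update is statistically indistinguishable from a genuine SGD update at the level of its first two moments.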

2.5.2 Attacks based on fixed additive stochastic perturbations

In this new setting, we can rewrite the FedAvg aggregation process (1) for an attack with a single free-rider with perturbation $\varphi$:

$$\tilde\theta^{t+1} = \sum_{j \in J} \frac{M_j}{N} \tilde\theta_j^{t+1} + \frac{M_K}{N} \tilde\theta^t + \frac{M_K}{N} \varphi \varepsilon_t. \qquad (18)$$

Theorem 3 extends the results previously obtained for federated learning with plain free-riders to our new case with additive perturbations.


Theorem 3 (Single disguised free-rider). Analogously to Theorem 2, the aggregation process under free-riding described in equation (18) converges in expectation to the aggregated model of the fair clients of equation (1):

$$\mathbb{E}\left[\tilde\theta^t - \theta^t\right] \xrightarrow{t \to +\infty} 0, \qquad (19)$$

$$\mathrm{Var}\left[\tilde\theta^t - \theta^t\right] \xrightarrow{t \to +\infty} \frac{\left[\frac{1}{N^2} + \frac{1}{(N - M_K)^2}\right] \sum_{j \in J} (M_j \rho_j)^2}{1 - \left(\epsilon + \frac{M_K}{N}\right)^2} + \frac{1}{1 - \left(\epsilon + \frac{M_K}{N}\right)^2} \frac{M_K^2}{N^2} \varphi^2. \qquad (20)$$

Theorem 3 shows that disguised free-riding converges to the final model of federated learning with fair clients, although with a higher variance resulting from the free-rider's perturbations injected at every iteration. The perturbation is proportional to $\frac{M_K}{N}$, the relative number of samples declared by the free-rider.

The extension of this result to the case of multiple free-riders requires accounting in equation (18) for an attack of the form $\sum_{k \in K} \frac{M_k}{N} \varphi_k \varepsilon_{k,t}$, where $M_k$ is the total sample size declared by free-rider $k$. Corollary 1 follows from the linearity of this form.

Corollary 1 (Multiple disguised free-riders). Assuming a constant perturbation factor $\varphi_k$ for each free-rider $k$, the asymptotic expectation of Theorem 3 still holds, while the variance becomes

$$\mathrm{Var}\left[\tilde\theta^t - \theta^t\right] \xrightarrow{t \to +\infty} \frac{\left[\frac{1}{N^2} + \frac{1}{(N - M_K)^2}\right] \sum_{j \in J} (M_j \rho_j)^2}{1 - \left(\epsilon + \frac{M_K}{N}\right)^2} + \frac{1}{1 - \left(\epsilon + \frac{M_K}{N}\right)^2} \sum_{k \in K} \frac{M_k^2}{N^2} \varphi_k^2. \qquad (21)$$

2.5.3 Time-varying noise model of fair clients' evolution

To investigate a more plausible parameter evolution in federated learning, in this section we relax the assumption made in Section 2.3 of a constant noise perturbation of the SGD process across iteration rounds.

We assume here that the standard deviation $\sigma_j^t$ of SGD decreases at each server iteration $t$, approaching zero over iteration rounds: $\sigma_j^t \xrightarrow{t \to +\infty} 0$. This assumption reflects the improvement of the fit of the global model $\tilde\theta^t$ to the local datasets over server iterations, and implies that the stochastic process of the local optimization of Section 2.3 has noise parameter $\rho_j^t \xrightarrow{t \to +\infty} 0$. We thus hypothesize that, to mimic the behavior of the fair clients, a suitable time-varying perturbation of the free-riders should follow a similar asymptotic behavior: $\varphi_k(t) \xrightarrow{t \to +\infty} 0$. Under these assumptions, Corollary 2 shows that the asymptotic variance of model aggregation under free-rider attacks is zero, and that it is thus still possible to retrieve the fair clients' model.

Corollary 2. Assuming that fair clients and free-riders evolve according to Sections 2.3 to 2.5, if the conditions $\rho_j^t \xrightarrow{t \to +\infty} 0$ and $\varphi_k(t) \xrightarrow{t \to +\infty} 0$ are met, the aggregation process of federated learning is such that the asymptotic variances of Theorems 2 and 3 reduce to

$$\mathrm{Var}\left[\tilde\theta^t - \theta^t\right] \xrightarrow{t \to +\infty} 0. \qquad (22)$$

We assumed in Corollary 2 that the SGD noise $\sigma_j^t$ decreases at each server iteration and eventually converges to 0. In practice, the global model may not fit perfectly the datasets of the different clients $\mathcal{D}_j$ and, after a sufficient number of optimization rounds, may keep oscillating around a local minimum. We could therefore assume that $\sigma_j^t \xrightarrow{t \to +\infty} \sigma_j$, leading to $\rho_j^t \xrightarrow{t \to +\infty} \rho_j$. In this case, to mimic the behavior of the fair clients, a suitable time-varying perturbation compatible with the free-rider attacks should converge to a fixed noise level such that $\varphi_k(t) \xrightarrow{t \to +\infty} \varphi_k$. Similarly as for Corollary 2, it can be shown that under these hypotheses federated learning follows the asymptotic behaviors of Theorems 2 and 3 for respectively plain and disguised free-riders.

2.6 FedProx

FedProx includes a regularization term for the local loss functions of the different clients, ensuring the proximity between the updated models $\theta_j^{t+1}$ and $\theta^t$. This regularization is usually defined as an additional L2 penalty term, and leads to the following form for the local gradient: $g_j(\theta_j) \simeq r_j[\theta_j - \theta_j^*] + \mu[\theta_j - \theta^t]$, where $\mu$ is a trade-off parameter. Since the considerations in Section 2.3 still hold in this setting, we can express the local model contribution for FedProx with a formulation analogous to the one of equation (8). Hence, for FedProx, we obtain similar conclusions for Theorems 2 and 3, as well as for Corollaries 1 and 2, proving that the convergence behavior with free-riders is equivalent to the one obtained with fair clients only, although with a different asymptotic variance (Appendix B).

Theorem 4. Assuming convergence in expectation for federated learning with fair clients only, under the assumptions of Theorem 1 the asymptotic properties of plain and disguised free-riding of Theorems 2 and 3, and Corollaries 1 and 2, still hold with FedProx. In this case we have parameters:

$$\rho_j^2 = \frac{\lambda}{S}\sigma_j^2\,\frac{1}{2(r_j+\mu)}\Big[1 - e^{-2\lambda(r_j+\mu)\frac{EM_j}{S}}\Big], \qquad (23)$$

$$\epsilon = \sum_{j\in J}\frac{M_j}{N}\Big[\gamma_j + \mu\frac{1-\gamma_j}{r_j+\mu}\Big], \qquad (24)$$

$$\text{and}\quad \gamma_j = e^{-\lambda(r_j+\mu)\frac{EM_j}{S}}. \qquad (25)$$
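To make the role of $\mu$ concrete, here is a minimal sketch (ours, with hypothetical names, plain NumPy) of a FedProx-style local update under the quadratic approximation above, where the proximal term $\mu[\theta_j - \theta^t]$ is simply added to each gradient step:

```python
import numpy as np

def fedprox_local_update(theta_global, theta_star_j, r_j, mu, lam=0.01, n_steps=50):
    """One client's local FedProx training under the quadratic approximation:
    g_j(theta) ~= r_j * (theta - theta_star_j) + mu * (theta - theta_global)."""
    theta = theta_global.copy()
    for _ in range(n_steps):
        grad = r_j * (theta - theta_star_j) + mu * (theta - theta_global)
        theta -= lam * grad
    return theta

theta_t = np.zeros(3)                       # current global model theta^t
theta_star = np.array([1.0, -2.0, 0.5])     # client's local optimum theta*_j
no_prox = fedprox_local_update(theta_t, theta_star, r_j=1.0, mu=0.0)
strong_prox = fedprox_local_update(theta_t, theta_star, r_j=1.0, mu=5.0)
# a larger mu keeps the local model closer to the global model theta^t
print(np.linalg.norm(no_prox - theta_t), np.linalg.norm(strong_prox - theta_t))
```

With $\mu = 0$ the update reduces to plain local SGD (FedAvg), while a large $\mu$ contracts the local model toward $\theta^t$, which is exactly the mechanism behind the reduced asymptotic variance in Theorem 4.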

We note that the asymptotic variance is still strictly increasing with the total number of free-rider samples. Moreover, the regularization term controls the asymptotic variance: a higher regularization leads to a smaller noise parameter $\rho_j^2$ and to a smaller $\epsilon$, thus decreasing the asymptotic variances of Theorems 2 and 3, and Corollaries 1 and 2.

Figure 1: Plots for Shakespeare and E = 20. Accuracy performances for FedAvg and FedProx according to the number of free-riders participating in the learning process: 15% (top), 50% (middle), and 90% (bottom) of the total amount of clients. The shaded blue region indicates the variability of the federated learning model with fair clients only, estimated from 30 different training initializations. (Panel rows: 1, 5, and 45 free-riders; columns: FedAvg and FedProx; curves: Only Fair, Plain, Disguised $\sigma$, $\gamma = 1$, and Disguised $3\sigma$, $\gamma = 1$.)

3 Experiments

This experimental section focuses on a series of benchmarks for the proposed free-rider attacks. The methods being of general application, the focus here is to empirically demonstrate our theory on diverse experimental setups and model specifications. All code, data and experiments are available at https://github.com/Accenture/Labs-Federated-Learning/tree/free-rider_attacks.

3.1 Experimental Details

We consider 5 fair clients for each of the following scenarios, investigated in previous works on federated learning [McMahan et al., 2017, Li et al., 2018]:

MNIST (classification in iid and non-iid settings). We study a standard classification problem on MNIST [LeCun et al., 1998] and create two benchmarks: an iid dataset (MNIST iid) where we assign 600 training digits and 300 testing digits to each client, and a non-iid dataset (MNIST non-iid), where for each digit we create two shards with 150 training samples and 75 testing samples, and allocate 4 shards to each client. For each scenario, we use a logistic regression predictor.
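The non-iid shard construction described above can be sketched as follows (a simplified illustration with a stand-in label vector, not the authors' released code):

```python
import numpy as np

def make_noniid_shards(labels, shard_size=150, shards_per_client=4, seed=0):
    """Split each digit's sample indices into shards, then deal shards to clients."""
    rng = np.random.default_rng(seed)
    shards = []
    for digit in range(10):
        idx = np.flatnonzero(labels == digit)
        rng.shuffle(idx)
        # two shards of `shard_size` training samples per digit
        shards.append(idx[:shard_size])
        shards.append(idx[shard_size:2 * shard_size])
    rng.shuffle(shards)
    n_clients = len(shards) // shards_per_client
    return [np.concatenate(shards[c * shards_per_client:(c + 1) * shards_per_client])
            for c in range(n_clients)]

labels = np.repeat(np.arange(10), 600)   # stand-in for the MNIST training labels
clients = make_noniid_shards(labels)
print(len(clients), [len(c) for c in clients])  # 5 clients, 600 samples each
```

With 2 shards per digit and 4 shards per client, the 20 shards yield the 5 clients of the benchmark, each holding 600 training digits drawn from at most 4 digit classes.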

Figure 2: Plots for Shakespeare and E = 20. Loss performances for FedAvg and FedProx according to the number of free-riders participating in the learning process: 15% (top), 50% (middle), and 90% (bottom) of the total amount of clients. (Curves: Only Fair, Plain, Disguised $\sigma$, $\gamma = 1$, and Disguised $3\sigma$, $\gamma = 1$.)

CIFAR-10 [Krizhevsky et al., ] (image classification). The dataset consists of 10 classes of 32x32 images with three RGB channels. There are 50000 training examples and 10000 testing examples, which we partitioned into 5 clients, each containing 10000 training and 2000 testing samples. The model architecture was taken from [McMahan et al., 2017] and consists of two convolutional layers and a linear transformation layer to produce logits.

Shakespeare (LSTM prediction). We study an LSTM model for next character prediction on the dataset of The Complete Works of William Shakespeare [McMahan et al., 2017]. We randomly chose 5 clients with more than 3000 samples, and assign 70% of the dataset to training and 30% to testing. Each client has on average 6415.4 samples (±1835.6). We use a two-layer LSTM classifier containing 100 hidden units with an 8-dimensional embedding layer. The model takes as input a sequence of 80 characters, embeds each character into a learned 8-dimensional space, and outputs one character per training sample after the 2 LSTM layers and a fully connected one.

We train federated models following the FedAvg and FedProx aggregation processes. In FedProx, the hyperparameter $\mu$ controlling the regularization is chosen according to the best performing scenario reported in [Li et al., 2018]: $\mu = 1$ for MNIST (iid and non-iid), and $\mu = 0.001$ for Shakespeare. For the free-rider we declare a number of samples equal to the average sample size across fair clients. We test federated learning with 5 and 20 local epochs, using SGD optimization with learning rate $\lambda = 0.001$ for MNIST (iid and non-iid), $\lambda = 0.001$ for CIFAR-10, and $\lambda = 0.5$ for Shakespeare, and a batch size of 100. We evaluate the success of the free-rider attacks by quantifying the testing accuracy and training loss of the resulting model, as indicators of the effect of the perturbation induced by free-riders on the final model performances. The resulting accuracy and loss figures can be found in Figure 1, Figure 2 and Appendix C.
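Since both FedAvg and FedProx weight each client's parameters by its declared sample size, a free-rider only needs to report a plausible number of samples to obtain a non-negligible aggregation weight. A minimal server-side sketch (ours) of this weighting:

```python
import numpy as np

def fedavg_aggregate(client_params, declared_sizes):
    """Server-side FedAvg: average client parameters weighted by the
    sample sizes each client declares (the server cannot verify them)."""
    w = np.asarray(declared_sizes, dtype=float)
    w /= w.sum()
    return sum(wi * p for wi, p in zip(w, client_params))

fair_updates = [np.array([1.0, 2.0]), np.array([3.0, 0.0])]
global_prev = np.array([0.0, 0.0])   # previous global model

fair_only = fedavg_aggregate(fair_updates, [100, 100])
# plain free-rider: returns the previous global model, declaring the
# fair clients' average sample size
attacked = fedavg_aggregate(fair_updates + [global_prev], [100, 100, 100])
print(fair_only, attacked)  # the free-rider drags the average toward global_prev
```

This also makes the "regularization" effect of free-riders visible: the attacked average is regressed toward the previous global model, which slows convergence without changing the fixed point.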

3.2 Free-rider attacks: convergence and performances

In the following experiments, we assume that free-riders do not have any data, which means that they cannot estimate the noise level by computing a plausible quadratic approximation of the local loss function (Section 2.5). Therefore, we investigate free-rider attacks taking the simple form $\phi(t) = \sigma t^{-\gamma}$. The parameter $\gamma$ is chosen among a panel of testing parameters $\gamma \in \{0.5, 1, 2\}$, while additional experimental material on the influence of $\gamma$ on the convergence is presented in Appendix C. While the optimal tuning of disguised free-rider attacks is out of the scope of this study, in what follows the perturbation parameter $\sigma$ is defined according to practical hypotheses on the parameters' evolution during federated learning. After random initialization at the initial federated learning step, the parameter $\sigma$ is opportunely estimated to mimic the extent of the distribution of the update $\Delta\tilde\theta^0 = \tilde\theta^1 - \tilde\theta^0$ observed between consecutive rounds of federated learning. We can simply model these increments as a zero-centered univariate Gaussian distribution, and assign the parameter $\sigma$ to the value of the fitted standard deviation. According to this strategy, the free-rider would return parameters $\tilde\theta_k^t$ with perturbations distributed as the ones observed between two consecutive optimization rounds. Figure 1, top row, exemplifies the evolution of the models obtained with FedAvg (20 local training epochs) on the Shakespeare dataset with respect to different scenarios: 1) fair clients only, 2) plain free-rider, 3) disguised free-rider with decay parameter $\gamma = 1$ and estimated noise level $\sigma$, and 4) disguised free-rider with noise level increased to $3\sigma$. For each scenario, we compare the federated model obtained under free-rider attacks with the equivalent model obtained with the participation of the fair clients only. For this latter setting, to assess the model training variability, we repeated the training 30 times with different parameter initializations.
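The $\sigma$-fitting heuristic described above can be sketched as follows (a simplified illustration with names of our choosing: the free-rider fits $\sigma$ on the first observed global update and then returns the global model plus noise of scale $\phi(t) = \sigma t^{-\gamma}$):

```python
import numpy as np

def fit_sigma(theta_1, theta_0):
    """Fit sigma as the standard deviation of the first observed global update
    Delta theta_0 = theta_1 - theta_0, modeled as a zero-centered Gaussian."""
    return np.std(theta_1 - theta_0)

def disguised_update(theta_global, sigma, t, gamma=1.0, rng=None):
    """Free-rider's round-t answer: global model plus phi(t) = sigma * t**-gamma noise."""
    rng = rng if rng is not None else np.random.default_rng()
    phi_t = sigma * t ** (-gamma)
    return theta_global + phi_t * rng.normal(size=theta_global.shape)

rng = np.random.default_rng(0)
theta_0 = rng.normal(size=1000)          # stand-ins for two consecutive
theta_1 = rng.normal(size=1000)          # global parameter vectors
sigma = fit_sigma(theta_1, theta_0)
fake = disguised_update(theta_1, sigma, t=3, gamma=1.0, rng=rng)
print(sigma, np.std(fake - theta_1))     # perturbation scale is roughly sigma / 3
```

With $\gamma = 1$ the perturbation at round $t$ has scale $\sigma/t$, so the free-rider's returned parameters mimic the shrinking round-to-round increments of a genuinely training client.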
The results show that, independently from the chosen free-riding strategy, the resulting models attain comparable performances with respect to the model obtained with fair clients only (Figure 1, top row). Similar results are obtained for the setup with 5 local training epochs and different values of $\gamma$, as well as for FedProx with 5 and 20 local epochs (Appendix C).

We also investigate the same training setup under the influence of multiple free-riders (Figure 1, middle and bottom rows). In particular, we test the scenarios where the free-riders declare respectively 50% and 90% of the total training sample size. In practice, we maintain the same experimental setting composed of 5 fair clients, and we increase the number of free-riders to respectively 5 and 45, while declaring for each free-rider a sample size equal to the average number of samples of the fair clients. Independently from the magnitude of the perturbation function, the number of free-riders does not seem to affect the performance of the final aggregated model. However, the convergence speed is greatly decreased. Figure 2 shows that the convergence in these different settings is not identically affected by the free-riders. When the share of free-riders is moderate, e.g. up to 50% of the total sample size, the convergence speed of the loss is slightly slower than for federated learning with fair clients. The attacks can still be considered successful, as convergence is achieved within the pre-defined iteration budget. However, when the share of free-riders reaches 90%, convergence to the optimum is extremely slow and cannot be achieved anymore in a reasonable number of iterations. This result is in agreement with our theory, for which the convergence speed is inversely proportional to the relative size of the free-riders. Interestingly, we note that the final accuracy obtained in all the scenarios is similar (though a bit lower with 90% of free-riders), and falls within the variability observed in federated learning with fair clients only (Figure 1). This result is achieved in spite of the incomplete convergence during training. This effect can be explained by observing that this accuracy level is already reached at the early training stages of federated learning with fair clients, while further training does not seem to improve the predictions. This result suggests that, in spite of the very low convergence speed, the averaging process with 90% of free-riders still achieves a reasonable minimum compatible with the training path of the fair clients' aggregation.

We note that the "peaks" observed in the loss of Figure 2 are common in FL, especially in the considered application when the number of clients is low. It is important to notice that our experiments are performed using vanilla SGD. As such, the peaks for fair clients only are to be expected in both loss and performances. We also notice that the peaks are smaller for free-riding because of the "regularization" effect of free-riders, which regresses the update towards the global model of the previous iteration.

Analogous results and considerations can be derived from the set of experiments on the remaining datasets, training parameters, and FedProx as an aggregation scheme (Appendix C).

4 Conclusion and discussion

We introduced a theoretical framework for the study of free-rider attacks on model aggregation in federated learning. Based on the proposed methodology, we proved that simple strategies based on returning the global model at each iteration already lead to successful free-rider attacks (plain free-riding), and we investigated more sophisticated disguising techniques relying on stochastic perturbations of the parameters (disguised free-riding). The convergence of each attack was demonstrated through theoretical developments and experimental results. The threat of free-rider attacks is still under-investigated in machine learning. For example, current defence schemes in federated learning are mainly based on outlier detection mechanisms, aimed at spotting malicious attackers providing abnormal updates. These schemes would therefore be unsuccessful in detecting a free-rider update which is, by design, equivalent to the global federated model.

This work opens the way to the investigation of optimal disguising and defense strategies for free-rider attacks, beyond the proposed heuristics. Our experiments show that inspection of the clients' distribution should be established as a routine practice for the detection of free-rider attacks in federated learning. Further research directions are represented by the improvement of detection at the server level, through better modeling of the heterogeneity of the incoming clients' parameters. This study also provides the theoretical basis for the study of effective free-riding strategies, based on different noise model distributions and perturbation schemes. Finally, in this work we relied on a number of hypotheses concerning the evolution of the clients' parameters during federated learning. This choice provides us with a convenient theoretical setup for the formalization of the proposed theory, which may be modified in the future, for example, for investigating more complex forms of variability and schemes for parameters aggregation.

Acknowledgments and Disclosure of Funding

This work has been supported by the French government, through the 3IA Côte d'Azur Investments in the Future project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002, and by the ANR JCJC project Fed-BioMed 19-CE45-0006-01. The project was also supported by Accenture. The authors are grateful to the OPAL infrastructure from Université Côte d'Azur for providing resources and support.

References

[Ateniese et al., 2015] Ateniese, G., Mancini, L. V., Spognardi, A., Villani, A., Vitali, D., and Felici, G. (2015). Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers. International Journal of Security and Networks, 10(3):137–150.

[Bhagoji et al., 2019] Bhagoji, A. N., Chakraborty, S., Mittal, P., and Calo, S. (2019). Analyzing federated learning through an adversarial lens. 36th International Conference on Machine Learning, ICML 2019, 2019-June:1012–1021.

[Brisimi et al., 2018] Brisimi, T., Chen, R., Mela, T., Olshevsky, A., Paschalidis, I., and Shi, W. (2018). Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics, 112.

[Carlini et al., 2019] Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., and Song, D. (2019). The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pages 267–284, Santa Clara, CA. USENIX Association.

[Fredrikson et al., 2015] Fredrikson, M., Jha, S., and Ristenpart, T. (2015). Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS '15, pages 1322–1333, New York, NY, USA. Association for Computing Machinery.

[Fredrikson et al., 2014] Fredrikson, M., Lantz, E., Jha, S., Lin, S., Page, D., and Ristenpart, T. (2014). Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In 23rd USENIX Security Symposium (USENIX Security 14), pages 17–32, San Diego, CA. USENIX Association.

[Fung et al., 2020] Fung, C., Yoon, C. J. M., and Beschastnikh, I. (2020). The limitations of federated learning in sybil settings. In 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020), pages 301–316, San Sebastian. USENIX Association.

[He et al., 2018] He, L., Meng, Q., Chen, W., Ma, Z. M., and Liu, T. Y. (2018). Differential equations for modeling asynchronous algorithms. IJCAI International Joint Conference on Artificial Intelligence, 2018-July(1):2220–2226.

[Hitaj et al., 2017] Hitaj, B., Ateniese, G., and Perez-Cruz, F. (2017). Deep models under the GAN: Information leakage from collaborative deep learning. Proceedings of the ACM Conference on Computer and Communications Security, pages 603–618.

[Krizhevsky et al., ] Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research).

[LeCun et al., 1998] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

[Li et al., 2016] Li, B., Wang, Y., Singh, A., and Vorobeychik, Y. (2016). Data poisoning attacks on factorization-based collaborative filtering. Advances in Neural Information Processing Systems, pages 1893–1901.

[Li et al., 2017] Li, Q., Tai, C., and Weinan, E. (2017). Stochastic modified equations and adaptive stochastic gradient algorithms. 34th International Conference on Machine Learning, ICML 2017, 5:3306–3340.

[Li et al., 2018] Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. (2018). Federated optimization in heterogeneous networks. Proceedings of the 1st Adaptive & Multitask Learning Workshop, Long Beach, California, 2019, pages 1–28.

[Lin et al., 2019] Lin, J., Du, M., and Liu, J. (2019). Free-riders in federated learning: Attacks and defenses. http://arxiv.org/abs/1911.12560.

[Mandt et al., 2017] Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18:1–35.

[McMahan et al., 2017] McMahan, H., Moore, E., Ramage, D., Hampson, S., and Agüera y Arcas, B. (2017). Communication-efficient learning of deep networks from decentralized data. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 54.

[Orvieto and Lucchi, 2018] Orvieto, A. and Lucchi, A. (2018). Continuous-time models for stochastic optimization algorithms. (NeurIPS).

[Shen et al., 2016] Shen, S., Tople, S., and Saxena, P. (2016). AUROR: Defending against poisoning attacks in collaborative deep learning systems. In ACM International Conference Proceeding Series, pages 508–519.

[Silva et al., 2019] Silva, S., Gutman, B. A., Romero, E., Thompson, P. M., Altmann, A., and Lorenzi, M. (2019). Federated learning in distributed medical databases: Meta-analysis of large-scale subcortical brain data. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pages 270–274. IEEE.

[Uhlenbeck and Ornstein, 1930] Uhlenbeck, G. E. and Ornstein, L. S. (1930). On the theory of the Brownian motion. Phys. Rev., 36:823–841.

[Wang et al., 2019] Wang, Z., Song, M., Zhang, Z., Song, Y., Wang, Q., and Qi, H. (2019). Beyond inferring class representatives: User-level privacy leakage from federated learning. Proceedings - IEEE INFOCOM, 2019-April:2512–2520.

[Xie et al., 2019] Xie, C., Huang, K., Chen, P.-Y., and Li, B. (2019). DBA: Distributed backdoor attacks against federated learning. In International Conference on Learning Representations.

[Yin et al., 2018] Yin, D., Chen, Y., Ramchandran, K., and Bartlett, P. (2018). Byzantine-robust distributed learning: Towards optimal statistical rates. 35th International Conference on Machine Learning, ICML 2018, 13:8947–8956.


A Complete Proofs for FedAvg

A.1 Proof of Theorem 1

We prove by induction that:

$$\tilde\theta^t - \theta^t = \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1} f(\theta^i) + \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}(\tilde\nu_i - \nu_i), \qquad (26)$$

with $f(\theta^t) = \frac{M_K}{N}\Big[\theta^t - \sum_{j\in J}\frac{M_j}{N-M_K}\big[\eta_j(\theta^t - \theta_j^*) + \theta_j^*\big]\Big]$, $\epsilon = \sum_{j\in J}\frac{M_j}{N}\eta_j$, $\nu_t = \sum_{j\in J}\frac{M_j}{N-M_K}\rho_j\zeta_{j,t}$ and $\tilde\nu_t = \sum_{j\in J}\frac{M_j}{N}\rho_j\tilde\zeta_{j,t}$. By definition of $\theta^{t+1}$, $\mathbb{E}[f(\theta^t)] = \frac{M_K}{N}\big[\mathbb{E}[\theta^t] - \mathbb{E}[\theta^{t+1}]\big]$.

Proof. Server iteration $t = 1$

Using the fair clients' local model parameters evolution of Section 2.3 and the server aggregation process expressed in equation (10), the global model can be written as

$$\theta^1 = \sum_{j\in J}\frac{M_j}{N-M_K}\big[\eta_j(\theta^0 - \theta_j^*) + \theta_j^*\big] + \nu_0. \qquad (27)$$

Similarly, the global model for federated learning with plain free-riders can be expressed as

$$\tilde\theta^1 = \sum_{j\in J}\frac{M_j}{N}\big[\eta_j(\theta^0 - \theta_j^*) + \theta_j^*\big] + \frac{M_K}{N}\theta^0 + \tilde\nu_0. \qquad (28)$$

By subtracting equation (27) from equation (28), we obtain:

$$\tilde\theta^1 - \theta^1 = -\frac{M_K}{N}\sum_{j\in J}\frac{M_j}{N-M_K}\big[\eta_j(\theta^0 - \theta_j^*) + \theta_j^*\big] + \frac{M_K}{N}\theta^0 + \tilde\nu_0 - \nu_0. \qquad (29)$$

Hence, $\tilde\theta^1 - \theta^1$ follows the formalization.

From $t$ to $t + 1$

We suppose the property true at server iteration $t$. Hence, we get:

$$\tilde\theta^t - \theta^t = \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1} f(\theta^i) + \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}(\tilde\nu_i - \nu_i). \qquad (30)$$

With the same reasoning as for $t = 1$, we get:

$$\theta^{t+1} = \sum_{j\in J}\frac{M_j}{N-M_K}\big[\eta_j(\theta^t - \theta_j^*) + \theta_j^*\big] + \nu_t \qquad (31)$$

and

$$\tilde\theta^{t+1} = \sum_{j\in J}\frac{M_j}{N}\big[\eta_j(\tilde\theta^t - \theta_j^*) + \theta_j^*\big] + \frac{M_K}{N}\tilde\theta^t + \tilde\nu_t. \qquad (32)$$

By using equation (30) in equation (32), we get:

$$\begin{aligned}
\tilde\theta^{t+1} ={}& \sum_{j\in J}\frac{M_j}{N}\big[\eta_j(\theta^t - \theta_j^*) + \theta_j^*\big] + \epsilon\sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1} f(\theta^i) \\
&+ \epsilon\sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}(\tilde\nu_i - \nu_i) + \frac{M_K}{N}\theta^t \\
&+ \frac{M_K}{N}\sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1} f(\theta^i) + \frac{M_K}{N}\sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}(\tilde\nu_i - \nu_i) + \tilde\nu_t, \qquad (33)
\end{aligned}$$

which can be rewritten as:

$$\tilde\theta^{t+1} = \sum_{j\in J}\frac{M_j}{N}\big[\eta_j(\theta^t - \theta_j^*) + \theta_j^*\big] + \Big[\epsilon + \frac{M_K}{N}\Big]\sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1} f(\theta^i) + \Big[\epsilon + \frac{M_K}{N}\Big]\sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}(\tilde\nu_i - \nu_i) + \frac{M_K}{N}\theta^t + \tilde\nu_t, \qquad (34)$$

leading to

$$\tilde\theta^{t+1} = \sum_{j\in J}\frac{M_j}{N}\big[\eta_j(\theta^t - \theta_j^*) + \theta_j^*\big] + \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i} f(\theta^i) + \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i}(\tilde\nu_i - \nu_i) + \frac{M_K}{N}\theta^t + \tilde\nu_t. \qquad (35)$$

By subtracting equation (31) from equation (35), we obtain:

$$\tilde\theta^{t+1} - \theta^{t+1} = -\frac{M_K}{N}\sum_{j\in J}\frac{M_j}{N-M_K}\big[\eta_j(\theta^t - \theta_j^*) + \theta_j^*\big] + \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i} f(\theta^i) + \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i}(\tilde\nu_i - \nu_i) + \frac{M_K}{N}\theta^t + \tilde\nu_t - \nu_t. \qquad (36)$$

Given that $-\frac{M_K}{N}\sum_{j\in J}\frac{M_j}{N-M_K}\big[\eta_j(\theta^t - \theta_j^*) + \theta_j^*\big] + \frac{M_K}{N}\theta^t = f(\theta^t)$, we get:

$$\tilde\theta^{t+1} - \theta^{t+1} = \sum_{i=0}^{t}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i} f(\theta^i) + \sum_{i=0}^{t}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i}(\tilde\nu_i - \nu_i). \qquad (37)$$

A.2 Proof of Theorem 2

Proof. Expected value

Let us first have a look at the expected value. By definition, a sum of Gaussian distributions with zero mean has zero mean, so $\mathbb{E}[\nu_i] = 0$ and $\mathbb{E}[\tilde\nu_i] = 0$. We also notice that $\mathbb{E}[f(\theta^t)] = \frac{M_K}{N}\big[\mathbb{E}[\theta^t] - \mathbb{E}[\theta^{t+1}]\big]$. Hence, we obtain

$$\mathbb{E}\big[\tilde\theta^t - \theta^t\big] = \frac{M_K}{N}\sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}\mathbb{E}\big[\theta^i - \theta^{i+1}\big]. \qquad (38)$$

We consider that federated learning is converging, hence $\big|\mathbb{E}[\theta^t] - \mathbb{E}[\theta^{t+1}]\big| \xrightarrow{t\to+\infty} 0$, and for any positive $\alpha$ there exists $N_0$ such that $\big|\mathbb{E}[\theta^t - \theta^{t+1}]\big| < \alpha$ for $t \ge N_0$. Since $\eta_j \in ]0,1[$, we have $\epsilon \in \big]0, \frac{N-M_K}{N}\big[$ and $\epsilon + \frac{M_K}{N} \in ]0,1[$. Thus, we can rewrite equation (38) as

$$\Big|\mathbb{E}\big[\tilde\theta^t - \theta^t\big]\Big| \le \sum_{i=0}^{N_0-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}\big|\mathbb{E}[\theta^i] - \mathbb{E}[\theta^{i+1}]\big| + \sum_{i=N_0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}\alpha. \qquad (39)$$

We define $R_\alpha = \max_{i\in[1,N_0]}\big|\mathbb{E}[\theta^i] - \mathbb{E}[\theta^{i+1}]\big|$, and get:

$$\Big|\mathbb{E}\big[\tilde\theta^t - \theta^t\big]\Big| \le R_\alpha\underbrace{\sum_{i=0}^{N_0-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}}_{A} + \underbrace{\sum_{i=N_0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}}_{B}\,\alpha. \qquad (40)$$

• Expressing $A$:

$$A = \sum_{i=0}^{N_0-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1} \qquad (41)$$
$$= \Big(\epsilon + \frac{M_K}{N}\Big)^{t-1}\,\frac{1 - \big(\epsilon + \frac{M_K}{N}\big)^{-N_0}}{1 - \big(\epsilon + \frac{M_K}{N}\big)^{-1}} \qquad (42)$$
$$\xrightarrow{t\to+\infty} 0. \qquad (43)$$

• Expressing $B$:

$$B = \sum_{i=N_0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1} \qquad (44)$$
$$= \Big(\epsilon + \frac{M_K}{N}\Big)^{t-N_0-1}\,\frac{1 - \big(\epsilon + \frac{M_K}{N}\big)^{-(t-N_0)}}{1 - \big(\epsilon + \frac{M_K}{N}\big)^{-1}} \qquad (45)$$
$$= \frac{1 - \big(\epsilon + \frac{M_K}{N}\big)^{t-N_0}}{1 - \big(\epsilon + \frac{M_K}{N}\big)} \qquad (46)$$
$$\xrightarrow{t\to+\infty} \frac{1}{1 - \big(\epsilon + \frac{M_K}{N}\big)} > 0. \qquad (47)$$

Using equations (43) and (47) in equation (40), we get:

$$\forall\alpha,\quad \lim_{t\to+\infty}\Big|\mathbb{E}\big[\tilde\theta^t - \theta^t\big]\Big| \le B\alpha, \qquad (48)$$

which is equivalent to

$$\lim_{t\to+\infty}\mathbb{E}\big[\tilde\theta^t - \theta^t\big] = 0. \qquad (49)$$
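As a quick numerical sanity check of the geometric sums used above (our own snippet; an arbitrary value of 0.7 stands in for $\epsilon + M_K/N$), the closed form of equation (46) can be verified directly:

```python
def tail_sum(t, N0, x):
    """B = sum_{i=N0}^{t-1} x**(t-i-1), the tail sum of equation (44)."""
    return sum(x ** (t - i - 1) for i in range(N0, t))

x = 0.7                     # plays the role of eps + M_K / N, inside ]0, 1[
t, N0 = 200, 10
closed_form = (1 - x ** (t - N0)) / (1 - x)   # equation (46)
print(tail_sum(t, N0, x), closed_form, 1 / (1 - x))  # B matches (46), tends to 1/(1-x)
```

For $x \in ]0,1[$ the tail sum saturates at $1/(1-x)$, which is exactly the limit (47) used to bound the expected gap.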

Variance

The Wiener processes $\nu_i$ and $\tilde\nu_i$ are independent from the server model parameters $\theta^i$. Also, each Wiener process is independent from the other Wiener processes. Hence, we get:

$$\mathrm{Var}\big[\tilde\theta^t - \theta^t\big] = \underbrace{\mathrm{Var}\Big[\sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1} f(\theta^i)\Big]}_{E} + \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{2(t-i-1)}\underbrace{\mathrm{Var}\big[\tilde\nu_i - \nu_i\big]}_{F}. \qquad (50)$$

Expressing $E$. Before getting a simpler expression for $E$, we need to consider $\mathrm{Cov}\big[f(\theta^l), f(\theta^m)\big]$. To do so, we first consider $f(\theta^t) - \mathbb{E}[f(\theta^t)]$:

$$f(\theta^t) - \mathbb{E}\big[f(\theta^t)\big] = \underbrace{\frac{M_K}{N}\Big[1 - \sum_{j\in J}\frac{M_j}{N-M_K}\eta_j\Big]}_{G}\big[\theta^t - \mathbb{E}[\theta^t]\big]. \qquad (51)$$

We can prove by induction that $\theta^t - \mathbb{E}[\theta^t] = \sum_{i=0}^{t-1}\Big(\sum_{j\in J}\frac{M_j}{N-M_K}\eta_j\Big)^{t-i-1}\nu_i = \sum_{i=0}^{t-1}\epsilon^{t-i-1}\nu_i$. All the $\nu_i$ are independent across each other and have zero mean, hence:

$$\mathrm{Cov}\big[f(\theta^l), f(\theta^m)\big] = G^2\sum_{i=0}^{\min\{l-1,m-1\}}\epsilon^{l+m-2i-2}\,\mathbb{E}\big[\nu_i^2\big]. \qquad (52)$$

Considering that $\mathbb{E}[\nu_i^2] = \mathrm{Var}[\nu_i] = \sum_{j\in J}\Big(\frac{M_j}{N-M_K}\rho_j\Big)^2$, we get:

$$\mathrm{Cov}\big[f(\theta^l), f(\theta^m)\big] = G^2\sum_{j\in J}\Big(\frac{M_j}{N-M_K}\rho_j\Big)^2\sum_{i=0}^{\min\{l-1,m-1\}}\epsilon^{l+m-2i-2}. \qquad (53)$$

We define $G' = G^2\sum_{j\in J}\Big(\frac{M_j}{N-M_K}\rho_j\Big)^2$. Given that $\epsilon \in ]0,1[$, we get the following upper bound:

$$\mathrm{Cov}\big[f(\theta^l), f(\theta^m)\big] \le G'\min\{l,m\}. \qquad (54)$$

By denoting $H = \epsilon + \frac{M_K}{N}$, we can rewrite $E$ as:

$$E = \sum_{l=0}^{t-1}\sum_{m=0}^{t-1} H^{2(t-1)-l-m}\,\mathrm{Cov}\big[f(\theta^l), f(\theta^m)\big] \qquad (55)$$
$$\le \sum_{l=0}^{t-1}\sum_{m=0}^{t-1} H^{2(t-1)-l-m}\,G'\min\{l,m\}. \qquad (56)$$

Considering that $\min\{l,m\} \le l$, we get:

$$E \le G'\sum_{l=0}^{t-1}\sum_{m=0}^{t-1} H^{2(t-1)-l-m}\,l \qquad (57)$$
$$= G' H^{2(t-1)}\sum_{l=0}^{t-1} H^{-l}\,l\,\sum_{m=0}^{t-1} H^{-m} \qquad (58)$$
$$= G' H^{2(t-1)}\sum_{l=0}^{t-1} H^{-l}\,l\,\frac{1-H^{-t}}{1-H^{-1}} \qquad (59)$$
$$= G' H^{2(t-1)}\,\frac{1-H^{-t}}{1-H^{-1}}\sum_{l=0}^{t-1} H^{-l}\,l. \qquad (60)$$

Considering the power series $\sum_{n=0}^{+\infty} n x^n = \frac{x}{(1-x)^2}$, we get that $\sum_{l=0}^{t-1} H^{-l}\,l = \frac{H^{-1}}{(1-H^{-1})^2}$. Hence, $E$'s upper bound goes to 0. Given that $E$ is non-negative, we get:

$$E \xrightarrow{t\to+\infty} 0. \qquad (61)$$

Expressing $F$. Let us first consider the noise coming from the SGD steps. All the $\tilde\nu_i$ are independent from the $\nu_i$. Hence, we have

$$F = \mathrm{Var}\big[\tilde\nu_i - \nu_i\big] \qquad (62)$$
$$= \mathrm{Var}\Big[\sum_{j\in J}\frac{M_j}{N}\rho_j\tilde\zeta_{j,i} - \sum_{j\in J}\frac{M_j}{N-M_K}\rho_j\zeta_{j,i}\Big] \qquad (63)$$
$$= \Big[\frac{1}{N^2} + \frac{1}{(N-M_K)^2}\Big]\sum_{j\in J}(M_j\rho_j)^2. \qquad (64)$$

Replacing (64) in equation (50), we can express the variance as

$$\mathrm{Var}\big[\tilde\theta^t - \theta^t\big] = E + F\sum_{i=0}^{t-1} H^{2(t-i-1)} \qquad (65)$$
$$= E + F H^{2(t-1)}\sum_{i=0}^{t-1} H^{-2i} \qquad (66)$$
$$= E + F H^{2(t-1)}\,\frac{1-H^{-2t}}{1-H^{-2}} \qquad (67)$$
$$= E + F\,\frac{1-H^{2t}}{1-H^2}. \qquad (68)$$

By replacing $F$ and $H$ with their respective expressions, we can conclude that

$$\mathrm{Var}\big[\tilde\theta^t - \theta^t\big] \xrightarrow{t\to+\infty} \frac{\big[\frac{1}{N^2} + \frac{1}{(N-M_K)^2}\big]\sum_{j\in J}(M_j\rho_j)^2}{1 - \big(\epsilon + \frac{M_K}{N}\big)^2}. \qquad (69)$$

Note 1: The asymptotic variance is strictly increasing with the number of data points declared by the free-riders, $M_K$.

While the $M_j$ and $\rho_j$ are constants, independent from the number of free-riders and from their respective number of data points, $N$ and $\epsilon$ depend on the total number of free-riders' samples $M_K$. We first rewrite $\epsilon = \frac{1}{N}\alpha$ with $\alpha = \sum_{j\in J} M_j\eta_j$ not depending on $M_K$, and we get:

$$\epsilon + \frac{M_K}{N} = \frac{1}{N}\big[\alpha + M_K\big]. \qquad (70)$$

By defining $M_J = \sum_{j\in J} M_j$, we get:

$$1 - \Big(\epsilon + \frac{M_K}{N}\Big)^2 = \frac{1}{N^2}\big[M_J^2 + 2M_K[M_J - \alpha] - \alpha^2\big], \qquad (71)$$

with $M_J - \alpha > 0$ because $\eta_j \in ]0,1[$.

Also, considering that

$$\frac{1}{N^2} + \frac{1}{(N-M_K)^2} = \frac{1}{N^2}\Big[\frac{M_K^2}{M_J^2} + 2\frac{M_K}{M_J} + 2\Big], \qquad (72)$$

we can rewrite

$$\frac{\frac{1}{N^2} + \frac{1}{(N-M_K)^2}}{1 - \big(\epsilon + \frac{M_K}{N}\big)^2} = \frac{\frac{M_K^2}{M_J^2} + 2\frac{M_K}{M_J} + 2}{M_J^2 + 2M_K[M_J - \alpha] - \alpha^2}. \qquad (73)$$

As the numerator is a polynomial of order 2 in $M_K$ and the denominator is a polynomial of order 1 in $M_K$, the asymptotic variance is increasing with $M_K$.

Note 2: When considering that the SGD noise variance is different for federated learning with and without free-riders, we get:

$$F = \frac{1}{N^2}\sum_{j\in J}(M_j\tilde\rho_j)^2 + \frac{1}{(N-M_K)^2}\sum_{j\in J}(M_j\rho_j)^2. \qquad (74)$$

A.3 Proof of Theorem 3

Proof. Relation between the global models of federated learning with and without free-riders

With a reasoning by induction similar to Proof A.1, we get:

$$\tilde\theta^t - \theta^t = \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1} f(\theta^i) \qquad (75)$$
$$+ \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}(\tilde\nu_i - \nu_i) \qquad (76)$$
$$+ \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}\frac{M_K}{N}\phi\,\epsilon_i. \qquad (77)$$

Expected value

$\epsilon_t$ is a delta-correlated Gaussian white noise, which implies that $\mathbb{E}[\epsilon_t] = 0$. Following the same reasoning steps as in Proof A.2, we get:

$$\lim_{t\to+\infty}\mathbb{E}\big[\tilde\theta^t - \theta^t\big] = 0. \qquad (78)$$

Variance

All the $\epsilon_t$ are independent Gaussian white noises, implying $\mathrm{Var}[\epsilon_t] = 1$. Following the same reasoning steps as in Proof A.2, we get:

$$\mathrm{Var}\Big[\sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}\frac{M_K}{N}\phi\,\epsilon_i\Big] = \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{2(t-i-1)}\frac{M_K^2}{N^2}\phi^2 \qquad (79)$$
$$= \Big(\epsilon + \frac{M_K}{N}\Big)^{2(t-1)}\,\frac{1 - \big(\epsilon + \frac{M_K}{N}\big)^{-2t}}{1 - \big(\epsilon + \frac{M_K}{N}\big)^{-2}}\,\frac{M_K^2}{N^2}\phi^2 \qquad (80)$$
$$= \frac{1 - \big(\epsilon + \frac{M_K}{N}\big)^{2t}}{1 - \big(\epsilon + \frac{M_K}{N}\big)^{2}}\,\frac{M_K^2}{N^2}\phi^2 \qquad (81)$$
$$\xrightarrow{t\to+\infty} \frac{1}{1 - \big(\epsilon + \frac{M_K}{N}\big)^{2}}\,\frac{M_K^2}{N^2}\phi^2. \qquad (82)$$

As for equation (50), all the $\epsilon_t$ are independent from $\nu_t$, from $\tilde\nu_t$, and from the global model parameters $\theta^t$. Hence, for one disguised free-rider, we get the following asymptotic variance:

$$\mathrm{Var}\big[\tilde\theta^t - \theta^t\big] \xrightarrow{t\to+\infty} \frac{\big[\frac{1}{N^2} + \frac{1}{(N-M_K)^2}\big]\sum_{j\in J}(M_j\rho_j)^2}{1 - \big(\epsilon + \frac{M_K}{N}\big)^2} + \frac{1}{1 - \big(\epsilon + \frac{M_K}{N}\big)^2}\,\frac{M_K^2}{N^2}\phi^2. \qquad (83)$$

A.4 Proof of Corollary 1

Proof. Relation between the global models of federated learning with and without free-riders

With a reasoning by induction similar to Proof A.1, we get:

$$\tilde\theta^t - \theta^t = \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1} f(\theta^i) \qquad (84)$$
$$+ \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}(\tilde\nu_i - \nu_i) \qquad (85)$$
$$+ \sum_{k\in K}\sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}\frac{M_k}{N}\phi_k\,\epsilon_{k,i}. \qquad (86)$$

Expected value

The $\epsilon_{k,t}$ are delta-correlated Gaussian white noises, which implies that $\mathbb{E}[\epsilon_{k,t}] = 0$. Following the same reasoning steps as in Proof A.2, we get:

$$\lim_{t\to+\infty}\mathbb{E}\big[\tilde\theta^t - \theta^t\big] = 0. \qquad (87)$$

Variance

All the $\epsilon_{k,t}$ are independent Gaussian white noises over server iterations $t$ and free-rider indices $k$, implying $\mathrm{Var}[\epsilon_{k,t}] = 1$. Following the same reasoning steps as in Proof A.2, we get:

$$\mathrm{Var}\Big[\sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}\frac{M_k}{N}\phi_k\,\epsilon_{k,i}\Big] \xrightarrow{t\to+\infty} \frac{1}{1 - \big(\epsilon + \frac{M_K}{N}\big)^2}\,\frac{M_k^2}{N^2}\phi_k^2. \qquad (88)$$

Like for equation (50), all the $\epsilon_{k,t}$ are independent from $\nu_t$, $\tilde\nu_t$, and the global model parameters $\theta^t$. Hence, for multiple disguised free-riders, we get the following asymptotic variance:

$$\mathrm{Var}\big[\tilde\theta^t - \theta^t\big] \xrightarrow{t\to+\infty} \frac{\big[\frac{1}{N^2} + \frac{1}{(N-M_K)^2}\big]\sum_{j\in J}(M_j\rho_j)^2}{1 - \big(\epsilon + \frac{M_K}{N}\big)^2} + \frac{1}{1 - \big(\epsilon + \frac{M_K}{N}\big)^2}\sum_{k\in K}\frac{M_k^2}{N^2}\phi_k^2. \qquad (89)$$

A.5 Proof of Corollary 2

Proof. Relation between the global models of federated learning with and without free-riders

The relation remains the same as for Theorem 2, Theorem 3, and Corollary 1, by replacing $\rho_j$ with its time-dependent counterpart $\rho_j^t$ in $\nu_t$ and $\tilde\nu_t$, and $\phi_k$ by $\phi_k(t)$ for disguised free-riding.

Expected value

With $\rho_j^t$ and $\phi_k(t)$, the properties of $\tilde\nu_t$, $\nu_t$, $\epsilon_t$ and $\epsilon_{k,t}$ remain identical. Hence, they are still delta-correlated Gaussian white noises, implying that $\mathbb{E}[\tilde\nu_t] = \mathbb{E}[\nu_t] = \mathbb{E}[\epsilon_t] = \mathbb{E}[\epsilon_{k,t}] = 0$. Hence, for Theorem 2, Theorem 3, and Corollary 1, we get:

$$\lim_{t\to+\infty}\mathbb{E}\big[\tilde\theta^t - \theta^t\big] = 0. \qquad (90)$$

Variance

The asymptotic behaviour of the variance proven in Proofs A.2, A.3, and A.4 can be reduced to the one in Proof A.2. Hence, $F$, equation (64), needs to be re-expressed to take into account $\rho_j^t$. All the $\tilde\nu_i$ are still independent from the $\nu_i$. Hence, we have:

$$F = \mathrm{Var}\big[\tilde\nu_i - \nu_i\big] \qquad (91)$$
$$= \mathrm{Var}\Big[\sum_{j\in J}\frac{M_j}{N}\rho_j^i\tilde\zeta_{j,i} - \sum_{j\in J}\frac{M_j}{N-M_K}\rho_j^i\zeta_{j,i}\Big]. \qquad (92)$$

Considering that $\rho_j^t \xrightarrow{t\to+\infty} 0$, we get:

$$F \xrightarrow{t\to+\infty} 0. \qquad (93)$$

Using the same reasoning as the one used for the expected value convergence in Proof A.2, we get that the SGD noise contribution linked to $F$ goes to 0 at infinity.

For the disguised free-riders, the $\epsilon_{k,t}$ are still independent Gaussian white noises, implying $\mathrm{Var}[\epsilon_{k,t}] = 1$. Hence, following a reasoning similar to the one in Proof A.2, we get:

$$\mathrm{Var}\Big[\sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}\frac{M_k}{N}\phi_k(i)\,\epsilon_{k,i}\Big] = \sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{2(t-i-1)}\frac{M_k^2}{N^2}\phi_k^2(i). \qquad (94)$$

Considering that $\phi_k(t) \xrightarrow{t\to+\infty} 0$, by using the same reasoning as for the proof of the expected value for free-riders, Section XX, we get:

$$\mathrm{Var}\Big[\sum_{i=0}^{t-1}\Big(\epsilon + \frac{M_K}{N}\Big)^{t-i-1}\frac{M_k}{N}\phi_k(i)\,\epsilon_{k,i}\Big] \xrightarrow{t\to+\infty} 0. \qquad (95)$$

Hence, we can conclude that

$$\mathrm{Var}\big[\tilde\theta^t - \theta^t\big] \xrightarrow{t\to+\infty} 0. \qquad (96)$$

B Complete Proofs for FedProx

FedProx is a generalization of FedAvg. As such, we use the proof done for FedAvg to prove the convergence of free-rider attacks using FedProx as an optimization solver. The L2 norm monitored by $\mu$ changes the gradient to $g_j(\theta_j) \simeq r_j[\theta_j - \theta_j^*] + \mu[\theta_j - \theta^t]$. Using equation (7), we then get:

$$d\theta_j = -\lambda\big[r_j[\theta_j - \theta_j^*] + \mu[\theta_j - \theta^t]\big]\,du + \frac{\lambda}{\sqrt{S}}\sigma_j(\theta_j)\,dW_u, \qquad (97)$$

leading to

$$\theta_j(u) = e^{-\lambda[r_j+\mu]u}\theta_j(0) + \frac{r_j\theta_j^* + \mu\theta^t}{r_j+\mu}\big[1 - e^{-\lambda(r_j+\mu)u}\big] + \frac{\lambda}{\sqrt{S}}\int_{x=0}^{u} e^{-\lambda(r_j+\mu)(u-x)}\sigma_j(\theta_j)\,dW_x. \qquad (98)$$

Considering that $\theta_j(0) = \theta^t$, $\theta_j\big(\frac{EM_j}{S}\big) = \theta_j^{t+1}$, and $\sigma_j(\theta_j) = \sigma_j^t$, we get:

$$\theta_j^{t+1} = \gamma_j\theta^t + \frac{r_j\theta_j^* + \mu\theta^t}{r_j+\mu}\big[1 - \gamma_j\big] \qquad (99)$$
$$+ \frac{\lambda}{\sqrt{S}}\int_{x=0}^{\frac{EM_j}{S}} e^{-\lambda(r_j+\mu)(\frac{EM_j}{S}-x)}\sigma_j^t\,dW_x, \qquad (100)$$

where $\gamma_j = e^{-\lambda[r_j+\mu]\frac{EM_j}{S}}$. We can reformulate this as

$$\theta_j^{t+1} = \Big[\gamma_j + \mu\frac{1-\gamma_j}{r_j+\mu}\Big]\theta^t + \frac{r_j}{r_j+\mu}\big[1-\gamma_j\big]\theta_j^* \qquad (101)$$
$$+ \frac{\lambda}{\sqrt{S}}\int_{x=0}^{\frac{EM_j}{S}} e^{-\lambda(r_j+\mu)(\frac{EM_j}{S}-x)}\sigma_j^t\,dW_x. \qquad (102)$$

The SGD noise variance between two server iterations for FedProx is:

$$\mathrm{Var}\big[\theta_j^{t+1}\,\big|\,\theta^t\big] = \underbrace{\frac{\lambda}{S}{\sigma_j^t}^2\,\frac{1}{2(r_j+\mu)}\Big[1 - e^{-2\lambda(r_j+\mu)\frac{EM_j}{S}}\Big]}_{{\rho_j^t}^2}. \qquad (103)$$

We also define $\eta_j' = \gamma_j + \mu\frac{1-\gamma_j}{r_j+\mu}$ and $\delta_j = \frac{r_j}{r_j+\mu}\big[1-\gamma_j\big]$. For FedAvg, $\mu = 0$, and we get $\eta_j' = \eta_j$ and $\delta_j = 1 - \eta_j$. By property of the exponential, $\gamma_j \in ]0,1[$. As $r_j$ and $\mu$ are non-negative, $\eta_j' \in ]0,1[$, like $\eta_j$ for FedAvg.

Theorem 1 for FedProx

We consider (ρ′_j)² = (λ/S) σ_j² (1/(2(r_j+µ))) [1 − e^{−2λ(r_j+µ)EM_j/S}]. Using the same reasoning by induction as in Proof A.1, we get:

θ̃_t − θ_t = Σ_{i=0}^{t−1} (ε′ + M_K/N)^{t−i−1} g(θ_i) + Σ_{i=0}^{t−1} (ε′ + M_K/N)^{t−i−1} (ν̃′_i − ν′_i),   (104)

with g(θ_t) = (M_K/N)[θ_t − Σ_{j∈J} (M_j/(N−M_K)) [η′_j θ_t + δ_j θ*_j]], ε′ = Σ_{j∈J} (M_j/N) η′_j, ν′_t = Σ_{j∈J} (M_j/(N−M_K)) ρ′_j ζ_{j,t} and ν̃′_t = Σ_{j∈J} (M_j/N) ρ′_j ζ̃_{j,t}.

Theorem 2 for FedProx

As for FedAvg, we make the assumption that federated learning without free-riders using FedProx converges. In addition, ν̃′_t and ν′_t are also independent delta-correlated Gaussian white noises. Following the same proof as in Proof A.2, we thus get:

lim_{t→+∞} E[θ̃_t − θ_t] = 0,   (105)

and

Var[θ̃_t − θ_t] → [1/N² + 1/(N−M_K)²] · ( Σ_{j∈J} (M_j ρ′_j)² ) / ( 1 − (ε′ + M_K/N)² ) as t → +∞.   (106)

The asymptotic variance still strictly increases with MK .

Note: We introduce x = λ(r_j + µ)EM_j/S. By taking the partial derivative of (ρ′_j)² with respect to µ, we get:

∂(ρ′_j)²/∂µ = (λ/(2S)) σ_j² (1/(r_j+µ)²) [−1 + (1 + 2x)e^{−2x}],   (107)

which is strictly negative for positive µ, considering that all the other constants are positive. Hence, the SGD noise variance (ρ′_j)² strictly decreases with the regularization factor µ.
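The sign in (107) reduces to the bracketed factor h(x) = −1 + (1 + 2x)e^{−2x}, which is negative for all x > 0 (equivalently, e^{2x} > 1 + 2x). A quick numeric scan over several orders of magnitude of x (grid values are illustrative):

```python
import math

# The sign of (107) is governed by h(x) = -1 + (1 + 2x) * exp(-2x),
# which should be strictly negative for x = lambda (r_j + mu) E M_j / S > 0.
h = lambda x: -1.0 + (1.0 + 2.0 * x) * math.exp(-2.0 * x)
xs = [10.0 ** k for k in range(-3, 3)]  # illustrative grid of positive x values
values = [h(x) for x in xs]
```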

Similarly, for ε′, by considering that η′_j can be rewritten as η′_j = γ_j r_j/(r_j+µ) + µ/(r_j+µ), the partial derivative of η′_j with respect to µ can be expressed as:

∂η′_j/∂µ = (r_j/(r_j+µ)²) [1 − (1 + x)e^{−x}],   (108)

which is strictly positive. Hence η′_j is strictly increasing with the regularization µ, and so is ε′.

Considering the behaviours of ε′ and ρ′_j with respect to the regularization term µ, the more regularization the server requires, the smaller the asymptotic variance, leading to more accurate free-riding attacks.
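This combined effect can be illustrated by evaluating the asymptotic variance (106) as a function of µ for a toy configuration. All constants below (two fair clients, one free-rider, λ, S, σ_j, r_j) are assumed for illustration; they are not the paper's experimental settings:

```python
import math

# Evaluate the asymptotic variance (106) for a toy setup as mu grows.
# Two fair clients with M_j = 10 samples each, one free-rider declaring
# M_K = 5 samples. All constants are illustrative assumptions.
lam, S, E, sigma = 1.0, 10.0, 1, 1.0
fair_M, r = [10, 10], 1.0
MK = 5
N = sum(fair_M) + MK

def asymptotic_variance(mu):
    eps_prime, weighted_rho = 0.0, 0.0
    for M in fair_M:
        x = lam * (r + mu) * E * M / S
        gamma = math.exp(-x)
        eta_p = gamma + mu * (1 - gamma) / (r + mu)          # eta'_j
        rho2 = lam / S * sigma ** 2 / (2 * (r + mu)) * (1 - math.exp(-2 * x))
        eps_prime += M / N * eta_p                           # eps' = sum (M_j/N) eta'_j
        weighted_rho += M ** 2 * rho2                        # sum (M_j rho'_j)^2
    prefactor = 1 / N ** 2 + 1 / (N - MK) ** 2
    return prefactor * weighted_rho / (1 - (eps_prime + MK / N) ** 2)

v0, v1, v5 = asymptotic_variance(0.0), asymptotic_variance(1.0), asymptotic_variance(5.0)
```

For this configuration the variance shrinks as µ increases from 0 to 1 to 5, consistent with the claim above.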

Theorem 3 for FedProx

The free-riders mimic the behaviour of the fair clients. Hence, we get:

(ϕ′_k)² = (λ/S) σ_k² (1/(2(r_k+µ))) [1 − e^{−2λ(r_k+µ)EM_k/S}],   (109)

leading to

Var[θ̃_t − θ_t] → [1/N² + 1/(N−M_K)²] · ( Σ_{j∈J} (M_j ρ′_j)² ) / ( 1 − (ε′ + M_K/N)² ) + ( 1 / (1 − (ε′ + M_K/N)²) ) (M_K²/N²) ϕ′² as t → +∞.   (110)

For disguised free-riders, the variance also decreases with the regularization parameter µ.

Corollary 1 for FedProx

Similarly, for many free-riders, we get:

Var[θ̃_t − θ_t] → [1/N² + 1/(N−M_K)²] · ( Σ_{j∈J} (M_j ρ′_j)² ) / ( 1 − (ε′ + M_K/N)² ) + ( 1 / (1 − (ε′ + M_K/N)²) ) (M_K²/N²) Σ_{k∈K} (ϕ′_k)² as t → +∞.   (111)


C Additional experimental results

C.1 Accuracy Performances


[Figure: test accuracy (%) vs. server aggregation rounds. Columns: MNIST-iid, MNIST-shard, CIFAR-10, Shakespeare; rows: 1, 5, and 45 free-riders. Curves: Only Fair, Plain, and Disguised attacks with σ or 3σ and γ ∈ {0.5, 1, 2}.]

Figure 3: Accuracy performances for FedAvg and 20 epochs in the different experimental scenarios.

[Figure: test accuracy (%) vs. server aggregation rounds. Columns: MNIST-iid, MNIST-shard, CIFAR-10, Shakespeare; rows: 1, 5, and 45 free-riders. Curves: Only Fair, Plain, and Disguised attacks with σ or 3σ and γ ∈ {0.5, 1, 2}.]

Figure 4: Accuracy performances for FedAvg and 5 epochs in the different experimental scenarios.


[Figure: test accuracy (%) vs. server aggregation rounds. Columns: MNIST-iid, MNIST-shard, Shakespeare; rows: 1, 5, and 45 free-riders. Curves: Only Fair, Plain, and Disguised attacks with σ or 3σ and γ ∈ {0.5, 1, 2}.]

Figure 5: Accuracy performances for FedProx and 20 epochs in the different experimental scenarios.

[Figure: test accuracy (%) vs. server aggregation rounds. Columns: MNIST-iid, MNIST-shard, Shakespeare; rows: 1, 5, and 45 free-riders. Curves: Only Fair, Plain, and Disguised attacks with σ or 3σ and γ ∈ {0.5, 1, 2}.]

Figure 6: Accuracy performances for FedProx and 5 epochs in the different experimental scenarios.


[Figure: training loss (log scale) vs. server aggregation rounds. Columns: MNIST-iid, MNIST-shard, CIFAR-10, Shakespeare; rows: 1, 5, and 45 free-riders. Curves: Only Fair, Plain, and Disguised attacks with σ or 3σ and γ ∈ {0.5, 1, 2}.]

Figure 7: Loss performances for FedAvg and 20 epochs in the different experimental scenarios.

[Figure: training loss (log scale) vs. server aggregation rounds. Columns: MNIST-iid, MNIST-shard, CIFAR-10, Shakespeare; rows: 1, 5, and 45 free-riders. Curves: Only Fair, Plain, and Disguised attacks with σ or 3σ and γ ∈ {0.5, 1, 2}.]

Figure 8: Loss performances for FedAvg and 5 epochs in the different experimental scenarios.


[Figure: training loss (log scale) vs. server aggregation rounds. Columns: MNIST-iid, MNIST-shard, Shakespeare; rows: 1, 5, and 45 free-riders. Curves: Only Fair, Plain, and Disguised attacks with σ or 3σ and γ ∈ {0.5, 1, 2}.]

Figure 9: Loss performances for FedProx and 20 epochs in the different experimental scenarios.

[Figure: training loss (log scale) vs. server aggregation rounds. Columns: MNIST-iid, MNIST-shard, Shakespeare; rows: 1, 5, and 45 free-riders. Curves: Only Fair, Plain, and Disguised attacks with σ or 3σ and γ ∈ {0.5, 1, 2}.]

Figure 10: Loss performances for FedProx and 5 epochs in the different experimental scenarios.