
Locally Private Bayesian Inference for Count Models

Aaron Schein 1 Zhiwei Steven Wu 2 Mingyuan Zhou 3 Hanna Wallach 2

Abstract

As more aspects of social interaction are digitally recorded, there is a growing need to develop privacy-preserving data analysis methods. Social scientists will be more likely to adopt these methods if doing so entails minimal change to their current methodology. Toward that end, we present a general and modular method for privatizing Bayesian inference for Poisson factorization, a broad class of models that contains some of the most widely used models in the social sciences. Our method satisfies local differential privacy, which ensures that no single centralized server need ever store the non-privatized data. To formulate our local-privacy guarantees, we introduce and focus on limited-precision local privacy—the local privacy analog of limited-precision differential privacy (Flood et al., 2013). We present two case studies, one involving social networks and one involving text corpora, that test our method's ability to form the posterior distribution over latent variables under different levels of noise, and demonstrate our method's utility over a naïve approach, wherein inference proceeds as usual, treating the privatized data as if it were not privatized.

1. Introduction

Data from social processes often take the form of discrete observations (e.g., edges in a social network, word tokens in an email) and these observations often contain sensitive information about the people involved. As more aspects of social interaction are digitally recorded, the opportunities for social scientific insights grow; however, so too does the risk of unacceptable privacy violations. As a result, there is a growing need to develop privacy-preserving data analysis methods.

In practice, social scientists will be more likely to adopt these methods if doing so entails minimal change to their current methodology. Toward that end, under the framework of differential privacy (Dwork et al., 2006), we present a method for privatizing Bayesian inference for Poisson factorization (Titsias, 2008; Cemgil, 2009; Zhou & Carin, 2012; Gopalan & Blei, 2013; Paisley et al., 2014), a broad class of models for learning latent structure from discrete data. This class contains some of the most widely used models in the social sciences, including topic models for text corpora (Blei et al., 2003; Buntine & Jakulin, 2004; Canny, 2004), genetic population models (Pritchard et al., 2000), stochastic block models for social networks (Ball et al., 2011; Gopalan & Blei, 2013; Zhou, 2015), and tensor factorization for dyadic data (Welling & Weber, 2001; Chi & Kolda, 2012; Schmidt & Morup, 2013; Schein et al., 2015; 2016b); it further includes deep hierarchical models (Ranganath et al., 2015; Zhou et al., 2015), dynamic models (Charlin et al., 2015; Acharya et al., 2015; Schein et al., 2016a), and many others. Our method is general and modular, allowing social scientists to build on (instead of replace) their existing derivations and implementations of non-private Poisson factorization.

To derive our method, we rely on a novel reinterpretation of the geometric mechanism (Ghosh et al., 2012), as well as a previously unknown general relationship between the Skellam (Skellam, 1946), Bessel (Yuan & Kalbfleisch, 2000), and Poisson distributions; we note that these new results may be of independent interest in other contexts.

1 University of Massachusetts Amherst, 2 Microsoft Research, New York, 3 University of Texas at Austin. Correspondence to: Aaron Schein <[email protected]>, Zhiwei Steven Wu <[email protected]>, Mingyuan Zhou <[email protected]>, Hanna Wallach <[email protected]>.

Our method satisfies a strong variant of differential privacy—i.e., local privacy—under which the sensitive data is privatized (or noised) via a randomized response method before inference. This ensures that no single centralized server need ever store the non-privatized data—a condition that is non-negotiable in many real-world settings. The key challenge introduced by local privacy is how to infer the latent variables (including model parameters) given the privatized data. One option is a naïve approach, wherein inference proceeds as usual, treating the privatized data as if it were not privatized. In the context of maximum likelihood estimation, the naïve approach has been shown to exhibit pathologies when observations are discrete or count-valued; researchers have therefore advocated for treating the non-privatized observations as latent variables to be inferred (Yang et al., 2012; Karwa et al., 2014; Bernstein et al., 2017). We embrace this approach and extend it to Bayesian inference, where our aim is to form the posterior distribution over the latent variables conditioned on the privatized data and the randomized response method; our method is asymptotically guaranteed to draw samples from this posterior distribution.

arXiv:1803.08471v1 [stat.ML] 22 Mar 2018


[Figure 1: heatmaps over word types v and documents d, comparing the true topics φ*_k with the non-private estimate φ̂_k, our estimate φ̂_k, and the naïve estimate φ̂_k, under (a) no noise, (b) low noise, and (c) high noise.]

Figure 1. Topic recovery: our method vs. the naïve approach. (a) We generated the non-privatized data synthetically so that the true topics were known. We then privatized the data using (b) a low noise level and (c) a high noise level. The heatmap in each subfigure visualizes the data, using red to denote positive counts and blue to denote negative counts. With a high noise level, the naïve approach overfits the noise and therefore fails to recover the true topics. We describe this experiment in more detail in section 5.2.


We present two case studies applying our method to 1) overlapping community detection in social networks and 2) topic modeling for text corpora. In order to formulate our local-privacy guarantees, we introduce and focus on limited-precision local privacy—the local privacy analog of limited-precision differential privacy, originally proposed by Flood et al. (2013). For each case study, we report a suite of experiments that test our method's ability to form the posterior distribution over latent variables under different levels of noise. These experiments also demonstrate the utility of our method over the naïve approach for both case studies; we provide an illustrative example in figure 1.

2. Background and problem formulation

Differential privacy. Differential privacy (Dwork et al., 2006) is a rigorous privacy criterion that guarantees that no single observation in a data set will have a significant influence on the information obtained by analyzing that data set.

Definition 2.1. A randomized algorithm A(·) satisfies ε-differential privacy if for all pairs of neighboring data sets Y and Y′ that differ in only a single observation,

    P(A(Y) ∈ S) ≤ e^ε P(A(Y′) ∈ S)    (1)

for all subsets S in the range of A(·).

Local differential privacy. We focus on local differential privacy, which we refer to as local privacy. In this setting, the observations remain private from even the data analysis algorithm. The algorithm only sees privatized versions of the observations, often constructed by adding noise from specific distributions. The process of adding noise is known as randomized response—a reference to survey-sampling methods originally developed in the social sciences prior to the development of differential privacy (Warner, 1965).

Definition 2.2. A randomized response method R(·) is ε-private if for all pairs of observations y, y′ ∈ Y,

    P(R(y) ∈ S) ≤ e^ε P(R(y′) ∈ S)    (2)

for all subsets S in the range of R(·). If a data analysis algorithm sees only the observations' ε-private responses, then the data analysis itself satisfies ε-local privacy.

Limited-precision local privacy. Definition 2.2 requires that condition 2 hold for all pairs of observations y, y′ ∈ Y. In practice, this is notoriously difficult to achieve when Y is extremely large, meaning that any pair of observations may be arbitrarily different, as is often the case with data from social processes. We therefore introduce and focus on limited-precision local privacy—the local privacy analog of limited-precision differential privacy, originally proposed by Flood et al. (2013) and subsequently used to privatize analyses of geographic location data (Andres et al., 2013) and financial network data (Papadimitriou et al., 2017). Although limited-precision local privacy is weaker than local privacy, it can still provide reasonably strong guarantees.

Definition 2.3. If N is a positive integer, then a randomized response method R(·) is (N, ε)-private if for all pairs of observations y, y′ ∈ Y such that ‖y − y′‖₁ ≤ N,

    P(R(y) ∈ S) ≤ e^ε P(R(y′) ∈ S)    (3)

for all subsets S in the range of R(·). If a data analysis algorithm sees only the observations' (N, ε)-private responses, then the data analysis itself satisfies (N, ε)-limited-precision local privacy. If ‖y‖₁ ≤ N for all y ∈ Y, then (N, ε)-limited-precision local privacy implies ε-local privacy.


Geometric mechanism. There are several standard randomized response methods in the differential privacy toolbox, many of which involve adding independently generated noise to each element of each observation. Unfortunately, the most commonly used noise mechanisms—the Gaussian and Laplace mechanisms—are poor choices for count data because they involve real-valued distributions. We therefore focus on the geometric mechanism (Ghosh et al., 2012), which can be viewed as the discrete analog of the Laplace mechanism. The geometric mechanism adds noise drawn from a two-sided geometric distribution to each element of each observation. A two-sided geometric random variable τ ∼ 2Geo(α) is an integer τ ∈ Z. The PMF for the two-sided geometric distribution is as follows:

    2Geo(τ; α) = ((1 − α) / (1 + α)) α^|τ|.    (4)

Theorem 2.4. (Proof in appendix.) If N is a positive integer and randomized response method R(·) is the geometric mechanism with parameter α, then for any pair of observations y, y′ ∈ Y such that ‖y − y′‖₁ ≤ N, R(·) satisfies

    P(R(y) ∈ S) ≤ e^ε P(R(y′) ∈ S)    (5)

for all subsets S in the range of R(·), where

    ε = N ln(1/α).    (6)

Therefore, the geometric mechanism with parameter α is an (N, ε)-private randomized response method with ε = N ln(1/α). If a data analysis algorithm sees only the observations' (N, ε)-private responses, then the data analysis itself satisfies (N, ε)-limited-precision local privacy.
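As a concrete illustration, the mechanism takes only a few lines of Python. This is a minimal sketch, not the paper's released code; the function names are ours. It samples 2Geo(α) noise as the difference of two iid geometric variables on {0, 1, ...} with success probability 1 − α, which reproduces the PMF in equation 4, and sets α = exp(−ε/N) per theorem 2.4.

```python
import numpy as np

def two_sided_geometric(alpha, size, rng):
    """Sample 2Geo(alpha) noise as the difference of two iid geometric
    variables on {0, 1, ...} with success probability 1 - alpha."""
    p = 1.0 - alpha
    return (rng.geometric(p, size) - 1) - (rng.geometric(p, size) - 1)

def geometric_mechanism(y, epsilon, N, rng):
    """Privatize integer counts y; by theorem 2.4, alpha = exp(-epsilon / N)
    yields an (N, epsilon)-private randomized response."""
    alpha = np.exp(-epsilon / N)
    return y + two_sided_geometric(alpha, np.shape(y), rng)

rng = np.random.default_rng(0)
y = rng.poisson(3.0, size=10)                    # toy non-privatized counts
y_priv = geometric_mechanism(y, epsilon=1.0, N=3, rng=rng)
print(y_priv)                                    # may contain negative values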

Differentially private Bayesian inference. In Bayesian statistics, we begin with a probabilistic model M that relates observable variables Y to latent variables Z via a joint distribution P_M(Y, Z). The goal of inference is then to compute the posterior distribution P_M(Z | Y) over the latent variables conditioned on observed values of Y. The posterior is almost always analytically intractable and thus inference involves approximating it. The two most common methods of approximate Bayesian inference are variational inference, wherein we fit the parameters of an approximating distribution Q(Z | Y), and Markov chain Monte Carlo (MCMC), wherein we approximate the posterior with a finite set of samples {Z^(s)}_{s=1}^S generated via a Markov chain whose stationary distribution is the exact posterior. We can conceptualize each of these methods as a randomized algorithm A(·) that returns an approximation to the posterior distribution P_M(Z | Y); in general A(·) does not satisfy ε-differential privacy. However, if A(·) is an MCMC algorithm that returns a single sample from the posterior, it guarantees privacy (Dimitrakakis et al., 2014; Wang et al., 2015; Foulds et al., 2016). Adding noise to posterior samples can also guarantee privacy (Zhang et al., 2016), though this set of noised samples {Z̃^(s)}_{s=1}^S collectively approximates some distribution P̃_M(Z | Y) that depends on ε and is different than the exact posterior (but close, in some sense, and equal when ε → ∞). For specific models, we can also noise the transition kernel of the MCMC algorithm to construct a Markov chain whose stationary distribution is again not the exact posterior, but something close that guarantees privacy (Foulds et al., 2016). We can also take an analogous approach to privatize variational inference, wherein we add noise to the sufficient statistics computed in each iteration (Park et al., 2016).

Locally private Bayesian inference. We first formalize the general objective of Bayesian inference under local privacy. Given a generative model M for non-privatized data Y and latent variables Z with joint distribution P_M(Y, Z), we further assume a randomized response method R(·) that generates privatized data sets: Ỹ ∼ P_R(Ỹ | Y). The aim of Bayesian inference is then to form the following posterior:

    P_{M,R}(Z | Ỹ) = E_{P_R(Y | Ỹ)}[P_M(Z | Y)]
                   = ∫ P_M(Z | Y) P_R(Y | Ỹ) dY.    (7)

This distribution correctly characterizes our uncertainty about the latent variables Z, conditioned on all of our observations and assumptions—i.e., the privatized data Ỹ, the model M, and the randomized response method R. The expansion in equation 7 shows that this posterior implicitly treats the non-privatized data Y as a latent variable and marginalizes over it using the mixing distribution P_R(Y | Ỹ), which is itself a posterior that characterizes our uncertainty about Y given Ỹ and the randomized response method. The key observation here is that if we can generate samples from P_R(Y | Ỹ), we can use them to approximate the expectation in equation 7, assuming that we already have a method for approximating the non-private posterior P_M(Z | Y). In the context of MCMC, alternating between sampling values of the non-privatized data from its complete conditional—i.e., Y^(s) ∼ P_{M,R}(Y | Z^(s−1), Ỹ)—and sampling values of the latent variables—i.e., Z^(s) ∼ P_M(Z | Y^(s))—constitutes a Markov chain whose stationary distribution is P_{M,R}(Z, Y | Ỹ). In scenarios where we already have derivations and implementations for sampling from P_M(Z | Y), we need only be able to sample efficiently from P_{M,R}(Y | Z, Ỹ) in order to obtain a locally private Bayesian inference algorithm; whether we can do this depends heavily on our assumptions about M and R.
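The alternation just described has a simple shape in code. The following Python sketch is schematic; `sample_data` (for P_{M,R}(Y | Z, Ỹ)) and `sample_latents` (for P_M(Z | Y)) are hypothetical placeholders for model-specific samplers, and the names are ours rather than from any released implementation.

```python
def locally_private_mcmc(Y_priv, Z_init, sample_data, sample_latents, n_iters):
    """Alternate between the two complete conditionals described above.
    `sample_data(Z, Y_priv)` draws Y ~ P(Y | Z, Y~); `sample_latents(Y)`
    draws Z ~ P(Z | Y). Both are model-specific placeholders."""
    Z, samples = Z_init, []
    for _ in range(n_iters):
        Y = sample_data(Z, Y_priv)    # impute the non-privatized data
        Z = sample_latents(Y)         # unchanged non-private inference step
        samples.append(Z)
    return samples                    # asymptotically: samples from P(Z | Y~)
```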

We note that the objective of Bayesian inference under local privacy, as defined in equation 7, is similar to that of Williams & McSherry (2010), who identify their key barrier to inference as being unable to analytically form the marginal likelihood that links the privatized data to Z:

    P_{M,R}(Ỹ | Z) = ∫ P_R(Ỹ | Y) P_M(Y | Z) dY.    (8)

In the next sections, we show that if M is a Poisson factorization model and R is the geometric mechanism, then we can analytically form this marginal likelihood and derive an efficient MCMC algorithm that is asymptotically guaranteed to generate samples from the posterior in equation 7.

3. Locally private Poisson factorization

Poisson factorization. We assume that Y is a count-valued data set. We further assume that each count y_n ∈ Z₊ in this data set is an independent Poisson random variable y_n ∼ Pois(µ_n), where the count's latent rate parameter µ_n is a function of the latent variables Z. This class of models is known as Poisson factorization and, as described in section 1, includes many widely used models in social science. For example, the mixed-membership stochastic block model for social networks (Ball et al., 2011; Gopalan & Blei, 2013; Zhou, 2015) corresponds to the case where Y is a V × V count matrix; n = (i, j), where i, j ∈ [V]; Z = {Θ, Π}; Θ and Π are V × C and C × C non-negative, real-valued matrices, respectively; and µ_ij = Σ_{c=1}^C Σ_{d=1}^C θ_ic θ_jd π_cd. Similarly, latent Dirichlet allocation (Blei et al., 2003)—a well-known topic model for text corpora—corresponds to the case where Y is a D × V count matrix; n = (d, v), where d ∈ [D] and v ∈ [V]; Z = {Θ, Φ}, where Θ and Φ are D × K and K × V non-negative, real-valued matrices, respectively; and µ_dv = Σ_{k=1}^K θ_dk φ_kv. In both cases, it is standard to assume independent gamma priors over the elements of the latent matrices that comprise Z; doing so facilitates efficient Bayesian inference of these matrices via gamma–Poisson conjugacy (when conditioned on Y).
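To make the generative assumptions concrete, here is a minimal sketch of the LDA-like parameterization above in Python; the dimensions and hyperparameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, K = 100, 50, 5                            # arbitrary illustrative sizes
a0, b0 = 0.1, 0.1                               # gamma shape and rate
Theta = rng.gamma(a0, 1.0 / b0, size=(D, K))    # theta_dk ~ Gamma(a0, b0)
Phi = rng.gamma(a0, 1.0 / b0, size=(K, V))      # phi_kv ~ Gamma(a0, b0)
Mu = Theta @ Phi                                # mu_dv = sum_k theta_dk phi_kv
Y = rng.poisson(Mu)                             # y_dv ~ Pois(mu_dv)
```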

Geometric mechanism. We focus on the geometric mechanism (Ghosh et al., 2012) because it is a natural choice for count data. By reinterpreting the geometric mechanism as involving Skellam noise and deriving a general relationship between the Skellam, Bessel, and Poisson distributions, we are able to obtain analytic tractability and efficient Bayesian inference while also maintaining local privacy guarantees. In particular, we show that augmenting our model with auxiliary variables λ_n = (λ_n1, λ_n2) allows us to analytically form the marginal likelihood P_{M,R}(y_n^(±) | µ_n, λ_n) and sample efficiently from P_{M,R}(y_n | y_n^(±), µ_n, λ_n), as desired.

Each non-privatized count y_n is generated by our model M—i.e., y_n ∼ Pois(µ_n)—and then privatized as follows:

    τ_n ∼ 2Geo(α),    y_n^(±) := y_n + τ_n.    (9)

We use (±) to emphasize that unlike y_n (which must be non-negative), y_n^(±) ∈ Z may be non-negative or negative.

Theorem 3.1. (Proof in appendix.) A two-sided geometric random variable τ ∼ 2Geo(α) can be generated as follows:

    λ₁, λ₂ ∼ Exp((1 − α)/α),    τ ∼ Skel(λ₁, λ₂),    (10)

where the Skellam distribution is the marginal distribution over the difference τ := g₁ − g₂ of two independent Poisson random variables g₁ ∼ Pois(λ₁) and g₂ ∼ Pois(λ₂).
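Theorem 3.1 is easy to check empirically. The following sketch (ours, not the paper's code) compares the mixture construction in equation 10 against the 2Geo PMF in equation 4; note that NumPy's exponential sampler is parameterized by the mean, so the rate (1 − α)/α becomes a scale of α/(1 − α).

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 0.6, 200_000
scale = alpha / (1.0 - alpha)                # NumPy takes the mean = 1/rate
lam1 = rng.exponential(scale, n)             # lambda_1 ~ Exp((1 - alpha)/alpha)
lam2 = rng.exponential(scale, n)
tau = rng.poisson(lam1) - rng.poisson(lam2)  # Skel(lambda_1, lambda_2)

for t in range(-3, 4):
    exact = (1 - alpha) / (1 + alpha) * alpha ** abs(t)   # equation 4
    print(t, round(float(np.mean(tau == t)), 4), round(exact, 4))
```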

Via theorem 3.1, we can express the generative process for y_n^(±) in three equivalent ways, shown in figure 2, each of which provides a unique and necessary insight. The first way (process 1) is useful for showing that our MCMC algorithm guarantees privacy, since two-sided geometric noise is an existing privacy mechanism. The second way (process 2) represents the two-sided geometric noise in terms of a pair of Poisson random variables with exponentially distributed rates; in so doing, it reveals the auxiliary variables that facilitate inference. The third way (process 3) marginalizes out all three Poisson random variables (including y_n), so that y_n^(±) is directly drawn from a Skellam distribution, which also happens to be the desired marginal likelihood P_{M,R}(y_n^(±) | µ_n, λ_n) under the geometric mechanism. To derive the second and third ways, we use theorem 3.1, the definition of the Skellam distribution, and the additive property of two or more Poisson random variables.

4. MCMC algorithm

We now rely on a previously unknown general relationship between the Skellam, Bessel, and Poisson distributions to derive an efficient way to draw samples from P_{M,R}(y_n | y_n^(±), µ_n, λ_n). As explained in section 2, this is all we need to obtain a locally private MCMC algorithm for drawing samples of the latent variables given the privatized data, provided we already have a way to draw samples of the latent variables given the non-private data. The input to this MCMC algorithm is the privatized data set Ỹ^(±).

Theorem 4.1. (Proof in appendix.) Consider two Poisson random variables y₁ ∼ Pois(λ₁) and y₂ ∼ Pois(λ₂). Their minimum m := min{y₁, y₂} and their difference δ := y₁ − y₂ are deterministic functions of y₁ and y₂. However, if not conditioned on y₁ and y₂, the random variables m and δ can be marginally generated as follows:

    δ ∼ Skel(λ₁, λ₂),    m ∼ Bes(|δ|, 2√(λ₁λ₂)).    (11)

Yuan & Kalbfleisch (2000) give details of the Bessel distribution, which can be sampled efficiently (Devroye, 2002).1

1 We have released our implementation of Bessel sampling. It is the only open-source version of which we are aware.
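Since Bessel sampling is less standard than the other distributions used here, the following Python sketch (ours, distinct from the released implementation mentioned in the footnote) draws Bes(ν, a) by direct inverse-CDF sampling over a truncated support. It is a simple but slower alternative to the sampler of Devroye (2002); the truncation bound is an assumption that should comfortably exceed the mass of the distribution, and the code assumes a > 0.

```python
import numpy as np
from scipy.special import gammaln

def sample_bessel(nu, a, rng, truncation=1000):
    """Draw m ~ Bes(nu, a), a > 0, by normalizing the unnormalized log-PMF
    log f(m) = (2m + nu) log(a/2) - log m! - log Gamma(m + nu + 1)
    over m = 0, ..., truncation - 1 and sampling from the result."""
    m = np.arange(truncation)
    log_w = (2 * m + nu) * np.log(a / 2.0) - gammaln(m + 1) - gammaln(m + nu + 1)
    w = np.exp(log_w - log_w.max())     # stabilize in log-space before normalizing
    return int(rng.choice(m, p=w / w.sum()))
```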


Figure 2. Three equivalent ways to generate y_n^(±):

    Process 1:  τ_n ∼ 2Geo(α);    y_n ∼ Pois(µ_n);    y_n^(±) := y_n + τ_n.

    Process 2:  λ_n1, λ_n2 ∼ Exp((1 − α)/α);    g_nl ∼ Pois(λ_nl) for l ∈ {1, 2};    y_n ∼ Pois(µ_n);    y_n^(±) := y_n + g_n1 − g_n2.

    Process 3:  λ_n1, λ_n2 ∼ Exp((1 − α)/α);    y_n^(±) ∼ Skel(λ_n1 + µ_n, λ_n2).

Theorem 4.1 means that we can generate two independent Poisson random variables by first generating their difference δ and then their minimum m. Because δ = y₁ − y₂, if δ is positive, then y₂ must be the minimum and thus y₁ = δ + m. In practice, this means that if we only get to observe the difference of two Poisson-distributed counts, we can still "recover" the counts by drawing a Bessel random variable.

Assuming that y_n^(±) ∼ Skel(λ_n1 + µ_n, λ_n2) via theorem 3.1, we can represent y_n^(±) explicitly as the difference between two latent non-negative counts: y_n^(±) = y_n^(+) − g_n2. We can then define the minimum of these latent counts to be m_n = min{y_n^(+), g_n2}. Given randomly initialized latent variables, we can then sample a value of m_n from its conditional posterior, which is a Bessel distribution:

    m_n | — ∼ Bes(|y_n^(±)|, 2√((λ_n1 + µ_n) λ_n2)).    (12)

Using this value, we can then compute y_n^(+) and g_n2:

    y_n^(+) := m_n,    g_n2 := y_n^(+) − y_n^(±)    if y_n^(±) ≤ 0,    (13)
    g_n2 := m_n,    y_n^(+) := g_n2 + y_n^(±)    otherwise.    (14)

Because y_n^(+) is the sum of y_n and g_n1—two independent Poisson random variables—we can then sample y_n from its conditional posterior, which is a binomial distribution:

    y_n | — ∼ Binom(y_n^(+), µ_n / (µ_n + λ_n1)).    (15)

Equations 12 through 15 constitute a way to draw samples from P_{M,R}(y_n | y_n^(±), µ_n, λ_n). Given a sampled Y, we can then draw samples of the latent variables from their conditional posteriors, which are the same as in non-private Poisson factorization. Finally, we can also sample λ_n1 and λ_n2:

    λ_nl | — ∼ Γ(1 + g_nl, (1 − α)/α + 1)    for l ∈ {1, 2}.    (16)

Equation 16 follows from gamma–Poisson conjugacy and the fact that the exponential prior over λ_nl can be expressed as a gamma prior with shape parameter equal to one—i.e., λ_nl ∼ Γ(1, (1 − α)/α). Equations 12–16, along with the conditional posteriors for the latent variables, define an MCMC algorithm that is asymptotically guaranteed to generate samples from P_{M,R}(Z | Ỹ^(±)) as desired.
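Putting equations 12 through 16 together, one sweep of the update for a single privatized count might be sketched in Python as follows. This is an illustrative reading of the equations, not the paper's implementation: `sample_bessel` is the sketch from earlier in this section, `y_pm` denotes y_n^(±), and the convention of passing the noise rates in and out as state is ours.

```python
import numpy as np

def resample_count(y_pm, mu_n, lam1, lam2, alpha, rng):
    """One sweep of equations 12-16 for a single privatized count y_pm."""
    # Eq. 12: the minimum of the two latent counts is Bessel-distributed.
    m_n = sample_bessel(abs(y_pm), 2.0 * np.sqrt((lam1 + mu_n) * lam2), rng)
    # Eqs. 13-14: recover y_n^(+) and g_n2 from the minimum and difference.
    if y_pm <= 0:
        y_plus, g2 = m_n, m_n - y_pm
    else:
        g2, y_plus = m_n, m_n + y_pm
    # Eq. 15: binomially thin y_n^(+) = y_n + g_n1 to recover y_n.
    y_n = rng.binomial(y_plus, mu_n / (mu_n + lam1))
    g1 = y_plus - y_n
    # Eq. 16: gamma-Poisson conjugate updates (NumPy's gamma takes a scale).
    scale = 1.0 / ((1.0 - alpha) / alpha + 1.0)
    return y_n, rng.gamma(1 + g1, scale), rng.gamma(1 + g2, scale)
```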

5. Case studies

We now present two case studies applying our method to 1) overlapping community detection in social networks and 2) topic modeling for text corpora. For each case study, we formulate local-privacy guarantees and ground them in illustrative examples. We then report a suite of experiments that test our method's ability to form the posterior distribution over latent variables for different types of data under different levels of noise. We focus on synthetic and semi-synthetic data to control for the effects of model mismatch (i.e., non-Poisson observations); although model mismatch is an important problem, it is outside the scope of this paper. Using synthetic and semi-synthetic data also allows us to vary high-level properties of the data (e.g., scale or sparsity).

Reference methods. We compare the performance of our method to two reference methods: 1) non-private Poisson factorization on the non-privatized data and 2) non-private Poisson factorization on the privatized data—i.e., the naïve approach, wherein inference proceeds as usual, treating the privatized data as if it were not privatized.2 Throughout our experiments, we use MCMC for both reference methods.

Performance measure. Ideally, we would directly compare our method's posterior distribution and the naïve approach's posterior distribution to that of non-private Poisson factorization on the non-privatized data. Unfortunately, all three posteriors are analytically intractable. However, because we use MCMC to approximate each posterior with a finite set of samples of the latent variables, we can instead form the expected value of µ_n with respect to each one—e.g.,

    µ̄_n = (1/S) Σ_{s=1}^S µ_n^(s) ≈ E_{P_{M,R}(Z | Ỹ^(±))}[µ_n].    (17)

Furthermore, because we focus on synthetic and semi-synthetic data, we can use an aggregate loss function to compare the expected values to the values used to generate the data: (1/N) Σ_{n=1}^N ℓ(µ̄_n, µ*_n), where µ*_n is the "true" value. We define ℓ(µ̄_n, µ*_n) to be the KL divergence of the Poisson distribution implied by µ̄_n from the Poisson distribution implied by µ*_n. Comparing the value of this aggregate loss function for our method and the value for the naïve approach to the value for non-private Poisson factorization provides a proxy for measuring the divergence of their posterior distributions from that of non-private Poisson factorization.

2 The naïve approach first truncates negative counts to zero and thus uses the truncated geometric mechanism (Ghosh et al., 2012).

[Figure 3: heatmaps comparing, for each of three privacy levels (ε = 2.5, ε = 1, ε = 0.75), the true data y_ij, the noised data y_ij^(±), the true µ*_ij, the non-private estimate µ̄_ij, our estimate µ̄_ij, and the naïve estimate µ̄_ij.]

Figure 3. Block structure recovery: our method vs. the naïve approach. We generated the non-privatized data synthetically. We then privatized the data using three different levels of noise. The top row depicts the data, using red to denote positive counts and blue to denote negative counts. As the noise level increases, the naïve approach overfits the noise and fails to recover the true µ*_ij values, predicting high values even for sparse parts of the matrix. In contrast, our method recovers the latent structure, even for high noise levels.
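For reference, the per-count loss ℓ defined above has a simple closed form. Writing it in the direction KL(Pois(µ̄_n) ‖ Pois(µ*_n)) (the text leaves the direction implicit, and the other direction is analogous):

    KL(Pois(µ̄_n) ‖ Pois(µ*_n)) = µ̄_n ln(µ̄_n / µ*_n) − µ̄_n + µ*_n,

which follows from taking the expectation of the log-ratio of the two Poisson PMFs under Pois(µ̄_n).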

5.1. Case study 1: Overlapping community detection

Organizations often want to know whether their employees are interacting as efficiently and productively as possible. For example, are there missing connections between employees that, if present, would significantly reduce duplication of effort? Do the natural "communities" that emerge from digitally recorded employee interactions match up with the formal organizational structure? To answer these and other questions, many organizations want to partner with social scientists in order to gain actionable insights based on their employees' interactions. However, sharing such interaction data increases the risk of privacy violations. Moreover, standard anonymization procedures can be reverse-engineered adversarially and thus do not provide privacy guarantees (Narayanan & Shmatikov, 2009). In contrast, the formal privacy guarantees provided by differential privacy may be sufficient for employees to consent to sharing their data.

Limited-precision local privacy. In this scenario, data set Y is a V × V count matrix, where each element y_ij ∈ Z₊ in this matrix is the number of interactions from actor i ∈ [V] to actor j ∈ [V]. A single observation in this data set is a single element. Via theorem 2.4, y_ij^(±) := y_ij + τ_ij, where τ_ij ∼ 2Geo(α), is (N, ε)-private, where N is the precision level and ε = N ln(1/α). Informally, this means that if the difference between two observations is N or less, then their privatized versions will be indistinguishable, provided ε is sufficiently small. Furthermore, if y_ij ≤ N, then its privatized version will be indistinguishable from the privatized version of y_ij = 0. For example, if i interacted with j three times (i.e., y_ij = 3) and N = 3, then an adversary would be unable to tell from y_ij^(±) whether i had interacted with j at all, provided ε is sufficiently small. We note that if y_ij ≫ N, then an adversary would be able to tell that i had interacted with j, though not the exact number of times.

Poisson factorization. As explained in section 3, the mixed-membership stochastic block model for learning latent overlapping community structure in social networks (Ball et al., 2011; Gopalan & Blei, 2013; Zhou, 2015) is a special case of Poisson factorization where Y is a V × V count matrix; n = (i, j), where i, j ∈ [V]; Z = {Θ, Π}; Θ and Π are V × C and C × C non-negative, real-valued matrices, respectively; and µ_ij = Σ_{c=1}^C Σ_{d=1}^C θ_ic θ_jd π_cd. The factors θ_ic and θ_jd represent how much actors i and j participate in communities c and d, respectively, while the factor π_cd represents how much actors in community c interact with actors in community d. It is standard to assume independent gamma priors over the factors—i.e., θ_ic, π_cd ∼ Gamma(a₀, b₀), where a₀ and b₀ are shape and rate hyperparameters, respectively.

Synthetic data. We generated social networks of V = 20 actors with C = 5 communities. We randomly generated the true parameters θ*_ic, π*_cd ∼ Γ(a₀, b₀), setting a₀ = 0.01 and b₀ = 0.5 to encourage sparsity; doing so exaggerates the block structure in the network. We then sampled a data set y_ij ∼ Pois(µ*_ij) and noised it with τ_ij ∼ 2Geo(α) for three increasing values of α.

[Figure 4: three panels plotting mean KL divergence against α = exp(−ε/N) for our method, the naïve method, and non-private Poisson factorization, at network sparsities of 14%, 28%, and 34% non-zero edges.]

Figure 4. Mean KL divergence of the Poisson distribution implied by µ̄_ij from the Poisson distribution implied by µ*_ij; lower is better. Error bars denote standard deviation across five replications. Each subfigure reports the results across nine values of α (higher values mean more noise) for a given setting of e₀, which controls the sparsity and magnitude of the count observations. As the counts grow larger and denser, the performance of both private methods approaches the performance of non-private Poisson factorization for small values of α. Our method almost always outperforms the naïve approach, with the difference being especially stark at higher noise levels.

Since the magnitude of the counts y_ij varied across trials, we allowed the three values of α to vary by setting the precision to the empirical mean of the data, N := Ê[y_ij], and setting α := exp(−ε/N) for three values of ε ∈ {2.5, 1, 0.75}. For each model, we ran 8,500 sampling iterations, saving every 25th sample after the first 1,000 and using these samples to compute µ̄_ij, as given in equation 17. In figure 3, we visually compare the estimates of µ̄_ij by our method and the naïve approach, under the three different noise levels, to the estimate by the non-private method and to the true values µ*_ij. We see that the naïve approach overfits the noise, predicting high rates in sparse parts of the matrix. In contrast, our method maintains a precise representation of the data even under high noise levels.

Enron. We processed the Enron corpus (Klimt & Yang, 2004) to obtain a V × V adjacency matrix Y, where y_ij is the number of emails sent from actor i to actor j. We included an actor if they sent at least one email and sent or received at least one hundred emails, yielding V = 161 actors. When an email included multiple recipients, we incremented the corresponding counts by one. We fit non-private Poisson factorization to the data Y and obtained point estimates of the factors θ*_ic and π*_cd. We then generated semi-synthetic data sets y_ij ∼ Pois(µ*_ij), where we fixed π*_cd but re-scaled θ*_ic = λ_i θ*_ic, allowing λ_i to vary. Rescaling changes the overall interaction rate of actor i without changing their relative frequency of participation across communities; this allows us to vary the scale and sparsity of the network while maintaining realistic community block structure. We randomly generated λ_i ∼ Γ(e₀f₀, f₀) for f₀ = 0.1 and varying e₀. Under this parameterization of the gamma distribution the mean is E[λ_i] = e₀; this allows us to increase the scale and density of the data. We generated five semi-synthetic data sets for three values of e₀ ∈ {0.75, 1, 1.5} and nine levels of noise α ∈ {0.1, ..., 0.9}. We applied our method and the naïve approach to each noised data set by running 15,000 sampling iterations and saving every 25th sample after the first 3,000 to compute µ̄_ij. In figure 4, we report the mean KL divergence of the Poisson distribution implied by µ̄_ij from the Poisson distribution implied by µ*_ij for both models.

5.2. Case study 2: Topic modeling

Topic models are widely used in the social sciences for learning latent topics (i.e., probability distributions over some vocabulary) from text corpora, often to characterize high-level thematic structure (e.g., Ramage et al., 2009; Grimmer & Stewart, 2013; Mohr & Bogdanov, 2013; Roberts et al., 2013). In many settings, these corpora contain sensitive information about the people involved (e.g., emails, survey responses). As a result, people may be unwilling to consent to sharing their data without formal privacy guarantees, such as those provided by differential privacy.

Limited-precision local privacy. In this scenario, data set Y is a D × V count matrix, where each element y_dv ∈ Z₊ in this matrix is the number of times word type v ∈ [V] occurred in document d ∈ [D]. Similar to the community-detection scenario, a single observation in this data set might correspond to a single element. In this case, if y_dv ≤ N, an adversary would be unable to tell from y_dv^(±) whether v occurred in d, provided ε is sufficiently small. However, a more natural interpretation would be to assume that a single observation is an entire document—i.e., y_d = (y_d1, ..., y_dV). In this case, if the ℓ₁ norm of the difference between two documents is N or less, then their privatized versions will be indistinguishable, provided ε is sufficiently small. For example, if N = 4, then the privatized version of an email that includes the sentence "I hate my boss" will be indistinguishable from that of an email without the sentence. We note that it is also natural to consider heterogeneous document-specific precision levels—i.e., N_d, leading to ε_d = N_d ln(1/α_d)—to enable the author of document d to choose how much to noise this document before sharing it. For example, if an author wanted to make sure that an adversary would be unable to tell that she wrote "surprise party" five times in an email, she would first set N_d := 5 · 2 = 10. Then, to achieve ε_d = 1, she would set y_dv^(±) := y_dv + τ_dv, where τ_dv ∼ 2Geo(α_d) and α_d = exp(−ε_d/N_d) ≈ 0.9.
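This per-document noising is a one-line variation on the mechanism from section 2. The sketch below (ours) reuses the hypothetical `two_sided_geometric` helper defined there; with N_d = 10 and ε_d = 1 it gives α_d = exp(−0.1) ≈ 0.905, matching the example above.

```python
import numpy as np

def privatize_document(y_d, epsilon_d, N_d, rng):
    """Noise one document's count vector at its own precision level N_d."""
    alpha_d = np.exp(-epsilon_d / N_d)   # e.g., N_d = 10, eps_d = 1 -> ~0.905
    return y_d + two_sided_geometric(alpha_d, y_d.shape, rng)
```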

Poisson factorization. As explained in section 3, latent Dirichlet allocation (Blei et al., 2003)—a well-known topic model for text corpora—is a special case of Poisson factorization where Y is a D × V count matrix; n = (d, v), where d ∈ [D] and v ∈ [V]; Z = {Θ, Φ}, where Θ and Φ are D × K and K × V non-negative, real-valued matrices, respectively; and µ_dv = Σ_{k=1}^K θ_dk φ_kv. The factor θ_dk represents how much topic k is used in document d, while the factor φ_kv represents how much word type v is used in topic k. Again, it is standard to assume independent gamma priors over the factors—i.e., θ_dk, φ_kv ∼ Gamma(a₀, b₀), where a₀ and b₀ are shape and rate hyperparameters, respectively.

Synthetic data. We generated a synthetic data set of D = 90 documents, with K = 3 topics and V = 15 word types. We set Φ* so that the topics were well separated, with each putting the majority of its mass on five different word types. We also ensured that the documents were well separated into three equal groups of thirty, with each putting the majority of its mass on a different topic. We then sampled a data set y_dv ∼ Pois(µ*_dv), where µ*_dv = Σ_{k=1}^K θ*_dk φ*_kv. We then generated a heterogeneously-noised data set by sampling the dth document's noise level α_d ∼ Beta(c α₀, c (1 − α₀)) from a beta distribution with mean α₀ and concentration parameter c = 10 and then sampling τ_dv ∼ 2Geo(α_d) for each word type v. We repeated this for a small and a large value of α₀. For each model, we ran 6,000 sampling iterations, saving every 25th sample after the first 1,000. We selected Φ̂ to be from the posterior sample with the highest joint probability. Note that, due to label-switching, we cannot average the samples of Φ. Following Newman et al. (2009), we then aligned the topic indices of Φ̂ to Φ* using the Hungarian bipartite matching algorithm. We visualize the results in figure 1, where we see that the naïve approach performs poorly at recovering the topics in the high noise case.
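The alignment step can be done with SciPy's Hungarian-algorithm solver. This sketch is ours (the cost function here is a simple L1 distance between topics, an illustrative choice; any divergence between topic vectors would do):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_topics(phi_hat, phi_true):
    """Permute phi_hat's rows (topics) to best match phi_true's rows,
    minimizing total L1 distance via the Hungarian algorithm."""
    cost = np.abs(phi_hat[:, None, :] - phi_true[None, :, :]).sum(axis=2)
    _, col = linear_sum_assignment(cost)  # phi_hat[i] matches phi_true[col[i]]
    return phi_hat[np.argsort(col)]
```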

Enron. For these experiments, we created a corpus by treating each sent email in the Enron corpus as a single document. After removing stopwords, we ran latent Dirichlet allocation (LDA)—a special case of Poisson factorization—using MALLET (McCallum, 2002) with default settings to obtain a point estimate of the parameters θ*_dk and φ*_kv. For each of the K = 25 most frequently used topics, we sub-selected the top 50% of word types v used by the topic and the top 5% of documents d that use the topic. The resultant data set included D = 1,000 documents and V = 1,400 word types. As in the previous study, we used a semi-synthetic experimental design to allow us to vary the length and density of documents and control for model mismatch. In each trial, we generated a data set y_dv ∼ Pois(µ*_dv), where we rescaled φ*_kv = λ_v φ*_kv and θ*_dk = λ_d θ*_dk, allowing λ_v and λ_d to vary across trials. We randomly generated λ_d, λ_v ∼ Γ(e₀f₀, f₀) for f₀ = 0.001 and varying e₀. We generated five semi-synthetic data sets for three values of e₀ ∈ {5, 10, 50} and nine levels of noise α ∈ {0.1, ..., 0.9}. We applied our method and the naïve approach to each data set by running 15,000 sampling iterations and saving every 25th sample after the first 3,000. Using these samples, we approximated the posterior mean of µ_dv and calculated the mean KL divergence of the Poisson distribution implied by µ̄_dv from the Poisson distribution implied by µ*_dv. We include a plot of these results in the appendix; they tell a similar story to those shown in figure 4.

6. Conclusion and future directions

We presented a general and modular method for privatizing Bayesian inference for Poisson factorization, a broad class of models that contains some of the most widely used models in the social sciences. Our method satisfies local differential privacy. To formulate our local-privacy guarantees, we introduced limited-precision local privacy—the local privacy analog of limited-precision differential privacy. Finally, via two case studies, we demonstrated our method's utility over a naïve approach, wherein inference proceeds as usual, treating the privatized data as if it were not privatized.

The key to our method is being able to efficiently sample values of the non-privatized data. We accomplish this by introducing auxiliary variables and exploiting special relationships between the Bessel, Skellam, and Poisson distributions to obtain a sequence of closed-form conditional distributions for every variable. A straightforward alternative to our approach is to sample from the unnormalized density P_{M,R}(y_n | y_n^(±), µ_n) ∝ P_{M,R}(y_n, y_n^(±) | µ_n) using black-box techniques (e.g., rejection sampling). Although this is conceptually simpler, there are many benefits of closed-formedness. Most importantly, our algorithm can easily be built on in the future to develop stochastic inference algorithms for massive data sets. When all complete conditionals are closed-form exponential families, there are simple recipes for translating MCMC algorithms into coordinate-ascent variational inference (CAVI) algorithms (Hoffman et al., 2013). For our method, all complete conditionals are well known to be exponential families, except for the Bessel distribution; however, we show that it is in theorem 6.1 below. This allows us to derive a full CAVI algorithm—we include the derivation of this algorithm in the appendix and leave for future work the development of a stochastic version that will scale to massive data sets.

Theorem 6.1. (Proof in appendix.) The Bessel distribution m ∼ Bes(ν, a) for fixed ν is an exponential family with sufficient statistic T(m) = 2m + ν, natural parameter η(a) = log(a/2), and base measure h(m) = 1/(m! Γ(m + ν + 1)).


ReferencesAcharya, Ayan, Ghosh, Joydeep, and Zhou, Mingyuan. Non-

parametric Bayesian factor analysis for dynamic countmatrices. arXiv:1512.08996, 2015.

Andres, Miguel E., Bordenabe, Nicolas E., Chatzikoko-lakis, Konstantinos, and Palamidessi, Catuscia. Geo-indistinguishability: Differential privacy for location-based systems. In Proceedings of the 2013 ACM SIGSACConference on Computer &#38; Communications Secu-rity, CCS ’13, pp. 901–914, New York, NY, USA, 2013.ACM. ISBN 978-1-4503-2477-9. doi: 10.1145/2508859.2516735. URL http://doi.acm.org/10.1145/

2508859.2516735.

Ball, Brian, Karrer, Brian, and Newman, Mark E. J. Effi-cient and principled method for detecting communities innetworks. Physical Review E, 84(3):036103, 2011.

Bernstein, Garrett, McKenna, Ryan, Sun, Tao, Sheldon,Daniel, Hay, Michael, and Miklau, Gerome. Differ-entially private learning of undirected graphical mod-els using collective graphical models. arXiv preprintarXiv:1706.04646, 2017.

Blei, David M., Ng, Andrew Y., and Jordan, Michael I.Latent Dirichlet allocation. Journal of Machine LearningResearch, 3:993–1022, 2003.

Braun, Michael and McAuliffe, Jon. Variational inferencefor large-scale models of discrete choice. Journal ofthe American Statistical Association, 105(489):324–335,2010.

Buntine, Wray and Jakulin, Aleks. Applying discrete PCAin data analysis. In Proceedings of the 20th Conferenceon Uncertainty in Artificial Intelligence, pp. 59–66, 2004.

Canny, John. GaP: a factor model for discrete data. InProceedings of the 27th Annual International ACM SIGIRConference on Research and Development in InformationRetrieval, pp. 122–129, 2004.

Cemgil, Ali Taylan. Bayesian inference for nonnegativematrix factorisation models. Computational Intelligenceand Neuroscience, 2009.

Charlin, Laurent, Ranganath, Rajesh, McInerney, James,and Blei, David M. Dynamic Poisson factorization. InProceedings of the 9th ACM Conference on RecommenderSystems, pp. 155–162, 2015.

Chi, Eric C. and Kolda, Tamara G. On tensors, sparsity,and nonnegative factorizations. SIAM Journal on MatrixAnalysis and Applications, 33(4):1272–1299, 2012.

Devroye, Luc. Simulating Bessel random variables. Statis-tics & Probability Letters, 57(3):249–257, 2002.

Dimitrakakis, Christos, Nelson, Blaine, Mitrokotsa, Aika-terini, and Rubinstein, Benjamin I. P. Robust and privateBayesian inference. In International Conference on Algo-rithmic Learning Theory, pp. 291–305, 2014.

Dwork, Cynthia, McSherry, Frank, Nissim, Kobbi, andSmith, Adam. Calibrating noise to sensitivity in pri-vate data analysis. In Proceedings of the 3rd Theory ofCryptography Conference, volume 3876, pp. 265–284,2006.

Flood, Mark, Katz, Jonathan, Ong, Stephen, and Smith,Adam. Cryptography and the economics of supervisoryinformation: Balancing transparency and confidentiality.2013. URL https://papers.ssrn.com/sol3/

papers.cfm?abstract_id=2354038#.

Foulds, James, Geumlek, Joseph, Welling, Max, and Chaud-huri, Kamalika. On the theory and practice of privacy-preserving Bayesian data analysis. 2016.

Ghosh, Arpita, Roughgarden, Tim, and Sundararajan,Mukund. Universally utility-maximizing privacy mecha-nisms. SIAM Journal on Computing, 41(6):1673–1693,2012.

Gopalan, Prem K. and Blei, David M. Efficient discoveryof overlapping communities in massive networks. Pro-ceedings of the National Academy of Sciences, 110(36):14534–14539, 2013.

Grimmer, Justin and Stewart, Brandon M. Text as data:The promise and pitfalls fo automatic content analysismethods for political texts. Political Analysis, pp. 1–31,2013.

Hoffman, Matthew D, Blei, David M, Wang, Chong, andPaisley, John. Stochastic variational inference. The Jour-nal of Machine Learning Research, 14(1):1303–1347,2013.

Johnson, Norman L., Kemp, Adrienne W., and Kotz, Samuel.Univariate discrete distributions. 2005.

Karwa, Vishesh, Slavkovic, Aleksandra B, and Krivitsky,Pavel. Differentially private exponential random graphs.In International Conference on Privacy in StatisticalDatabases, pp. 143–155. Springer, 2014.

Klimt, Bryan and Yang, Yiming. The enron corpus: A newdataset for email classification research. In EuropeanConference on Machine Learning, pp. 217–226. Springer,2004.

McCallum, Andrew Kachites. Mallet: A machine learningfor language toolkit. 2002.

Mohr, John and Bogdanov, Petko (eds.). Poetics: TopicModels and the Cultural Sciences, volume 41. 2013.

Page 10: Locally Private Bayesian Inference for Count Modelsaschein/ScheinWuZhouWallach... · 2019. 6. 6. · 2) topic modeling for text corpora. In order to formulate our local-privacy guarantees,

Locally Private Bayesian Inference for Count Models

Narayanan, Arvind and Shmatikov, Vitaly. De-anonymizingsocial networks. In Proceedings of the 2009 30th IEEESymposium on Security and Privacy, pp. 173–187, 2009.

Newman, David, Asuncion, Arthur, Smyth, Padhraic, andWelling, Max. Distributed algorithms for topic models.Journal of Machine Learning Research, 10(Aug):1801–1828, 2009.

Paisley, John, Blei, David M., and Jordan, Michael I.Bayesian nonnegative matrix factorization with stochas-tic variational inference. In Airoldi, Edoardo M., Blei,David M., Erosheva, Elena A., and Fienberg, Stephen E.(eds.), Handbook of Mixed Membership Models and TheirApplications, pp. 203–222. 2014.

Papadimitriou, Antonis, Narayan, Arjun, and Haeberlen,Andreas. Dstress: Efficient differentially private com-putations on distributed data. In Proceedings of theTwelfth European Conference on Computer Systems, Eu-roSys ’17, pp. 560–574, New York, NY, USA, 2017.ACM. ISBN 978-1-4503-4938-3. doi: 10.1145/3064176.3064218. URL http://doi.acm.org/10.1145/

3064176.3064218.

Park, Mijung, Foulds, James, Chaudhuri, Kamalika, andWelling, Max. Private topic modeling. arXiv:1609.04120,2016.

Pritchard, Jonathan K., Stephens, Matthew, and Donnelly,Peter. Inference of population structure using multilocusgenotype data. Genetics, 155(2):945–959, 2000.

Ramage, Daniel, Rosen, Evan, Chuang, Jason, Manning,Christopher D., and McFarland, Daniel A. Topic model-ing for the social sciences. In NIPS Workshop on Appli-cations for Topic Models, 2009.

Ranganath, Rajesh, Tang, Linpeng, Charlin, Laurent, andBlei, David. Deep exponential families. In Proceed-ings of the 18th International Conference on ArtificialIntelligence and Statistics, pp. 762–771, 2015.

Roberts, Margaret E., Stewart, Brandon M., Tingley, Dustin,and Airoldi, Edoardo M. The structural topic model andapplied social science. In NIPS Workshop on Topic Mod-els: Computation, Application, and Evaluation, 2013.

Schein, Aaron, Paisley, John, Blei, David M., and Wal-lach, Hanna. Bayesian Poisson tensor factorization forinferring multilateral relations from sparse dyadic eventcounts. In Proceedings of the 21st ACM SIGKDD Inter-national Conference on Knowledge Discovery and DataMining, pp. 1045–1054, 2015.

Schein, Aaron, Wallach, Hanna, and Zhou, Mingyuan.Poisson–gamma dynamical systems. In Advances in Neu-ral Information Processing Systems 29, pp. 5005–5013,2016a.

Schein, Aaron, Zhou, Mingyuan, Blei, David M., and Wal-lach, Hanna. Bayesian Poisson Tucker decompositionfor learning the structure of international relations. InProceedings of the 33rd International Conference on Ma-chine Learning, 2016b.

Schmidt, Mikkel N. and Morup, Morten. NonparametricBayesian modeling of complex networks: an introduction.IEEE Signal Processing Magazine, 30(3):110–128, 2013.

Skellam, John G. The frequency distribution of the differ-ence between two Poisson variates belonging to differentpopulations. Journal of the Royal Statistical Society, Se-ries A (General), 109:296, 1946.

Titsias, Michalis K. The infinite gamma–Poisson featuremodel. In Advances in Neural Information ProcessingSystems 21, pp. 1513–1520, 2008.

Ver Hoef, Jay M. Who invented the delta method? TheAmerican Statistician, 66(2):124–127, 2012.

Wang, Chong and Blei, David M. Variational inferencein nonconjugate models. Journal of Machine LearningResearch, 14(Apr):1005–1031, 2013.

Wang, Yu-Xiang, Fienberg, Stephen, and Smola, Alex. Pri-vacy for free: posterior sampling and stochastic gradientMonte Carlo. In Proceedings of the 32nd InternationalConference on Machine Learning, pp. 2493–2502, 2015.

Warner, Stanley L. Randomized response: a survey tech-nique for eliminating evasive answer bias. Journal of theAmerican Statistical Association, 60(309):63–69, 1965.

Welling, Max and Weber, Markus. Positive tensor factor-ization. Pattern Recognition Letters, 22(12):1255–1261,2001.

Williams, Oliver and McSherry, Frank. Probabilistic in-ference and differential privacy. In Advances in Neu-ral Information Processing Systems 23, pp. 2451–2459,2010.

Yang, Xiaolin, Fienberg, Stephen E, and Rinaldo, Alessan-dro. Differential privacy for protecting multi-dimensionalcontingency table data: Extensions and applications. Jour-nal of Privacy and Confidentiality, 4(1):5, 2012.

Yuan, Lin and Kalbfleisch, John D. On the Bessel distri-bution and related problems. Annals of the Institute ofStatistical Mathematics, 52(3):438–447, 2000.

Zhang, Zuhe, Rubinstein, Benjamin I. P., and Dimitrakakis,Christos. On the differential privacy of Bayesian infer-ence. In Proceedings of the 30th AAAI Conference onArtificial Intelligence, pp. 2365–2371, 2016.


Zhou, Mingyuan. Infinite edge partition models for overlapping community detection and link prediction. In Proceedings of the 18th Conference on Artificial Intelligence and Statistics, pp. 1135–1143, 2015.

Zhou, Mingyuan and Carin, Lawrence. Augment-and-conquer negative binomial processes. In Advances in Neural Information Processing Systems 25, pp. 2546–2554, 2012.

Zhou, Mingyuan, Cong, Yulai, and Chen, Bo. The Poisson gamma belief network. In Advances in Neural Information Processing Systems 28, pp. 3043–3051, 2015.


A. Additional figures

[Figure 5 (three panels): mean KL divergence (y-axis) versus $\alpha = \exp(-\epsilon/N)$ (x-axis, nine values from 0.1 to 0.9), one panel each for average document lengths of 25, 225, and 2500 tokens; each panel compares our method, the naïve method, and non-private inference.]

Figure 5. Mean KL divergence of the Poisson distribution implied by $\mu_{dv}$ from the Poisson distribution implied by $\mu^*_{dv}$; lower is better. Error bars denote standard deviation across five replications. Each subfigure reports the results across nine values of $\alpha$ (higher values mean more noise) for a given setting of $e_0$, which controls the sparsity and magnitude of the count observations. As the counts grow larger and denser, the performance of both private methods approaches the performance of non-private Poisson factorization for small values of $\alpha$. Our method almost always outperforms the naïve approach, with the difference being especially stark at higher noise levels.

[Figure 6 (two panels): probability mass functions over the integers $-80$ to $80$. Top panel: Skellam distributions $P(\tau \mid \lambda_1, \lambda_2)$ for $(\lambda_1, \lambda_2) \in \{(5,5),\, (100,100),\, (20,1),\, (30,40),\, (10,50)\}$. Bottom panel: two-sided geometric distributions $P(\tau \mid \alpha)$ for $\alpha \in \{0.90,\, 0.75,\, 0.50\}$.]

Figure 6. The two-sided geometric distribution (bottom) can be obtained by randomizing the parameters of the Skellam distribution (top). With fixed parameters, the Skellam distribution can be asymmetric and centered at a value other than zero; however, the two-sided geometric distribution is symmetric and centered at zero. It is also heavy-tailed and is the discrete analog of the Laplace distribution.

B. Geometric mechanism

Theorem 2.4. If $N$ is a positive integer and randomized response method $R(\cdot)$ is the geometric mechanism with parameter $\alpha$, then for any pair of observations $y, y' \in \mathcal{Y} \subseteq \mathbb{Z}^d$ such that $\|y - y'\|_1 \le N$, $R(\cdot)$ satisfies

$$P(R(y) \in S) \le e^{\epsilon}\, P(R(y') \in S) \tag{18}$$

for all subsets $S$ in the range of $R(\cdot)$, where

$$\epsilon = N \ln\left(\tfrac{1}{\alpha}\right). \tag{19}$$

Therefore, the geometric mechanism with parameter $\alpha$ is an $(N, \epsilon)$-private randomized response method with $\epsilon = N \ln\left(\tfrac{1}{\alpha}\right)$. If a data analysis algorithm sees only the observations' $(N, \epsilon)$-private responses, then the data analysis itself satisfies $(N, \epsilon)$-limited-precision local privacy.

Proof. It suffices to show that for any integer-valued vector $o \in \mathbb{Z}^d$, the following inequality holds for any pair of observations $y, y' \in \mathcal{Y} \subseteq \mathbb{Z}^d$ such that $\|y - y'\|_1 \le N$:

$$\exp(-\epsilon) \le \frac{P(R(y) = o)}{P(R(y') = o)} \le \exp(\epsilon), \tag{20}$$

where $\epsilon = N \ln\left(\tfrac{1}{\alpha}\right)$.

Let $\nu$ denote a $d$-dimensional noise vector with elements drawn independently from $\text{2Geo}(\alpha)$. Then,

$$\frac{P(R(y) = o)}{P(R(y') = o)} = \frac{P(\nu = o - y)}{P(\nu = o - y')} \tag{21}$$

$$= \frac{\prod_{i=1}^d \frac{1-\alpha}{1+\alpha}\, \alpha^{|o_i - y_i|}}{\prod_{i=1}^d \frac{1-\alpha}{1+\alpha}\, \alpha^{|o_i - y'_i|}} \tag{22}$$

$$= \alpha^{\left(\sum_{i=1}^d |o_i - y_i| - |o_i - y'_i|\right)}. \tag{23}$$

By the triangle inequality, we also know that for each $i$,

$$-|y_i - y'_i| \le |o_i - y_i| - |o_i - y'_i| \le |y_i - y'_i|. \tag{24}$$

Therefore,

$$-\|y - y'\|_1 \le \sum_{i=1}^d \left(|o_i - y_i| - |o_i - y'_i|\right) \le \|y - y'\|_1. \tag{25}$$

It follows that

$$\alpha^{N} \le \frac{P(R(y) = o)}{P(R(y') = o)} \le \alpha^{-N}. \tag{26}$$

If $\epsilon = N \ln\left(\tfrac{1}{\alpha}\right)$, then we recover the bound in equation 20.
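To make the mechanism concrete, here is a minimal Python sketch (our own illustration; the function names and NumPy-based sampler are assumptions, not the paper's released code) that privatizes an integer-valued observation by adding element-wise two-sided geometric noise with $\alpha = \exp(-\epsilon/N)$:

```python
import numpy as np

def two_sided_geometric(alpha, size, rng):
    # A 2Geo(alpha) draw is the difference of two i.i.d. Geo(alpha) draws,
    # each supported on {0, 1, 2, ...} with PMF (1 - alpha) * alpha^k.
    g1 = rng.geometric(1.0 - alpha, size=size) - 1  # NumPy's geometric starts at 1
    g2 = rng.geometric(1.0 - alpha, size=size) - 1
    return g1 - g2

def geometric_mechanism(y, epsilon, N, rng):
    # Privatize an integer-valued observation y so that the analysis satisfies
    # (N, epsilon)-limited-precision local privacy: since epsilon = N * ln(1/alpha),
    # we set alpha = exp(-epsilon / N).
    alpha = np.exp(-epsilon / N)
    return y + two_sided_geometric(alpha, np.shape(y), rng)

rng = np.random.default_rng(0)
y = np.array([3, 0, 7, 1])  # e.g., a hypothetical row of word counts
print(geometric_mechanism(y, epsilon=1.0, N=10, rng=rng))
```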

C. Two-sided geometric noise as exponentially randomized Skellam noise

Theorem C.1. A two-sided geometric random variable $\tau \sim \text{2Geo}(\alpha)$ can be generated as follows:

$$\lambda_1, \lambda_2 \sim \text{Exp}\left(\tfrac{1-\alpha}{\alpha}\right), \quad \tau \sim \text{Skel}(\lambda_1, \lambda_2), \tag{27}$$

where the Skellam distribution is the marginal distribution over the difference $\tau := g_1 - g_2$ of two independent Poisson random variables $g_1 \sim \text{Pois}(\lambda_1)$ and $g_2 \sim \text{Pois}(\lambda_2)$.

Proof. A two-sided geometric random variable $\tau \sim \text{2Geo}(\alpha)$ can be generated by taking the difference of two independent and identically distributed geometric random variables:³

$$g_1 \sim \text{Geo}(\alpha), \quad g_2 \sim \text{Geo}(\alpha), \quad \tau := g_1 - g_2. \tag{28}$$

The geometric distribution is a special case of the negative binomial distribution, with shape parameter equal to one (Johnson et al., 2005). Furthermore, the negative binomial distribution can be represented as a mixture of Poisson distributions with a gamma mixing distribution. We can therefore re-express equation 28 as follows:

$$\lambda_1 \sim \text{Gam}\left(1, \tfrac{1-\alpha}{\alpha}\right), \quad \lambda_2 \sim \text{Gam}\left(1, \tfrac{1-\alpha}{\alpha}\right), \quad g_1 \sim \text{Pois}(\lambda_1), \quad g_2 \sim \text{Pois}(\lambda_2), \quad \tau := g_1 - g_2. \tag{29}$$

Finally, a gamma distribution with shape parameter equal to one is an exponential distribution, while the difference of two independent Poisson random variables is marginally a Skellam random variable (Skellam, 1946).

³See https://www.youtube.com/watch?v=V1EyqL1cqTE.
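The construction in Theorem C.1 is easy to check empirically. The following sketch (our own, with arbitrary parameter values) compares the empirical PMF of the difference-of-geometrics construction in equation 28 against the exponentially randomized Skellam construction in equation 27; note that NumPy's exponential sampler is parameterized by scale, which is $\alpha/(1-\alpha)$ for rate $(1-\alpha)/\alpha$:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n = 0.75, 1_000_000

# Direct construction (eq. 28): difference of two i.i.d. Geo(alpha) variables.
g1 = rng.geometric(1.0 - alpha, n) - 1
g2 = rng.geometric(1.0 - alpha, n) - 1
tau_direct = g1 - g2

# Randomized-Skellam construction (eq. 27): exponential rates, then Poissons.
lam1 = rng.exponential(scale=alpha / (1.0 - alpha), size=n)  # Exp with rate (1-alpha)/alpha
lam2 = rng.exponential(scale=alpha / (1.0 - alpha), size=n)
tau_skellam = rng.poisson(lam1) - rng.poisson(lam2)

# The two empirical PMFs should agree up to Monte Carlo error.
for k in range(-3, 4):
    print(k, np.mean(tau_direct == k), np.mean(tau_skellam == k))
```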

D. Relationship between the Bessel and Skellam distributions

Theorem D.1. Consider two Poisson random variables $y_1 \sim \text{Pois}(\lambda_1)$ and $y_2 \sim \text{Pois}(\lambda_2)$. Their minimum $m := \min\{y_1, y_2\}$ and their difference $\delta := y_1 - y_2$ are deterministic functions of $y_1$ and $y_2$. However, if not conditioned on $y_1$ and $y_2$, the random variables $m$ and $\delta$ can be marginally generated as follows:

$$\delta \sim \text{Skel}(\lambda_1, \lambda_2), \quad m \sim \text{Bes}\left(|\delta|,\, 2\sqrt{\lambda_1 \lambda_2}\right). \tag{30}$$


Proof.

$$P(y_1, y_2) = \text{Pois}(y_1; \lambda_1)\, \text{Pois}(y_2; \lambda_2) \tag{31}$$

$$= \frac{\lambda_1^{y_1}}{y_1!} e^{-\lambda_1}\, \frac{\lambda_2^{y_2}}{y_2!} e^{-\lambda_2} \tag{32}$$

$$= \frac{\left(\sqrt{\lambda_1 \lambda_2}\right)^{y_1 + y_2}}{y_1!\, y_2!}\, e^{-(\lambda_1 + \lambda_2)} \left(\frac{\lambda_1}{\lambda_2}\right)^{(y_1 - y_2)/2}. \tag{33}$$

If $y_1 \ge y_2$, then

$$P(y_1, y_2) = \frac{\left(\sqrt{\lambda_1 \lambda_2}\right)^{y_1 + y_2}}{I_{y_1 - y_2}\!\left(2\sqrt{\lambda_1 \lambda_2}\right) y_1!\, y_2!}\, e^{-(\lambda_1 + \lambda_2)} \left(\frac{\lambda_1}{\lambda_2}\right)^{(y_1 - y_2)/2} I_{y_1 - y_2}\!\left(2\sqrt{\lambda_1 \lambda_2}\right) \tag{34}$$

$$= \text{Bes}\left(y_2;\, y_1 - y_2,\, 2\sqrt{\lambda_1 \lambda_2}\right) \text{Skel}(y_1 - y_2;\, \lambda_1, \lambda_2); \tag{35}$$

otherwise,

$$P(y_1, y_2) = \frac{\left(\sqrt{\lambda_1 \lambda_2}\right)^{y_1 + y_2}}{I_{y_2 - y_1}\!\left(2\sqrt{\lambda_1 \lambda_2}\right) y_1!\, y_2!}\, e^{-(\lambda_1 + \lambda_2)} \left(\frac{\lambda_2}{\lambda_1}\right)^{(y_2 - y_1)/2} I_{y_2 - y_1}\!\left(2\sqrt{\lambda_1 \lambda_2}\right) \tag{36}$$

$$= \text{Bes}\left(y_1;\, y_2 - y_1,\, 2\sqrt{\lambda_1 \lambda_2}\right) \text{Skel}(y_2 - y_1;\, \lambda_2, \lambda_1) = \text{Bes}\left(y_1;\, -(y_1 - y_2),\, 2\sqrt{\lambda_1 \lambda_2}\right) \text{Skel}(y_1 - y_2;\, \lambda_1, \lambda_2). \tag{37}$$

If

$$m := \min\{y_1, y_2\}, \quad \delta := y_1 - y_2, \tag{38}$$

then

$$y_2 = m, \quad y_1 = m + \delta \quad \text{if } \delta \ge 0 \tag{39}$$

$$y_1 = m, \quad y_2 = m - \delta \quad \text{otherwise}, \tag{40}$$

and

$$\left|\det \begin{pmatrix} \frac{\partial y_1}{\partial m} & \frac{\partial y_1}{\partial \delta} \\[2pt] \frac{\partial y_2}{\partial m} & \frac{\partial y_2}{\partial \delta} \end{pmatrix}\right| = \left|\det \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}\right|^{\mathbb{1}(\delta \ge 0)} \left|\det \begin{pmatrix} 1 & 0 \\ 1 & -1 \end{pmatrix}\right|^{\mathbb{1}(\delta < 0)} = 1, \tag{41}$$

so

$$P(m, \delta) = P(y_1, y_2) \left|\det \begin{pmatrix} \frac{\partial y_1}{\partial m} & \frac{\partial y_1}{\partial \delta} \\[2pt] \frac{\partial y_2}{\partial m} & \frac{\partial y_2}{\partial \delta} \end{pmatrix}\right| = \text{Bes}\left(m;\, |\delta|,\, 2\sqrt{\lambda_1 \lambda_2}\right) \text{Skel}(\delta;\, \lambda_1, \lambda_2). \tag{42}$$
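Theorem D.1 can likewise be checked by simulation. The sketch below (our own; the parameter values are arbitrary) draws pairs of Poisson variables and compares the empirical conditional distribution of the minimum, given the difference, to the Bessel PMF of equation 43 evaluated with SciPy's modified Bessel function:

```python
import numpy as np
from scipy.special import iv, gammaln

def bessel_pmf(m, nu, a):
    # Bessel PMF (Yuan & Kalbfleisch, 2000): f(m; a, nu) = (a/2)^(2m+nu) /
    # (m! * Gamma(m+nu+1) * I_nu(a)), computed in log space for stability.
    return np.exp((2 * m + nu) * np.log(a / 2) - gammaln(m + 1)
                  - gammaln(m + nu + 1) - np.log(iv(nu, a)))

rng = np.random.default_rng(2)
lam1, lam2, n = 3.0, 5.0, 1_000_000
y1 = rng.poisson(lam1, n)
y2 = rng.poisson(lam2, n)
delta = y1 - y2
m = np.minimum(y1, y2)

# Conditioned on the difference, the minimum should be Bessel distributed:
# m | delta ~ Bes(|delta|, 2 * sqrt(lam1 * lam2))  (Theorem D.1).
d = -2
subset = m[delta == d]
a = 2.0 * np.sqrt(lam1 * lam2)
for k in range(5):
    print(k, np.mean(subset == k), bessel_pmf(k, abs(d), a))
```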

E. Coordinate-ascent variational inference

Theorem 6.1. The Bessel distribution $m \sim \text{Bes}(\nu, a)$ for fixed $\nu$ is an exponential family with sufficient statistic $T_\nu(m) = 2m + \nu$, natural parameter $\eta_\nu(a) = \log\left(\frac{a}{2}\right)$, and base measure $h_\nu(m) = \frac{1}{m!\, \Gamma(m+\nu+1)}$.

Proof. The Bessel distribution (Yuan & Kalbfleisch, 2000) is a two-parameter distribution over the non-negative integers:

$$f(n; a, \nu) = \frac{\left(\frac{a}{2}\right)^{2n+\nu}}{n!\, \Gamma(n+\nu+1)\, I_\nu(a)}, \tag{43}$$

where the normalizing constant $I_\nu(a)$ is a modified Bessel function of the first kind—i.e.,

$$I_\nu(a) = \sum_{n=0}^{\infty} \frac{\left(\frac{a}{2}\right)^{2n+\nu}}{n!\, \Gamma(n+\nu+1)}. \tag{44}$$

For fixed and known $\nu$, we can rewrite the Bessel PMF as

$$f(n; a, \nu) = \frac{1}{n!\, \Gamma(n+\nu+1)} \exp\left((2n+\nu) \log\left(\frac{a}{2}\right) - \log I_\nu(a)\right). \tag{45}$$

We can then define the following functions:

$$h_\nu(n) = \frac{1}{n!\, \Gamma(n+\nu+1)} \tag{46}$$

$$T_\nu(n) = 2n + \nu \tag{47}$$

$$\eta_\nu(a) = \log\left(\frac{a}{2}\right) \tag{48}$$

$$A_\nu(a) = \log I_\nu(a). \tag{49}$$

Finally, we can rewrite the Bessel PMF in the exponential-family form:

$$f(n; a, \nu) = h_\nu(n) \exp\left(\eta_\nu(a) \cdot T_\nu(n) - A_\nu(a)\right). \tag{50}$$

We can derive a full coordinate-ascent variational inference (CAVI) algorithm for locally private Poisson factorization. For exposition, we focus on latent Dirichlet allocation, where $y_{dv} \sim \text{Pois}(\mu_{dv})$ and $\mu_{dv} = \sum_{k=1}^K \theta_{dk} \phi_{kv}$. It is standard to assume independent gamma priors over the factors—i.e., $\theta_{dk}, \phi_{kv} \sim \Gamma(a_0, b_0)$. We use $\mathbb{G}[X] = \exp\left(\mathbb{E}[\ln X]\right)$ to denote the geometric expected value of $X$.

E.1. $Q(\theta_{dk})$, $Q(\phi_{kv})$

The optimal variational distribution for the factors is the same as in non-private Poisson factorization (Cemgil, 2009):

$$Q(\theta_{dk}) \propto \mathbb{G}_Q\left[P(y_{dk}, \theta_{dk} \mid -)\right] \tag{51}$$

$$= \mathbb{G}_Q\left[\text{Pois}\left(y_{dk};\, \theta_{dk} \sum_{v=1}^V \phi_{kv}\right) \Gamma(\theta_{dk}; a_0, b_0)\right] \tag{52}$$

$$\propto \Gamma(\theta_{dk};\, \alpha_{dk}, \beta_{dk}) \tag{53}$$

$$\alpha_{dk} := a_0 + \sum_{v=1}^V \mathbb{E}_Q[y_{dvk}] \tag{54}$$

$$\beta_{dk} := b_0 + \sum_{v=1}^V \mathbb{E}_Q[\phi_{kv}] \tag{55}$$

$$\mathbb{E}_Q[\theta_{dk}] = \frac{\alpha_{dk}}{\beta_{dk}} \tag{56}$$

$$\mathbb{G}_Q[\theta_{dk}] = \exp\left(\psi(\alpha_{dk}) - \ln \beta_{dk}\right), \tag{57}$$

where $y_{dk} := \sum_{v=1}^V y_{dvk}$ and $\psi(\cdot)$ is the digamma function. The derivation for $\phi_{kv}$ is analogous.
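For concreteness, a vectorized sketch of the update in equations 54–57 might look as follows (our own illustration; the array names and shapes are assumptions, not the paper's code):

```python
import numpy as np
from scipy.special import digamma

def update_theta(a0, b0, E_y_dvk, E_phi):
    # CAVI update for Q(theta_dk) = Gamma(alpha_dk, beta_dk), eqs. (54)-(57).
    # E_y_dvk: (D, V, K) array of expected latent counts E_Q[y_dvk];
    # E_phi:   (K, V) array of E_Q[phi_kv].
    alpha = a0 + E_y_dvk.sum(axis=1)         # (D, K): a0 + sum_v E_Q[y_dvk]
    beta = b0 + E_phi.sum(axis=1)[None, :]   # (1, K): b0 + sum_v E_Q[phi_kv]
    E_theta = alpha / beta                             # arithmetic mean, eq. (56)
    G_theta = np.exp(digamma(alpha) - np.log(beta))    # geometric mean, eq. (57)
    return E_theta, G_theta
```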

E.2. $Q\left(m_{dv}, y^{(+)}_{dv}, g_{dv1}, g_{dv2}, y_{dv}, (y_{dvk})_{k=1}^K\right)$

The relationships between the different count variables are as follows:

$$\underbrace{\underbrace{\underbrace{\left(\sum_{k=1}^K y_{dvk}\right)}_{=\, y_{dv}} +\; g_{dv1}}_{=\, y^{(+)}_{dv}} -\; g_{dv2}}_{=\, y^{(\pm)}_{dv}} \tag{58}$$

$$m_{dv} = \min\left\{y^{(+)}_{dv},\, g_{dv2}\right\} \tag{59}$$

We have a single variational distribution for these variables.


E.3. Different factorizations of the joint

To find this variational distribution, we consider the variables' joint distribution:

$$P\left(g_{dv1}, g_{dv2}, (y_{dvk})_{k=1}^K, y_{dv}, y^{(+)}_{dv}, y^{(\pm)}_{dv}, m_{dv}\right). \tag{60}$$

The most straightforward factorization of this joint first generates all the Poisson random variables and then computes the remaining variables given their deterministic relationships to the underlying Poissons:

$$P\left(g_{dv1}, g_{dv2}, (y_{dvk})_{k=1}^K, y_{dv}, y^{(+)}_{dv}, y^{(\pm)}_{dv}, m_{dv}\right) = \text{Pois}(g_{dv1}; \lambda_{dv1})\, \text{Pois}(g_{dv2}; \lambda_{dv2}) \left(\prod_{k=1}^K \text{Pois}(y_{dvk};\, \theta_{dk}\phi_{kv})\right)$$
$$\times\; \mathbb{1}\left(y_{dv} = \sum_{k=1}^K y_{dvk}\right) \mathbb{1}\left(y^{(+)}_{dv} = y_{dv} + g_{dv1}\right) \mathbb{1}\left(y^{(\pm)}_{dv} = y^{(+)}_{dv} - g_{dv2}\right) \mathbb{1}\left(m_{dv} = \min\{y^{(+)}_{dv}, g_{dv2}\}\right). \tag{61}$$

We can equivalently first generate the sums of Poissons and then thin them using multinomial and binomial draws. In the following equation, the delta functions are implicitly present in the multinomial and binomial PMFs. Note that we write the probability parameters in the multinomial and binomial PMFs as unnormalized vectors. Also note that $\mu_{dv} = \sum_k \theta_{dk}\phi_{kv}$.

$$P\left(g_{dv1}, g_{dv2}, (y_{dvk})_{k=1}^K, y_{dv}, y^{(+)}_{dv}, y^{(\pm)}_{dv}, m_{dv}\right) = \text{Pois}(g_{dv2}; \lambda_{dv2})\, \text{Pois}\left(y^{(+)}_{dv};\, \lambda_{dv1} + \mu_{dv}\right)$$
$$\times\; \text{Binom}\left((y_{dv}, g_{dv1});\, y^{(+)}_{dv},\, (\mu_{dv}, \lambda_{dv1})\right) \text{Mult}\left((y_{dvk})_{k=1}^K;\, y_{dv},\, (\theta_{dk}\phi_{kv})_{k=1}^K\right)$$
$$\times\; \mathbb{1}\left(y^{(\pm)}_{dv} = y^{(+)}_{dv} - g_{dv2}\right) \mathbb{1}\left(m_{dv} = \min\{y^{(+)}_{dv}, g_{dv2}\}\right). \tag{62}$$

We can equivalently first generate the difference $y^{(\pm)}_{dv}$ and minimum $m_{dv}$ as Skellam and Bessel random variables. Conditioned on these variables, we can then compute $y^{(+)}_{dv}$ and $g_{dv2}$ via their deterministic relationship and, finally, thin $y^{(+)}_{dv}$ using binomial and multinomial draws:

$$P\left(g_{dv1}, g_{dv2}, (y_{dvk})_{k=1}^K, y_{dv}, y^{(+)}_{dv}, y^{(\pm)}_{dv}, m_{dv}\right) = \text{Skel}\left(y^{(\pm)}_{dv};\, \lambda_{dv1} + \mu_{dv},\, \lambda_{dv2}\right) \text{Bes}\left(m_{dv};\, |y^{(\pm)}_{dv}|,\, 2\sqrt{\lambda_{dv2}(\lambda_{dv1} + \mu_{dv})}\right)$$
$$\times\; \mathbb{1}\left(y^{(+)}_{dv} = m_{dv}\right)^{\mathbb{1}(y^{(\pm)}_{dv} \le 0)} \mathbb{1}\left(g_{dv2} = m_{dv}\right)^{\mathbb{1}(y^{(\pm)}_{dv} > 0)} \mathbb{1}\left(y^{(\pm)}_{dv} = y^{(+)}_{dv} - g_{dv2}\right)$$
$$\times\; \text{Binom}\left((y_{dv}, g_{dv1});\, y^{(+)}_{dv},\, (\mu_{dv}, \lambda_{dv1})\right) \text{Mult}\left((y_{dvk})_{k=1}^K;\, y_{dv},\, (\theta_{dk}\phi_{kv})_{k=1}^K\right). \tag{63}$$

It is this last factorization that enables us to derive the variational distribution.
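Before deriving it, the equivalence between factorizations 61 and 62—Poisson superposition plus multinomial thinning—can be sanity-checked by simulation. The sketch below (our own, with arbitrary rates) generates from factorization 61 and verifies two consequences of factorization 62:

```python
import numpy as np

rng = np.random.default_rng(3)
lam1 = 1.5
rates = np.array([0.5, 1.0, 0.25])   # hypothetical theta_dk * phi_kv for K = 3
mu = rates.sum()
n = 500_000

# Factorization (61): generate every Poisson, then form the sums deterministically.
g1 = rng.poisson(lam1, n)
y_k = rng.poisson(rates, size=(n, rates.size))
y = y_k.sum(axis=1)                  # y_dv
y_plus = y + g1                      # y_dv^(+)

# Factorization (62) says the total is Poisson: y_plus ~ Pois(lam1 + mu),
# so its mean and variance should both be approximately lam1 + mu = 3.25.
print(y_plus.mean(), y_plus.var())

# Conditioned on the total, y_dv should be Binomial(y_plus, mu / (mu + lam1)),
# so its conditional mean at y_plus = t should be t * mu / (mu + lam1).
t = 4
print(y[y_plus == t].mean(), t * mu / (mu + lam1))
```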

E.4. Deriving the variational distribution

Recall that the observed data consists of the difference variables $y^{(\pm)}_{dv}$:

$$Q\left(m_{dv}, y^{(+)}_{dv}, g_{dv1}, g_{dv2}, y_{dv}, (y_{dvk})_{k=1}^K\right) \propto \mathbb{G}_Q\left[P\left(y^{(\pm)}_{dv}, m_{dv}, y^{(+)}_{dv}, g_{dv1}, g_{dv2}, y_{dv}, (y_{dvk})_{k=1}^K \mid -\right)\right]. \tag{64}$$


Because the likelihood (i.e., the Skellam term) in equation 63 does not depend on any of these latent variables, it disappears entirely. We can then rewrite the right-hand side of equation 64 as:

$$\mathbb{G}_Q\left[P\left(m_{dv}, y^{(+)}_{dv}, g_{dv1}, g_{dv2}, y_{dv}, (y_{dvk})_{k=1}^K \mid y^{(\pm)}_{dv}, -\right)\right]$$
$$= \mathbb{G}_Q\left[\text{Bes}\left(m_{dv};\, |y^{(\pm)}_{dv}|,\, 2\sqrt{\lambda_{dv2}(\lambda_{dv1} + \mu_{dv})}\right)\right] \mathbb{1}\left(y^{(+)}_{dv} = m_{dv}\right)^{\mathbb{1}(y^{(\pm)}_{dv} \le 0)} \mathbb{1}\left(g_{dv2} = m_{dv}\right)^{\mathbb{1}(y^{(\pm)}_{dv} > 0)} \mathbb{1}\left(y^{(\pm)}_{dv} = y^{(+)}_{dv} - g_{dv2}\right)$$
$$\times\; \mathbb{G}_Q\left[\text{Binom}\left((y_{dv}, g_{dv1});\, y^{(+)}_{dv},\, (\mu_{dv}, \lambda_{dv1})\right) \text{Mult}\left((y_{dvk})_{k=1}^K;\, y_{dv},\, (\theta_{dk}\phi_{kv})_{k=1}^K\right)\right]. \tag{65}$$

Theorem 6.1 states that the Bessel distribution for fixed first parameter is an exponential family. We can therefore use standard results to push in the geometric expectations:

$$\mathbb{G}_Q\left[P\left(m_{dv}, y^{(+)}_{dv}, g_{dv1}, g_{dv2}, y_{dv}, (y_{dvk})_{k=1}^K \mid y^{(\pm)}_{dv}, -\right)\right]$$
$$= \text{Bes}\left(m_{dv};\, |y^{(\pm)}_{dv}|,\, 2\sqrt{\mathbb{G}_Q[\lambda_{dv2}]\, \mathbb{G}_Q[\lambda_{dv1} + \mu_{dv}]}\right) \mathbb{1}\left(y^{(+)}_{dv} = m_{dv}\right)^{\mathbb{1}(y^{(\pm)}_{dv} \le 0)} \mathbb{1}\left(g_{dv2} = m_{dv}\right)^{\mathbb{1}(y^{(\pm)}_{dv} > 0)} \mathbb{1}\left(y^{(\pm)}_{dv} = y^{(+)}_{dv} - g_{dv2}\right)$$
$$\times\; \text{Binom}\left((y_{dv}, g_{dv1});\, \mathbb{E}_Q\left[y^{(+)}_{dv}\right],\, \left(\mathbb{G}_Q[\mu_{dv}],\, \mathbb{G}_Q[\lambda_{dv1}]\right)\right) \text{Mult}\left((y_{dvk})_{k=1}^K;\, \mathbb{E}_Q[y_{dv}],\, \left(\mathbb{G}_Q[\theta_{dk}\phi_{kv}]\right)_{k=1}^K\right). \tag{66}$$

There are two expectations that do not have an analytic form:

$$\mathbb{G}_Q[\lambda_{dv1} + \mu_{dv}] = \exp\left(\mathbb{E}_Q\left[\ln\left(\lambda_{dv1} + \sum_{k=1}^K \theta_{dk}\phi_{kv}\right)\right]\right) \tag{67}$$

and

$$\mathbb{G}_Q[\mu_{dv}] = \exp\left(\mathbb{E}_Q\left[\ln\left(\sum_{k=1}^K \theta_{dk}\phi_{kv}\right)\right]\right); \tag{68}$$

however, both can be very closely approximated using the delta method (Ver Hoef, 2012), which has previously been used in variational inference schemes to approximate intractable expectations (Braun & McAuliffe, 2010; Wang & Blei, 2013). In particular, for some variable $Y = f(X)$, the expectation $\mathbb{E}[Y]$ is approximately

$$\mathbb{E}[Y] = \mathbb{E}[f(X)] \approx f\left(\mathbb{E}[X]\right) + \frac{1}{2} f''\left(\mathbb{E}[X]\right) \mathbb{V}[X]. \tag{69}$$

In our case, we therefore have

$$\mathbb{E}_Q[\ln \mu_{dv}] = \mathbb{E}_Q\left[\ln\left(\sum_{k=1}^K \theta_{dk}\phi_{kv}\right)\right] \approx \ln\left(\mathbb{E}_Q\left[\sum_{k=1}^K \theta_{dk}\phi_{kv}\right]\right) - \frac{\mathbb{V}_Q\left[\sum_{k=1}^K \theta_{dk}\phi_{kv}\right]}{2\left(\mathbb{E}_Q\left[\sum_{k=1}^K \theta_{dk}\phi_{kv}\right]\right)^2} \tag{70}$$

$$= \ln\left(\sum_{k=1}^K \mathbb{E}_Q[\theta_{dk}]\, \mathbb{E}_Q[\phi_{kv}]\right) - \frac{\sum_{k=1}^K \mathbb{V}_Q[\theta_{dk}\phi_{kv}]}{2\left(\sum_{k=1}^K \mathbb{E}_Q[\theta_{dk}]\, \mathbb{E}_Q[\phi_{kv}]\right)^2}. \tag{71}$$

Finally, because $\theta_{dk}$ and $\phi_{kv}$ are independent under $Q$, we have

$$\mathbb{V}_Q[\theta_{dk}\phi_{kv}] = \mathbb{V}_Q[\theta_{dk}]\, \mathbb{V}_Q[\phi_{kv}] + \mathbb{V}_Q[\theta_{dk}] \left(\mathbb{E}_Q[\phi_{kv}]\right)^2 + \mathbb{V}_Q[\phi_{kv}] \left(\mathbb{E}_Q[\theta_{dk}]\right)^2. \tag{72}$$
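Putting equations 68 and 70–72 together, the delta-method approximation of $\mathbb{G}_Q[\mu_{dv}]$ reduces to a few matrix products. The following sketch is our own illustration (the array names and shapes are assumptions):

```python
import numpy as np

def geo_expectation_mu(E_theta, V_theta, E_phi, V_phi):
    # Delta-method approximation of G_Q[mu_dv] = exp(E_Q[ln sum_k theta_dk * phi_kv]),
    # following eqs. (70)-(72). Inputs are (D, K) and (K, V) arrays of variational
    # means and variances; for Q(theta_dk) = Gamma(alpha, beta), the mean is
    # alpha / beta and the variance is alpha / beta**2.
    E_mu = E_theta @ E_phi                           # (D, V): sum_k E[theta] E[phi]
    V_prod = (V_theta @ V_phi                        # eq. (72), summed over k
              + V_theta @ (E_phi ** 2)
              + (E_theta ** 2) @ V_phi)
    ln_mu = np.log(E_mu) - V_prod / (2.0 * E_mu ** 2)   # eq. (71)
    return np.exp(ln_mu)                                # G_Q[mu_dv], eq. (68)
```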