J. Parallel Distrib. Comput. 68 (2008) 78–92. www.elsevier.com/locate/jpdc

Distributed probabilistic inferencing in sensor networks using variational approximation

Sourav Mukherjee, Hillol Kargupta∗,1

Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, USA

Received 11 March 2007; received in revised form 10 July 2007; accepted 10 July 2007. Available online 23 August 2007.

Abstract

This paper considers the problem of distributed inferencing in a sensor network. It particularly explores the probabilistic inferencing problem in the context of a distributed Boltzmann machine-based framework for monitoring the network. The paper offers a variational mean-field approach to develop a communication-efficient local algorithm for variational inferencing in distributed environments (VIDE). It compares the performance of the proposed approximate variational technique with respect to the exact and centralized techniques. It shows that VIDE offers a much more communication-efficient solution at very little cost in terms of accuracy. It also offers experimental results in order to substantiate the scalability of the proposed algorithm.
© 2007 Elsevier Inc. All rights reserved.

Keywords: Probabilistic Inferencing; Variational Approximation; Distributed Data Mining; Sensor Network; Data Mining

1. Introduction

Probabilistic inferencing lies at the core of a wide range of sensor network applications, such as target tracking, target location, and distributed sensor calibration. Probabilistic inferencing can be viewed as the task of computing the posterior probability distribution of a set of hidden variables, given a set of visible variables (also known as evidence variables or observed variables), and a probability model for the underlying joint distribution of the hidden and the visible variables [26].

In this paper, we consider the problem of inferencing in a heterogeneously distributed environment where each visible variable is measured in a different node, and each node is required to compute the probability distribution of the hidden variables, given the visible variables. The problem of exact inferencing, even when the data are centralized, is intractable, and variational methods provide a deterministic approximation technique for such inferencing [15,12]. This paper explores the possibility of variational approximation in distributed computation for inferencing in a distributed Boltzmann machine. The paper presents an algorithm, variational inference in distributed environments (VIDE), which can perform probabilistic inferencing in heterogeneously distributed environments using variational approximation.

∗ Corresponding author. E-mail addresses: [email protected] (S. Mukherjee), [email protected] (H. Kargupta).
1 Also affiliated with Agnik, LLC.

0743-7315/$ - see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.jpdc.2007.07.011

The rest of this paper is organized as follows. Section 2 motivates and formally defines the problem of distributed inferencing in the context of sensor network applications. Section 3 offers a brief survey of existing work that is relevant. Section 4 shows how the inferencing problem can be posed as a variational optimization problem. Section 5 presents our algorithm, VIDE, which extends the variational approach to the heterogeneously distributed environment. It shows that variational inference in a heterogeneously distributed system reduces to a distributed average computation, followed by a purely local phase of computation. Section 6 analyzes the communication complexity of VIDE. It also presents an analysis of the energy consumption of VIDE. Section 7 presents the experimental results. Finally, Section 8 concludes this paper.

2. Motivation and problem definition

A large number of sensor network applications involve probabilistic inferencing. This section presents some applications which motivate the distributed probabilistic inferencing problem. Next, it presents a formal definition of the distributed inferencing problem.

Consider the problem of target classification in a sensor network. An object is present in a field of sensors, and several attributes of that object are being measured by the sensors. The goal is to classify the object. If the measured attributes are considered to be visible variables, and the class label to be a hidden variable, then the target classification problem requires an estimation of the probability distribution of the hidden variable given the visible variables.

A similar problem is that of target localization in sensor networks. Several attributes of a possibly mobile object are being measured by a network of sensors. The task is to estimate the position of the object. For example, in a battlefield where acoustic sensor networks have been deployed, we may want to estimate the position where a fire-arm was just triggered. Again, we can model the co-ordinates of the object as hidden variables, and the sensor network measurements as the visible variables. For estimating the position of the object, we may need to know the probability distribution of the hidden variables given the visible variables.

As a third example, we can consider the distributed sensor calibration problem [2,26]. Let us suppose that some variable (e.g. temperature) is being measured by a set of sensors. The measurements taken by the individual sensors may have biases, depending on the measurement technique used. However, if we have a probability model involving the true temperatures, the measured temperatures, and the biases, then we can compute the posterior probability of the true temperatures, given the measured temperatures.

All the above examples are special cases of the following abstract problem:

Probabilistic inferencing problem: Let $X = (X_h, X_v)$ be a random vector, where $X_h$ and $X_v$ consist of the hidden and the visible attributes, respectively. We are also given a model for the joint probability, $P(x_h, x_v \mid \theta)$, where $\theta$ is the parameter vector for the model. We wish to compute the posterior probability of the hidden variables, given the visible variables, that is,
$$P(x_h \mid x_v; \theta) = \frac{P(x_h, x_v \mid \theta)}{P(x_v \mid \theta)}.$$

The probability model is usually expressed as a graphical model. We shall see, in the next section, that performing exact inferencing in a graphical model is an intractable problem, and so we need effective approximation techniques to solve the problem. We shall also see, later, that variational methods provide a deterministic approximation technique for such problems.

Another aspect common to all the above scenarios is that each node in the network measures a distinct visible random variable; that is, the data are completely heterogeneously distributed. The naïve way to perform inferencing in such a system would be to centralize all the data at one site and then run an inferencing algorithm. However, sensor nodes have limited battery life, and communication accounts for the majority of the power consumption in sensor nodes. This makes the above approach impracticable. Moreover, the system may need to take decisions based on the inference within real-time constraints. In that case, the limitations in communication bandwidth are likely to restrict the system's capability to react within real-time constraints. Thus, we need a way to compute the desired distribution with minimum communication. This motivates the definition of the distributed probabilistic inferencing problem:

Distributed probabilistic inferencing problem: Formally, let $X_v = \{X_v^{(1)}, X_v^{(2)}, \ldots, X_v^{(p)}\}$ such that $X_v^{(i)}$ is measured at node $N_i$ of a network. Our problem is to estimate $P(x_h \mid x_v; \theta)$ while minimizing the communication between the nodes of the sensor network.

As mentioned earlier, the probability distribution of $X$ is usually expressed as a graphical model. A graphical model consists of a set of nodes, representing the random variables, and a set of edges, representing dependencies between pairs of random variables [12].

For concreteness, we shall consider a special kind of graphical model, known as the Boltzmann machine. The reasons for selecting Boltzmann machines in this study are two-fold:
(1) The properties of Boltzmann machines as probability models have been studied extensively in the literature (see, for example, [9]).
(2) Boltzmann machines have been shown to be very useful tools in pattern recognition problems. For example, Boltzmann machines have been used in recognition of handwritten digits [8]. The usefulness of such models in face recognition has been demonstrated in [31]. Boltzmann machines have also been used in biometric authentication through fingerprint recognition, as described in [34].

Thus, the Boltzmann machine constitutes a graphical model that is both well understood and applicable to pattern recognition problems. The present work addresses the problem of performing inference in such a graphical model, when each visible variable is measured or detected at a separate node.

The paper also frequently uses the term "local algorithms". Let us explain what the term means in this paper.2 Consider a network represented by an abstract tuple $G = \langle V, E, C_{V,t}, D_V, P_E \rangle$. $V$ and $E$ represent the vertex and edge sets of the undirected network-graph; $C_{V,t}$ represents the set of all states $C_{v,t}$ of a vertex $v \in V$ at time $t$. $D_V = \{D_1, D_2, \ldots, D_N\}$, where $D_i$ represents the data stored at node $i$ of the network, $1 \le i \le N$, and $N$ is the number of nodes in the network. Thus, $D_V$ represents the data stored across the network. $P_E$ represents properties of every edge in $E$. The distance $dist_G(u, v)$ between a pair of nodes $u$ and $v$ is the length of the shortest path between those vertices. The $\alpha$-neighborhood of a node $u \in V$ is defined as $\Gamma_\alpha(u, G) = \{v \mid dist_G(u, v) \le \alpha\}$. Let each node $v \in V$ store a data set $X_v$. An $\alpha$-local query by some vertex $v$ is a query whose response can be computed using some function $f(X_\alpha(v))$ where $X_\alpha(v) = \{X_v \mid v \in \Gamma_\alpha(v, G)\}$. An algorithm is called $\alpha$-local if it never requires computation of a $\beta$-local query such that $\beta > \alpha$. In the rest of this paper, the term local algorithm will be loosely used for implying an $\alpha$-local algorithm where $\alpha$ is either a small constant or a slowly growing function with respect to the network parameters such as the number of nodes. The next section presents a review of the related work done in this field.

2 For a more comprehensive discussion on locality, please refer to the paper entitled Distributed Identification of Top-l Inner Product Elements and its Applications in a Peer to Peer Network, by K. Bhaduri et al. (in communication).
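To make the locality notion concrete, the short sketch below computes the $\alpha$-neighborhood $\Gamma_\alpha(u, G)$ of a node by breadth-first search over an adjacency-list representation of the network graph; the example graph, node labels, and function name are illustrative choices and do not come from the paper.

```python
from collections import deque

def alpha_neighborhood(graph, u, alpha):
    """Return the set of nodes within alpha hops of u (including u itself).

    graph: dict mapping each node to the list of its neighbors (undirected graph).
    """
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if dist[v] == alpha:          # do not expand beyond alpha hops
            continue
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)

# Example: a 5-node path graph 0-1-2-3-4; the 2-neighborhood of node 0 is {0, 1, 2}.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(alpha_neighborhood(path, 0, 2))
```

An $\alpha$-local algorithm run at node $u$ may only issue queries whose answers depend on the data stored inside such a neighborhood.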

3. Related work

As mentioned earlier, target tracking, target localization, etc. are important sensor network applications which constitute special examples of the distributed inferencing problem; in recent times, they have been studied extensively (see, for example, [20,6]). The distributed sensor calibration problem, which is another example of the distributed inferencing problem, is described in [2,26].

Distributed data mining (DDM) has recently emerged as an extremely important area of research. DDM is concerned with analysis of data in distributed environments, while paying careful attention to issues related to computation, communication, storage, and human–computer interaction. The data can be distributed homogeneously, where all sites share the same schema, but each site stores attribute values of a different set of tuples. Alternatively, the data can be distributed heterogeneously, where all sites store the same set of tuples, but each site stores a different set of attributes. In the latter scenario, each site must store some key attribute, such that the join of the local data sources is lossless. Detailed surveys of DDM algorithms and techniques have been presented in [16–18]. Some of the common data-analysis tasks include association rule mining, clustering, classification, kernel density estimation, and so on. A significant amount of research has been done in recent years to develop techniques for such data analysis, suitable for distributed environments. For example, an algorithm for distributed association rule mining in peer-to-peer systems has been presented in [33]. K-means clustering has been extended to the distributed scenario in [5]. Techniques for performing non-parametric density estimation over homogeneously distributed data have been studied and experimentally evaluated in [7]. The problem addressed in the present paper, probabilistic inferencing in distributed environments, is a problem in DDM, because it involves performing inferencing in a database that is distributed in a completely heterogeneous way, with each node storing only a single visible variable.

The applicability of graphical models and belief propagation to distributed inferencing in sensor networks has been studied in [3,4,27,26]. The underlying probability distribution is represented as a graph, and inferencing proceeds as a sequence of message-passing operations between nodes of the graph [11]. In [27], such belief propagation-based inferencing has been applied to scenarios with discrete random variables; in [4], it has been applied to multivariate Gaussian distributions.

Graphical models can be of two types: directed and undirected. In an undirected graphical model, the joint probability of all the random variables is given by the (normalized) product of local potential functions, where each potential function corresponds to a clique in the graph. A representative of the exact inferencing algorithms is the junction-tree algorithm [13], which operates on an undirected graphical model; if the original model is directed, then it converts it to an undirected model by an operation called moralization. The moralized graph is then triangulated, such that there are no 4-cycles without a chord. Then, the cliques of the graph can be arranged in a junction-tree data structure. It is important that, if two cliques have a node in common, then they assign the same marginal probability to that node (this is called the local consistency property). Maintenance of local consistency requires marginalization and rescaling of the clique potentials. But, as observed in [12], this can take time that is exponential in the size of the largest clique in the network. Let, for example, the nodes in a particular clique be $X_1, X_2, \ldots, X_n$, and let $\psi(x_1, x_2, \ldots, x_n)$ be the corresponding clique potential. Then, for any $X_1 = x_1$, the marginal probability of $X_1$ would be given by
$$\sum_{x_2, x_3, \ldots, x_n} \psi(x_1, x_2, \ldots, x_n),$$
which takes time $O(2^{n-1})$. This suggests that any practical solution to the inferencing problem must resort to approximation techniques. Variational approximation is a deterministic approximation technique that has been shown to be suitable for this problem [15,12].

Boltzmann machines form an important and widely researched class of graphical models [14]. In fact, they were the first explicitly stochastic networks for which a learning rule was developed [1,29]. Traditionally, inferencing and learning in Boltzmann machines was done using Gibbs sampling; later, mean-field approximations were applied [28]. Another interesting approach to the problem is decimation [30], which is efficient in only certain subclasses of Boltzmann machines, such as Boltzmann trees and chains. Boltzmann machines have been applied in many pattern recognition problems, such as recognition of handwritten digits [8], face recognition [31], and fingerprint recognition [34]. The present work deals with the scenario where each visible variable in the Boltzmann machine is measured at a different node of the network, which implies that a distributed computation is necessary to perform the inferencing.

Approximation algorithms often provide practical solutions to problems that are intractable. Approximation algorithms can be classified into two broad categories: probabilistic approximation algorithms (or randomized algorithms) [25] and deterministic approximation algorithms. In a probabilistic approximation algorithm, the output on a problem instance depends on both the input and the output of a pseudo-random number generator. Deterministic approximation techniques, on the other hand, do not have any element of randomness involved. Variational approximation constitutes an important technique that falls under this category. Usually, the variational approach to solving a problem involves defining an objective function that is optimized at the solution to the problem, and then searching for an optimum of that function. Often, this search is done in a restricted subset of the solution space, which makes the technique computationally tractable. A typical example is the finite element method (FEM). Suppose we wish to solve a differential equation for which deriving a closed-form solution is difficult. As described in [12], the variational approach to solving this problem involves defining a variational objective function, and then searching for a solution that optimizes this objective function. However, since it is not practicable to search in the space of all functions, we consider only those functions that can be expressed as linear combinations of a finite set of basis functions. Thus, the problem reduces to evaluation of the coefficients of these basis functions, which is a search problem in a finite-dimensional space.

Jordan et al. [15] and Jaakkola [12] have developed the framework for applying variational optimization to graphical models, where exact inferencing is intractable. They have addressed problems like inferencing, parameter estimation using expectation maximization (EM), and Bayesian parameter estimation. However, their work assumes that the data are centralized.

The standard variational mean-field approximation for inferencing in Boltzmann machines is to assume that the hidden variables are independent, given the visible variables (this is described formally in the next section). This paper demonstrates the applicability of this approach in the distributed scenario.

Paskin et al. [26] have developed a junction-tree-based framework for performing exact inferencing in distributed graphical models. However, they do not adopt a variational approach.

Vlassis et al. [32] have applied the variational EM approach to compute the parameters of a mixture of Gaussian distributions. In [19], they have shown that the variational formulation, when applied to the homogeneously distributed scenario, reduces to a distributed average computation problem. We, too, have used distributed average computation as an essential part of our algorithm; however, we are concerned with the heterogeneously distributed scenario, and the graphical model we choose is the Boltzmann machine.

This paper demonstrates that the variational approximation technique is applicable to the problem of probabilistic inferencing in heterogeneously distributed environments. In particular, it makes the following contributions:
• It extends the variational formulation of the probabilistic inferencing problem to the heterogeneously distributed scenario.
• Based on this formulation, it presents a deterministic approximation algorithm, VIDE, for solving this problem.
• It presents theoretical analysis and experimental evaluation of the accuracy and communication cost of VIDE, and establishes VIDE as a communication-efficient technique for solving the distributed inferencing problem.
In the next section, we consider the problem of probabilistic inferencing in a Boltzmann machine, and its variational formulation.

4. Background

In this section, we first briefly review the variational method as a deterministic approximation technique for solving problems, applicable when computing the exact solution is too expensive. Then, we present the problem of inferencing in a Boltzmann machine, and explain why solving it naïvely is intractable. Finally, we present the variational mean-field technique for solving the inferencing problem.

4.1. The variational approximation technique

In this section, we develop the basic ideas of variational approximation. To illustrate the technique, we shall use a simple problem: linear regression. The variational technique, as applied to this problem, is explained in detail in [12]; this section presents a brief summary of that discussion.

Illustrative example: linear regression: Consider a linear function $f: \mathbb{R}^d \to \mathbb{R}$. The task is to find $f$, given the values of $y_i = f(x_i)$, for $1 \le i \le n$ and $\{x_i : 1 \le i \le n\} \subset \mathbb{R}^d$. Since $f$ is linear, the problem reduces to finding the weight vector $w \in \mathbb{R}^d$ such that $y = f(x) = w^T x$.
The exact solution $w^*$ to this problem is given by
$$w^* = C^{-1} b,$$
where
$$C = \sum_{i=1}^{n} x_i x_i^T \quad \text{and} \quad b = \sum_{i=1}^{n} y_i x_i.$$

Although, in this problem, we have a closed-form expression for the exact solution, computing the exact solution involves computing the inverse of $C$, which can become expensive if the dimensionality $d$ is large.

The variational method attempts to find a solution to a given problem by optimizing an objective function. Depending on the nature of the objective function, and on the technique used to optimize it, the solution obtained using the variational method may be exact or approximate.

For the linear regression problem, let us define the following objective function:
$$J(w) = \tfrac{1}{2} (w^* - w)^T C (w^* - w),$$
which is the distance between the approximate solution $w$ and the exact solution $w^*$, weighted by the matrix $C$. Thus, $J(w) \ge 0$ and $J(w) = 0 \Leftrightarrow w = w^*$.

Thus, the variational formulation of this problem is simply
Minimize:
$$J(w) = \tfrac{1}{2} (w^* - w)^T C (w^* - w).$$
Although the expression for $J(w)$ involves $w^*$, which we do not know, we can still use the fact that $w^* = C^{-1} b$ to prove that
$$J(w) = \tfrac{1}{2} b^T C^{-1} b - w^T b + \tfrac{1}{2} w^T C w,$$
which is an expression for $J(w)$ that does not contain $w^*$. Finally, the above expression indicates that minimizing $J(w)$ is equivalent to minimizing the following function:
$$J(w) = -w^T b + \tfrac{1}{2} w^T C w.$$


This variational formulation is very useful. In this example, the objective function $J(w)$ is quadratic in $w$ and hence convex (this, however, is not the case in many other problems where the variational approximation is applied). As a result, a simple gradient descent will converge to the global minimum.
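As an illustration, the sketch below minimizes the equivalent objective $J(w) = -w^T b + \tfrac{1}{2} w^T C w$ by plain gradient descent on synthetic data; the data, step size, and stopping rule are arbitrary choices made for this example and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200
X = rng.normal(size=(n, d))                  # design points x_i
w_true = rng.normal(size=d)
y = X @ w_true                               # noiseless targets y_i = w_true^T x_i

C = X.T @ X                                  # C = sum_i x_i x_i^T
b = X.T @ y                                  # b = sum_i y_i x_i

# Gradient of J(w) = -w^T b + 0.5 w^T C w is C w - b.
w = np.zeros(d)
step = 1.0 / np.linalg.norm(C, 2)            # conservative step size (1 / spectral norm)
for _ in range(5000):
    grad = C @ w - b
    if np.linalg.norm(grad) < 1e-10:
        break
    w -= step * grad

print(np.allclose(w, np.linalg.solve(C, b), atol=1e-6))   # recovers w* = C^{-1} b
```

The descent loop never forms $C^{-1}$, which is the point of the variational reformulation.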

Thus, the key step in solving a computationally difficult problem by the variational method involves defining an objective function which is optimized at the exact solution. Having defined this function, the problem then reduces to a search problem that aims to find an approximate optimal point of this function.

In the next section, we formally define the problem of probabilistic inference in a Boltzmann machine; we argue that the exact solution to this problem takes exponential time. Then, we demonstrate how the variational method is used to obtain an approximate solution.

4.2. Boltzmann machine

Consider a random vector $X$, comprised of binary random variables having range $\{0, 1\}$ (we are interested in finite-state distributions; thus a $k$-ary random variable can be expressed as a finite sequence of binary random variables, and so assuming the random variables to be binary does not lead to any loss of generality). Some of the components of $X$ are visible, while others are hidden. Let $X_h$ be the vector containing only the hidden components, and let $X_v$ be the vector containing only the visible components, such that $X = (X_h, X_v)$.

We assume that the distribution of $X$ is a Boltzmann machine with parameter vector $\theta$. That is,
$$P(x \mid \theta) = \frac{1}{Z} \exp(-E(x; \theta)),$$
where
$$E(x; \theta) = -\left( \sum_{i<j} \theta_{i,j} x_i x_j + \sum_{i} \theta_{i,0} x_i \right)$$
is the energy function, and
$$Z = \sum_{x} \exp(-E(x; \theta))$$
is the normalizing constant, called the partition function.

The basic inferencing problem is to compute the probability distribution of the hidden variables, given the visible variables,
$$P(x_h \mid x_v; \theta) = \frac{P(x_h, x_v \mid \theta)}{P(x_v \mid \theta)}.$$

Actually, if the distribution $P(x \mid \theta)$ is a Boltzmann machine, then the distribution $P(x_h \mid x_v; \theta)$ is also a Boltzmann machine, given by
$$P(x_h \mid x_v; \theta) = \frac{1}{Z_c} \exp(-E_c(x_h; \theta)),$$
where
$$E_c(x_h; \theta) = -\left( \sum_{i<j;\ x_i, x_j \in x_h} \theta_{i,j} x_i x_j + \sum_{x_i \in x_h} \theta^c_{i,0} x_i \right)$$
such that the indices $i, j$ vary only over the hidden variables, and
$$\theta^c_{i,0} = \theta_{i,0} + \sum_{x_j \in x_v} \theta_{i,j} x_j.$$
This is because, if, for any $i, j$, both $X_i, X_j$ are hidden, then they continue to contribute the quadratic exponent $\theta_{i,j} x_i x_j$ in the numerator. Otherwise, if, say, $X_i$ is hidden and $X_j$ is visible, then $x_j$ becomes a constant, so the exponent becomes linear in $x_i$; this explains why $\theta^c_{i,0}$ has a contribution $\theta_{i,j} x_j$ when $X_j$ is visible. Finally, when both $X_i, X_j$ are visible, then the exponential of $\theta_{i,j} x_i x_j$ becomes a constant, and is included in the overall normalization constant $Z_c$.

The normalization constant, $Z_c$, is, then, the partition function of this new Boltzmann machine. Exact computation of the partition function, however, is intractable because
$$Z_c = \sum_{x_h} \exp(-E_c(x_h; \theta)),$$
which is a summation over all possible values of the hidden variable vector $x_h \in \{0, 1\}^H$, where $H$ is the number of hidden variables. The number of terms in this summation is, therefore, $2^H$. Since this is exponential in the number of hidden variables, we need approximate techniques to estimate the distribution.
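For a handful of hidden variables, the clamped posterior can still be computed exactly by enumerating all $2^H$ hidden configurations, which is useful as a ground-truth reference and makes the exponential blow-up concrete. The sketch below assumes a dense matrix encoding of $\theta$ and list-valued index sets; these representation choices are made for the example, not prescribed by the paper.

```python
import itertools
import numpy as np

def exact_clamped_posterior(theta, theta0, x_v, hidden_idx, visible_idx):
    """Enumerate P(x_h | x_v; theta) for a binary Boltzmann machine.

    theta:  symmetric (n x n) NumPy array of pairwise weights theta[i, j]
    theta0: length-n array of biases theta_{i,0}
    x_v:    observed binary values for the indices in visible_idx
    Cost is O(2^H * H^2), so this is only feasible for small H.
    """
    H = len(hidden_idx)
    # Clamped biases: theta^c_{i,0} = theta_{i,0} + sum over visible j of theta_{i,j} x_j.
    theta_c = {i: theta0[i] + sum(theta[i, j] * xj for j, xj in zip(visible_idx, x_v))
               for i in hidden_idx}
    neg_energies = []
    for bits in itertools.product([0, 1], repeat=H):
        xh = dict(zip(hidden_idx, bits))
        quad = sum(theta[i, j] * xh[i] * xh[j]
                   for a, i in enumerate(hidden_idx) for j in hidden_idx[a + 1:])
        lin = sum(theta_c[i] * xh[i] for i in hidden_idx)
        neg_energies.append(quad + lin)        # this equals -E_c(x_h; theta)
    unnorm = np.exp(np.array(neg_energies))
    return unnorm / unnorm.sum()               # one probability per hidden configuration
```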

4.3. Variational mean-field approximation

Variational methods try to approximate $P(x_h \mid x_v; \theta)$ by a distribution $Q(x_h)$ which is computationally tractable. In particular, the mean-field approximation assumes that $Q$ factorizes completely across the hidden variables:
$$Q(x_h) = \prod_i Q_i(x_i).$$

To quantify the accuracy of the approximation, we use the following standard objective function [23,15,12] (which is to be maximized):
$$J(Q) = -\sum_{x_h} Q(x_h) \ln \frac{Q(x_h)}{\exp(-E_c(x_h; \theta))}.$$

Now if we define each of the individual marginal distributions $Q_i$ to be a binary distribution with mean $\mu_i$, that is,
$$Q_i(x_i) = \mu_i^{x_i} (1 - \mu_i)^{1 - x_i},$$
then the parameters $\mu_i$ that maximize $J(Q)$ are given by the set of mean-field equations (one for each $\mu_i$):
$$\mu_i = \sigma\left( \sum_{j > i} \theta_{i,j} \mu_j + \theta^c_{i,0} \right),$$
where $\sigma(z) = 1/(1 + e^{-z})$ is the logistic function.

The following section presents our algorithm, VIDE, for solving this problem in a distributed setting, such as in a sensor network environment.
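A centralized reference implementation of these fixed-point updates might look like the sketch below; the synchronous update order, the initialization at $\tfrac{1}{2}$, and the tolerance value are implementation choices for this example rather than requirements stated in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_field(theta, theta_c0, tol=1e-8, max_iter=1000):
    """Iterate mu_i <- sigma(sum_{j>i} theta[i, j] mu_j + theta_c0[i]) to a fixed point.

    theta:    (H x H) array of pairwise weights among the hidden variables
    theta_c0: length-H array of clamped biases theta^c_{i,0}
    Returns the vector of variational means mu.
    """
    H = len(theta_c0)
    mu = np.full(H, 0.5)                       # neutral starting point
    for _ in range(max_iter):
        mu_new = np.empty(H)
        for i in range(H):
            mu_new[i] = sigmoid(theta[i, i + 1:] @ mu[i + 1:] + theta_c0[i])
        if np.linalg.norm(mu_new - mu) < tol:  # stop when the change is negligible
            return mu_new
        mu = mu_new
    return mu
```

This is the centralized solution that the distributed algorithm of the next section is compared against.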


5. Variational inferencing in distributed environments (VIDE)

Consider the case when $X_v = \{X_v^{(1)}, X_v^{(2)}, \ldots, X_v^{(p)}\}$ such that the $i$th visible variable, $X_v^{(i)}$, is measured at node $N_i$ of a network. This corresponds to a vertical partitioning of $X_v$, in which, at any instant of time, the $i$th attribute of $x_v$, $x_v^{(i)}$, is stored in the $i$th node $N_i$. Also, each node knows the parameters $\theta$ of the distribution. Given that, at any instant of time, node $N_i$ measures $X_v^{(i)} = x_v^{(i)}$, we want each node to be able to compute the distribution $P(x_h \mid x_v; \theta)$ by computing the solution vector $\mu$ to the mean-field equations. We propose an algorithm, VIDE, described in Algorithm 1, which solves this problem in a distributed manner.

Algorithm 1 VIDE for node k.
Input: The following are the arguments to VIDE:
(1) The value of the visible variable $X_v^{(k)}$ measured at this node.
(2) The value of $\theta_{i,j}$ for every pair of random variables $X_i, X_j$.
(3) The value of $\theta_{i,0}$ for every random variable $X_i$.
Output: The mean $\mu_i$ of the marginal distribution under the variational approximation, for every hidden variable $X_i$.
Let $p$ be the number of nodes in the network.
1: for each hidden variable $X_i$ do
2:   $a_i \leftarrow$ Distributed_Average($\{\theta_{i,j} x_j : x_j \in x_v\}$)
3:   $\theta^c_{i,0} \leftarrow \theta_{i,0} + p \cdot a_i$
4:   Also, initialize $\mu_i \leftarrow \tfrac{1}{2}$
5: end for
6: while $\mu$ has changed significantly in the last iteration do
7:   for each hidden variable $X_i$ do
8:     $\mu_i \leftarrow \sigma\big(\sum_{j>i} \theta_{i,j} \mu_j + \theta^c_{i,0}\big)$
9:   end for
10: end while
11: return $\mu$.

The computation can be broken down into two phases, which must be performed by each node:
(1) Calculate, for each $i$,
$$\theta^c_{i,0} = \theta_{i,0} + \sum_{x_j \in x_v} \theta_{i,j} x_j.$$
(2) Iteratively, find a fixed-point solution for the mean-field equations as
$$\mu_i^{(t+1)} = \sigma\left( \sum_{j>i} \theta_{i,j} \mu_j^{(t)} + \theta^c_{i,0} \right),$$
where $t$ denotes the number of iterations. This can be done in a purely local fashion, as each node already knows $\theta^c_{i,0}$ (for every $i$), and all the other parameters in the above equation are stored locally.

The crux of the computation, therefore, lies in computing sums of the form $s = \sum_{i=1}^{p} s_i$ where each $s_i$ is stored in a different node. Now, if we assume the network (not the data it measures) is static, then we can write the sum as $s = p\bar{s}$, where
$$\bar{s} = \frac{1}{p} \sum_{i=1}^{p} s_i$$
is the arithmetic average of $\{s_i\}_{i=1}^{p}$. Thus, we can compute the sum in a distributed fashion by computing the corresponding average, and then multiplying it by the number of sites.

If the network is dynamic, however, then we will also need a distributed algorithm for estimating the network size. Fortunately, algorithms for estimating network size from local information exist [10]. However, in our experiments, we have considered only static networks.

Of the several algorithms for distributed average computation, we choose the A1 algorithm proposed by Mehyar et al. [24]. This can be considered as one implementation of the Distributed_Average function used by VIDE.

Distributed average computation: In this section, we describe the distributed averaging algorithm used by VIDE. Each node $k$ stores a scalar, $z_k$, and an estimate $y_k$ of the global average. It is the aim of the algorithm to achieve the condition $y_k \approx \frac{1}{p}\sum_{i=1}^{p} z_i$. Algorithm 2 describes the algorithm in brief.

Algorithm 2 A1_Average($\{z_i\}_{i=1}^{p}$) for node k.
Input: The following are the inputs to the algorithm:
(1) $z_k$, the value stored in node $k$, and,
(2) $\gamma_k$, the step-size parameter used by node $k$ for averaging.
Output: $y_k \approx \frac{1}{p}\sum_{i=1}^{p} z_i$
1: Initialize: $y_k = z_k$
2: while true do
3:   while every neighbor $i$ such that $i < k$ has not updated with $k$ do
4:     if such a neighbor $i$ initiates an update with node $k$ then
5:       if $k$ is already executing an update with some other neighbor $j$ then
6:         Send a negative acknowledgement (NACK) to node $i$. A NACK message will cause this update operation to fail, and will have no effect on the state of the network.
7:       else
8:         Increment $= \gamma_k (y_i - y_k)$
9:         Update for node $k$: $y_k = y_k + \text{Increment}$
10:        Send Increment to node $i$
11:        Update for node $i$: $y_i = y_i - \text{Increment}$
12:      end if
13:    end if
14:  end while
15:  Now sequentially initiate an update with every neighbor $m$ such that $m > k$
16: end while


Intuitively, the algorithm works as follows. Each node $k$ starts by setting its average estimate to the value it stores, i.e. $y_k = z_k$. At each time step, it communicates its current estimate of the average with some immediate neighbor; then, these two neighbors adjust their estimates so that the two estimates are now closer, and such that their sum is conserved. Thus, after a sufficient number of such updates have occurred in the network, the average estimates of all the nodes will be very close to one another. Since their sum will still be the sum of all the values stored in the network (the sum of the average estimates is conserved in an update), these estimates will be close to the true average across the network. For a more detailed study of the algorithm, the reader is referred to [24].
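The sketch below simulates this pairwise-update dynamic on a small static graph. It is a simplified, randomly scheduled stand-in for the A1 protocol (no NACK handling or neighbor ordering), intended only to show that the estimates converge to the true average and that multiplying by the number of nodes recovers the sum needed by VIDE.

```python
import random

def simulate_pairwise_averaging(values, edges, gamma=0.5, num_updates=2000, seed=0):
    """Repeatedly pick a random edge (i, j) and move both estimates closer.

    Each update conserves the sum of the estimates, so every estimate
    converges toward the true average of `values`.
    """
    rng = random.Random(seed)
    y = list(values)                       # y[k] starts at z_k
    for _ in range(num_updates):
        i, j = rng.choice(edges)
        inc = gamma * (y[j] - y[i])
        y[i] += inc                        # node i moves toward node j's estimate
        y[j] -= inc                        # node j gives up exactly the same amount
    return y

z = [3.0, 7.0, 1.0, 9.0, 5.0]
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
estimates = simulate_pairwise_averaging(z, ring)
print(estimates)                           # all close to 5.0, the average of z
print(len(z) * estimates[0])               # approximately the sum, 25.0
```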

It is clear, from lines 1 and 2 of Algorithm 1, that VIDE needs to invoke the distributed averaging routine once for each hidden variable $X_h^{(i)}$. Conceptually, the A1 algorithm runs forever at each node. In the implementation of VIDE, however, the averaging process terminates when a predetermined maximum number of messages have been passed in the network.

In the next section, we present a theoretical analysis of the accuracy of VIDE, as a function of the communication complexity.

6. Analysis

In this section, we present error bounds for VIDE, and show that the bounds become exponentially tighter as the number of messages exchanged increases. Then, we derive an expression for the energy consumption of VIDE, which is important from a sensor network viewpoint.

Throughout this analysis, we assume that $H$ is the number of hidden variables, and $V$ is the number of visible variables.

The accuracy of VIDE will be affected by the following factors:
(1) The error in computing the parameters $\{\theta^c_{i,0}\}_{i=1}^{H}$ using the distributed averaging algorithm. As more messages are exchanged, this error becomes smaller.
(2) The error in computing $\{\mu_i\}_{i=1}^{H}$ by solving the mean-field equations, as the result of the error in $\{\theta^c_{i,0}\}_{i=1}^{H}$.
First, we consider the error due to the distributed averaging algorithm. Then we see how this affects the error in VIDE's results.

6.1. Error in distributed averaging

Our error analysis will be based on the analysis presented in [24]. Let us assume that there is a network with $V$ nodes, and each node $i$ stores a scalar $z_i$. Each node wishes to estimate $\bar{z} = \frac{1}{V}\sum_{i=1}^{V} z_i$. Let $y_i(t)$ be the estimate of this average at node $i$ and time $t$. Initially, $y_i(0) = z_i$ for all $i$, and the conservation property of the algorithm guarantees that $\bar{y}(t) = \frac{1}{V}\sum_{i=1}^{V} y_i(t) = \bar{z}$ for all times $t$.

To quantify the convergence properties, the authors of [24] use the following metric:
$$P(t) = \sum_{1 \le i < j \le V} |y_i(t) - y_j(t)|.$$
Thus, at time $t$, $P(t)$ is the sum of the differences in the average estimates between every pair of nodes. The minimum value of $P(t)$ is 0, which happens when every node has exactly the same estimate for the average.

It is shown in the paper that, given any time $t$, there exists $t' > t$ such that there has been at least one update in each link during the time interval $[t, t']$. Further,
$$P(t') \le \left(1 - \frac{8\gamma^*}{V^2}\right) P(t),$$
where $\gamma^* = \min_k \min\{\gamma_k, 1 - \gamma_k\}$. The authors use this result to demonstrate that
$$\lim_{t \to \infty} P(t) = 0.$$
We will take this analysis one step further, and present a convergence rate.

First, we assume that there exists $T > 0$ such that within any time interval of length $T$, there is guaranteed to be at least one update along each link. Such a restriction is desirable in any practical implementation of the eventual update assumption.

Then, the time interval $[0, t]$ contains at least $\lfloor t/T \rfloor$ subintervals of length $T$. Applying the above result, we get
$$P(t) \le P(0) \left(1 - \frac{8\gamma^*}{V^2}\right)^{\lfloor t/T \rfloor}.$$
We use this to prove the following result.

Theorem 6.1. Let $U$ be the number of updates that occur in the time interval $[0, t]$, and let $L$ be the total number of links in the network. Then,
$$|y_i(t) - \bar{z}| \le P(0) \left(1 - \frac{8\gamma^*}{V^2}\right)^{\lfloor U/LT \rfloor}.$$

Proof. Note that since $U$ is the number of updates that occurred in the interval $[0, t]$, $U$ is proportional to the number of messages exchanged. Also, since the number of links in the network is $L$, we have $U \le tL$; because $t \ge U/L$ and the contraction factor is less than 1, this implies
$$P(t) \le P(0) \left(1 - \frac{8\gamma^*}{V^2}\right)^{\lfloor U/LT \rfloor}.$$
Finally, if we assume, without loss of generality, that
$$y_1(t) \le y_2(t) \le \cdots \le y_V(t),$$
then it is easy to see that, for all $i$, $|y_i(t) - \bar{y}(t)| \le |y_1(t) - y_V(t)|$, and that
$$|y_1(t) - y_V(t)| = \left| \sum_{i=1}^{V-1} (y_i(t) - y_{i+1}(t)) \right| \le \sum_{i=1}^{V-1} |y_i(t) - y_{i+1}(t)| \le P(t).$$


Since $\bar{y}(t) = \bar{z}$ by conservation, this gives us the convergence rate of the averaging algorithm:
$$|y_i(t) - \bar{z}| \le P(0) \left(1 - \frac{8\gamma^*}{V^2}\right)^{\lfloor U/LT \rfloor},$$
which completes the proof. □

In VIDE, we compute each $\theta^c_{i,0}$ using the above algorithm (here, $i$ is an index that refers to the $i$th hidden variable), and so the above convergence rate applies to each such $\theta^c_{i,0}$.

In the next section we show how the error in solving the mean-field equations depends on the error in computing these average estimates.

6.2. Error in solving the mean-field equations

After the distributed averaging phase, each node will have its estimate of $\theta^c_{i,0}$, corresponding to the $i$th hidden variable $X_h^{(i)}$. In the subsequent phase, each node solves the system of mean-field equations to estimate the mean $\mu_i$ of each hidden variable $X_h^{(i)}$. This is a purely local computation; the error in this computation originates from:
(1) The error in $\theta^c_{i,0}$ for each hidden variable $X_h^{(i)}$.
(2) The error tolerance $\epsilon$ such that the iterative algorithm
$$\mu_i^{(t+1)} = \sigma\left( \sum_{j>i} \theta_{i,j} \mu_j^{(t)} + \theta^c_{i,0} \right)$$
stops when $\|\mu^{(t+1)} - \mu^{(t)}\| < \epsilon$.
Of these, the second factor does not, in any way, affect the communication complexity of our algorithm. Smaller values of $\epsilon$ will lead to more iterations of a purely local computation. Hence, we will concentrate only on the first factor, and assume that, given $\{\theta^c_{i,0}\}_{i=1}^{H}$, the mean-field equations are solved exactly.

Let $\Delta\mu_i$ be the error in computing $\mu_i$, the mean of the marginal distribution $Q_i$ corresponding to the $i$th variable $X_h^{(i)}$. Let $\Delta\theta^c_{i,0}$ be the error in computing $\theta^c_{i,0}$, incurred by the distributed averaging algorithm. Then, we have the following result:

Theorem 6.2. If $\{|\Delta\theta^c_{i,0}|\}_{i=1}^{H}$ are small enough so that we can ignore error terms of order 2 and higher, then
$$|\Delta\mu_i| \le \sum_{j=i+1}^{H} |\theta_{i,j}|\,|\Delta\mu_j| + |\Delta\theta^c_{i,0}|.$$

Proof. We have
$$\mu_i = \frac{1}{1 + \exp\left(-\left[\sum_{j=i+1}^{H} \theta_{i,j} \mu_j + \theta^c_{i,0}\right]\right)} = f\left(\{\mu_j\}_{j=i+1}^{H}, \theta^c_{i,0}\right).$$
Considering only first-order errors, we have
$$\Delta\mu_i = \sum_{j=i+1}^{H} \frac{\partial f}{\partial \mu_j} \Delta\mu_j + \frac{\partial f}{\partial \theta^c_{i,0}} \Delta\theta^c_{i,0}.$$
Taking magnitudes on both sides yields
$$|\Delta\mu_i| = \left| \sum_{j=i+1}^{H} \frac{\partial f}{\partial \mu_j} \Delta\mu_j + \frac{\partial f}{\partial \theta^c_{i,0}} \Delta\theta^c_{i,0} \right| \le \sum_{j=i+1}^{H} \left| \frac{\partial f}{\partial \mu_j} \Delta\mu_j \right| + \left| \frac{\partial f}{\partial \theta^c_{i,0}} \Delta\theta^c_{i,0} \right|.$$
Now, it is straightforward to calculate these partial derivatives:
$$\frac{\partial f}{\partial \mu_j} = \theta_{i,j}\, \frac{\exp\left(-\left[\sum_{j=i+1}^{H} \theta_{i,j} \mu_j + \theta^c_{i,0}\right]\right)}{\left(1 + \exp\left(-\left[\sum_{j=i+1}^{H} \theta_{i,j} \mu_j + \theta^c_{i,0}\right]\right)\right)^2},$$
which gives
$$\left| \frac{\partial f}{\partial \mu_j} \right| \le |\theta_{i,j}|,$$
and
$$\frac{\partial f}{\partial \theta^c_{i,0}} = \frac{\exp\left(-\left[\sum_{j=i+1}^{H} \theta_{i,j} \mu_j + \theta^c_{i,0}\right]\right)}{\left(1 + \exp\left(-\left[\sum_{j=i+1}^{H} \theta_{i,j} \mu_j + \theta^c_{i,0}\right]\right)\right)^2},$$
which gives
$$\left| \frac{\partial f}{\partial \theta^c_{i,0}} \right| \le 1.$$
Combining these inequalities, we get
$$|\Delta\mu_i| \le \sum_{j=i+1}^{H} |\theta_{i,j}|\,|\Delta\mu_j| + |\Delta\theta^c_{i,0}|,$$
which proves the theorem. □

Theorem 6.2 gives a recurrence relation for $\{\Delta\mu_i\}_{i=1}^{H}$, in terms of $\{\Delta\theta^c_{i,0}\}$. The following theorem provides the solution to this recurrence.

Theorem 6.3. For $0 \le i \le H - 1$,
$$|\Delta\mu_{H-i}| \le \sum_{k=H-i}^{H} |\Delta\theta^c_{k,0}|\, S(H - i, k),$$
where $S(l, l) = 1$ for $1 \le l \le H$, and for $1 \le l < m \le H$ we have
$$S(l, m) = \sum_{r=2}^{m-l+1}\ \sum_{l = i_1 < i_2 < \cdots < i_r = m} |\theta_{i_1, i_2}|\,|\theta_{i_2, i_3}| \cdots |\theta_{i_{r-1}, i_r}|.$$

Proof. We prove this statement by mathematical induction on $i$.

Basis [$i = 0$]: This follows directly from Theorem 6.2.

Induction: Let us assume the induction hypothesis to be true for $i$; we then prove it for $i + 1$. From Theorem 6.2,
$$\begin{aligned}
|\Delta\mu_{H-(i+1)}| &\le |\Delta\theta^c_{H-(i+1),0}| + \sum_{j=0}^{i} |\theta_{H-(i+1),H-j}|\,|\Delta\mu_{H-j}| \\
&\le |\Delta\theta^c_{H-(i+1),0}| + \sum_{j=0}^{i} |\theta_{H-(i+1),H-j}| \left( \sum_{k=H-j}^{H} |\Delta\theta^c_{k,0}|\, S(H-j, k) \right) \\
&= |\Delta\theta^c_{H-(i+1),0}| + \sum_{j=0}^{i} \sum_{k=H-j}^{H} |\theta_{H-(i+1),H-j}|\,|\Delta\theta^c_{k,0}|\, S(H-j, k) \\
&= |\Delta\theta^c_{H-(i+1),0}| + \sum_{k=H-i}^{H} \sum_{j=H-k}^{i} |\theta_{H-(i+1),H-j}|\,|\Delta\theta^c_{k,0}|\, S(H-j, k) \\
&= |\Delta\theta^c_{H-(i+1),0}| + \sum_{k=H-i}^{H} |\Delta\theta^c_{k,0}| \sum_{j=H-k}^{i} |\theta_{H-(i+1),H-j}|\, S(H-j, k) \\
&= |\Delta\theta^c_{H-(i+1),0}| + \sum_{k=H-i}^{H} |\Delta\theta^c_{k,0}|\, S(H-(i+1), k) \\
&= \sum_{k=H-(i+1)}^{H} |\Delta\theta^c_{k,0}|\, S(H-(i+1), k),
\end{aligned}$$
which proves the induction hypothesis. □

If we extend the definition of $S(\cdot, \cdot)$ so that $S(l, m) = 0$ if $l > m$, then we can restate Theorem 6.3 as
$$|\Delta\mu_{H-i}| \le \sum_{k=1}^{H} |\Delta\theta^c_{k,0}|\, S(H - i, k),$$
where $0 \le i \le H - 1$. Equivalently, we can write
$$|\Delta\mu_i| \le \sum_{k=1}^{H} |\Delta\theta^c_{k,0}|\, S(i, k),$$
where $1 \le i \le H$.

where 1� i�H .Theorem 6.3 expresses the error |��i | in computing the

mean �i of the marginal distribution of X(i)h . If we define

the variational mean vector as μ = (�1, �2, . . . , �H ), thenthe following theorem gives the error ‖�μ‖ in computingthe variational mean, as a function of the error in distributedaveraging.

Theorem 6.4.
$$\|\Delta\mu\| \le \sum_{k=1}^{H} |\Delta\theta^c_{k,0}|\, M(k),$$
where
$$M(k) = \sum_{i=1}^{H} S(i, k).$$

Proof. Since
$$\mu = (\mu_1, \mu_2, \ldots, \mu_H),$$
we have
$$\Delta\mu = (\Delta\mu_1, \Delta\mu_2, \ldots, \Delta\mu_H).$$
Thus,
$$\begin{aligned}
\|\Delta\mu\| &\le \sum_{i=1}^{H} |\Delta\mu_i| \le \sum_{i=1}^{H} \sum_{k=1}^{H} |\Delta\theta^c_{k,0}|\, S(i, k) = \sum_{k=1}^{H} \sum_{i=1}^{H} |\Delta\theta^c_{k,0}|\, S(i, k) \\
&= \sum_{k=1}^{H} |\Delta\theta^c_{k,0}| \left( \sum_{i=1}^{H} S(i, k) \right) = \sum_{k=1}^{H} |\Delta\theta^c_{k,0}|\, M(k),
\end{aligned}$$
where $M(k) = \sum_{i=1}^{H} S(i, k)$. This completes the proof. □

Finally, we can combine Theorems 6.1 and 6.4 to express the error in the variational mean vector as a function of the number of updates in the network. The next theorem presents this result.

Theorem 6.5.
$$\|\Delta\mu\| \le V\beta \left(1 - \frac{8\gamma^*}{V^2}\right)^{\lfloor U/LT \rfloor}$$
for some constant $\beta > 0$.

Proof. From Theorem 6.1 it follows that, for all $1 \le k \le H$,
$$|\Delta\theta^c_{k,0}| \le V P_k(0) \left(1 - \frac{8\gamma^*}{V^2}\right)^{\lfloor U/LT \rfloor}.$$
In the above inequality, $V$ appears as a multiplicative constant, because computing $\{\theta^c_{k,0}\}_{k=1}^{H}$ involves computing sums, which, in turn, are computed in a distributed fashion, by first calculating averages and then multiplying by the network size, $V$.

Thus, from Theorems 6.1 and 6.4 we have
$$\|\Delta\mu\| \le \sum_{k=1}^{H} |\Delta\theta^c_{k,0}|\, M(k) = \sum_{k=1}^{H} V P_k(0) \left(1 - \frac{8\gamma^*}{V^2}\right)^{\lfloor U/LT \rfloor} M(k) = V\beta \left(1 - \frac{8\gamma^*}{V^2}\right)^{\lfloor U/LT \rfloor},$$
where
$$\beta = \sum_{k=1}^{H} M(k) P_k(0).$$
This completes the proof. □

Our theoretical analysis establishes that the error decreases exponentially as the number of updates (and, hence, the number of messages) increases.

Now we will derive an expression for the energy consumption of VIDE.

6.3. Energy analysis

This section presents an analysis of the total amount of energy spent by the sensor nodes in VIDE. For simplicity, we shall make the following assumptions:
(1) Since a major part of a sensor node's energy is spent in communication, we shall focus only on communication, and ignore the other factors, whose contribution to the total energy spent is usually much smaller. Also, we shall ignore the effects of interference.
(2) We shall use the Rayleigh fading model to calculate the energy requirements [21].

We begin by defining some parameters. Let $N_0$ be the noise power, $\eta$ be the minimum value of the signal-to-noise ratio for successful transmission, $d_{\max}$ be the transmission range of a node, $p_r$ be the required minimum success probability of a message transmission, and $\alpha$ be the attenuation coefficient.

Then, the following theorem gives an expression for the minimum amount of energy required by VIDE to achieve a desired accuracy.

Theorem 6.6. The total energy $E_{\mathrm{VIDE}}$ required by a sensor network performing VIDE is related to the error $\Delta\mu$ as follows:
$$E_{\mathrm{VIDE}} \ge 2HLT\, \frac{\log(\|\Delta\mu\|/V\beta)}{\log \gamma_V} \left( \frac{d_{\max}^{\alpha}\, \eta\, N_0}{-\ln p_r} \right),$$
where
$$\gamma_V = 1 - \frac{8\gamma^*}{V^2}$$
and the other symbols have the same meanings as in the preceding theorems.

Proof. Theorem 6.5 shows that
$$\|\Delta\mu\| \le V\beta\, \gamma_V^{\lfloor U/LT \rfloor},$$
which can be rewritten as
$$\left\lfloor \frac{U}{LT} \right\rfloor \ge \frac{\log(\|\Delta\mu\|/V\beta)}{\log \gamma_V}.$$
This gives a lower bound on the number of updates required:
$$U \ge LT\, \frac{\log(\|\Delta\mu\|/V\beta)}{\log \gamma_V}.$$
Now, in each update operation, the node that initiates the update sends its current average estimate. The receiver node updates its average, and sends back the amount by which it has changed its average. This is done for each hidden variable. Thus, the total number of floating point numbers communicated during one update is $2H$, and over $U$ updates it is $2HU$. We also consider this to be the total number of messages required, assuming that one message is required per floating point number.

It is shown in [21] that, for each unit of communication, the amount of energy spent in transmission is $E_L = \frac{d_{\max}^{\alpha}\, \eta\, N_0}{-\ln p_r}$ (our notation differs slightly from that in [21]). Thus, the total amount of energy spent by the sensor network is bounded by
$$E_{\mathrm{VIDE}} \ge 2HU E_L.$$
That is,
$$E_{\mathrm{VIDE}} \ge 2HLT\, \frac{\log(\|\Delta\mu\|/V\beta)}{\log \gamma_V} \left( \frac{d_{\max}^{\alpha}\, \eta\, N_0}{-\ln p_r} \right),$$
which completes the proof. □
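For intuition, the bound of Theorem 6.6 can be evaluated numerically; the sketch below plugs in purely hypothetical parameter values (radio constants, network size, target error) that are not taken from the paper or from [21].

```python
import math

def vide_energy_lower_bound(H, L, T, V, gamma_star, beta, target_err,
                            d_max, alpha, eta, N0, p_r):
    """Evaluate the lower bound of Theorem 6.6 on the total energy of VIDE."""
    gamma_V = 1.0 - 8.0 * gamma_star / V**2              # contraction factor
    E_L = (d_max**alpha) * eta * N0 / (-math.log(p_r))   # energy per unit communication
    U_min = L * T * math.log(target_err / (V * beta)) / math.log(gamma_V)
    return 2 * H * U_min * E_L                           # 2H floats per update

# Hypothetical example values, for illustration only.
print(vide_energy_lower_bound(H=10, L=300, T=1.0, V=100, gamma_star=0.25,
                              beta=50.0, target_err=0.05,
                              d_max=30.0, alpha=2.0, eta=10.0, N0=1e-9, p_r=0.9))
```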

Our communication analysis shows that, as the number of messages exchanged in the network increases, the error drops exponentially. In the next section, we present experimental results that validate this claim.

7. Experimental study

This section presents the results of our experimental evaluation of VIDE. It begins by defining the objectives of the experiments; then it describes the experimental setup, and, finally, it presents the results.

7.1. Objectives

Clearly, the accuracy of the variational approximation depends on the accuracy of the distributed average computation. In our experiments, we wanted to see how close the distributed variational approximation was to the variational approximation done centrally.

Let $\mu_c$ be the solution vector to the mean-field equations computed by the centralized variational approximation; that is, $\mu_{c,i}$ is the mean of the $i$th hidden random variable under this approximation. Let, in the distributed scenario, the solution vector to the mean-field equations computed by the $j$th node be $\mu^{(j)}$. Then the relative error for node $j$ is $err(j) = \|\mu_c - \mu^{(j)}\| / \|\mu_c\|$. The maximum relative error, then, is $\mathrm{max\_err} = \max_j \{err(j)\}$.

We wish to observe the rate of decrease of max_err as the number of messages per node increases in the network.
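Concretely, the error metric can be computed as in the following sketch, where mu_c is the centralized mean-field solution and mus collects one per-node solution vector; the placeholder numbers are for illustration only.

```python
import numpy as np

def max_relative_error(mu_c, mus):
    """max_err = max_j ||mu_c - mu^(j)|| / ||mu_c|| over all nodes j."""
    mu_c = np.asarray(mu_c)
    return max(np.linalg.norm(mu_c - np.asarray(mu_j)) / np.linalg.norm(mu_c)
               for mu_j in mus)

# Placeholder data: 3 nodes, 4 hidden variables.
mu_c = [0.2, 0.8, 0.5, 0.6]
mus = [[0.21, 0.79, 0.52, 0.61],
       [0.19, 0.81, 0.48, 0.59],
       [0.25, 0.75, 0.55, 0.65]]
print(max_relative_error(mu_c, mus))
```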

Next, we discuss the experimental setup.


7.2. Experimental setup

In each experiment, we first generated the parameters of the Boltzmann machine by selecting the value of each parameter uniformly at random from the interval $[0, 1)$. We use these parameter values both for the centralized and the distributed inferencing. Ideally, we would want to learn the Boltzmann machine parameters from a training data set, and then perform the testing. However, we have not yet addressed the problem of learning these parameters in a heterogeneously distributed environment. So we test only the inferencing capability of VIDE: given an arbitrary setting of the parameter values, we want to know how accurate VIDE is compared to centralized variational inferencing.

Then, we generate a data set in which each vector represents an assignment of binary values to the visible variables. For each such vector, the solution to the mean-field equations is computed both in a centralized way and in a distributed way using VIDE. The maximum relative error max_err is then computed (as discussed above).

We have tested VIDE using both synthetic data sets and real-life time-series data sets, as described below.

Synthetic data sets: Synthetic data sets consist of uniformly randomly generated binary vectors with V dimensions, where V is the network size and, hence, the number of visible variables. The number of hidden variables, H, has been varied as follows:
• For V = 100, we have selected H ∈ {1, 2, 3, 4, 5, 10}.
• For V = 200, we have selected H ∈ {2, 4, 6, 8, 10, 20}.
• For V = 500, we have selected H ∈ {5, 10, 15, 20, 25}.

Real-life time-series data sets

The real-life time-series data sets used in this paper have been downloaded from http://www.cs.ucr.edu/~eamonn/time_series_data/. We have used two data sets, Adiac and 50words. In future, we would like to test VIDE with many more real-life data sets.

These data sets consist of vectors of real numbers. We have transformed these vectors to binary vectors using local gradient-based discretization. If the value of an attribute of the vector increases from one time instant to the next, then, in the binary version of the vector, that attribute is assigned the value 1; otherwise it is assigned the value 0.
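A minimal sketch of this local gradient-based discretization is given below; how the final attribute (which has no successor) is handled is not specified in the paper, so it is set to 0 here as an assumption.

```python
import numpy as np

def local_gradient_binarize(series):
    """Binarize a real-valued vector: 1 where the next value is larger, else 0."""
    x = np.asarray(series, dtype=float)
    binary = np.zeros(len(x), dtype=int)
    binary[:-1] = (x[1:] > x[:-1]).astype(int)   # compare each value with its successor
    return binary

print(local_gradient_binarize([0.1, 0.4, 0.3, 0.9, 0.9]))   # [1 0 1 0 0]
```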

The hidden and visible variables were chosen as follows:
• Adiac: Each vector in this data set has 177 attributes. We chose the leftmost 100 attributes to be visible, and the next H to be hidden, where H ∈ {1, 2, 3, 4, 5, 10}. The remaining attributes were ignored.
• 50words: Each vector in this data set has 271 attributes. We chose the leftmost 200 attributes to be visible, and the next H to be hidden, where H ∈ {2, 4, 6, 8, 10, 20}. The remaining attributes were ignored.

We now present the results of our experiments.

7.3. Experimental results

As mentioned earlier, we wish to measure the accuracy of VIDE as a function of communication bandwidth. In each of the plots that follow, the x-axis represents the number of messages communicated by each node, where we assume that it takes a single message to communicate a floating point number. The y-axis represents the maximum relative error, max_err, averaged over a specified number of trials (also mentioned in the figures).

Fig. 1. Error vs. communication cost for V = 100 using synthetic data. (Plot: x-axis, number of messages per node; y-axis, maximum relative error averaged over 100 trials; one curve for each H ∈ {1, 2, 3, 4, 5, 10}.)

Fig. 2. Error vs. communication cost for V = 200 using synthetic data. (Plot: x-axis, number of messages per node; y-axis, maximum relative error averaged over 100 trials; one curve for each H ∈ {2, 4, 6, 8, 10, 20}.)

First, we present the results for synthetically generated test data.

Experiments with synthetic data: The dependence of max_err on the number of messages per node, for networks of size V = 100, 200, and 500, is demonstrated in Figs. 1, 2, and 3, respectively.

Experiments with real-life time-series data: The dependence of max_err on the number of messages per node, for networks of size V = 100 (corresponding to data set Adiac) and V = 200 (corresponding to data set 50words), is demonstrated in Figs. 4 and 5, respectively.

Fig. 3. Error vs. communication cost for V = 500 using synthetic data. (Plot: x-axis, number of messages per node; y-axis, maximum relative error averaged over 100 trials; one curve for each H ∈ {5, 10, 15, 20, 25}.)

Fig. 4. Error vs. communication cost for V = 100 using the Adiac data set. (Plot: x-axis, number of messages per node; y-axis, maximum relative error averaged over 390 trials; one curve for each H ∈ {1, 2, 3, 4, 5, 10}.)

Fig. 5. Error vs. communication cost for V = 200 using the 50words data set. (Plot: x-axis, number of messages per node; y-axis, maximum relative error averaged over 454 trials; one curve for each H ∈ {2, 4, 6, 8, 10, 20}.)

7.4. Remarks

Let us consider the experiment with synthetic data, with the number of visible variables V = 100. As indicated by Fig. 1, the maximum relative error falls exponentially as the number of messages per node increases. This means that the behavior of centralized variational inference can be closely approximated even with a modest amount of communication.

Another interesting parameter to note from Fig. 1 is the rate at which the maximum relative error decreases as the number of messages per node increases. It is seen from Fig. 1 that this rate decreases as the number of hidden variables, H, increases. At H = 1, the maximum relative error falls very sharply. But at H = 10, the rate of decrease is very low.

The reason for this is that, during the distributed average computation phase, the amount of data communicated between a pair of nodes, during each update operation, increases linearly with H. This means that, for the same communication bandwidth allowance, the number of update operations will be higher, leading to greater accuracy, if H is lower. In particular, let U be the number of updates. Each pairwise update involves the exchange of 2H floating point numbers. Assuming that one message is required for communicating each floating point number, the total number of messages required, then, is 2HU (as shown in the proof of Theorem 6.6), and the number of messages per node is $n = 2HU/V$. We can rewrite this as
$$U = \frac{1}{2}\, n\, \frac{V}{H}.$$
Thus, for the same number of messages per node, the number of updates is greater when V/H is greater, that is, when H is smaller compared to V.
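For a concrete sense of scale (the numbers here are chosen purely for illustration and are not reported in the paper), take $V = 100$ and a budget of $n = 100$ messages per node. The relation above then gives
$$U = \tfrac{1}{2} \cdot 100 \cdot \tfrac{100}{1} = 5000 \ \text{updates for } H = 1, \qquad U = \tfrac{1}{2} \cdot 100 \cdot \tfrac{100}{10} = 500 \ \text{updates for } H = 10,$$
so the same per-node message budget buys ten times fewer averaging updates when $H$ grows from 1 to 10, which is consistent with the slower error decay observed in Fig. 1.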

The above analysis also suggests that if we change V and H keeping their ratio constant, then the rate of decrease of the maximum relative error should be comparable.

This conjecture is strengthened by the results reported in Figs. 2 and 3, and even by the experiments done with real-life data, as shown in Figs. 4 and 5. We see that the curves corresponding to H = 1, V = 100 (Fig. 1), H = 2, V = 200 (Fig. 2), H = 5, V = 500 (Fig. 3), H = 1, V = 100 (Fig. 4), and H = 2, V = 200 (Fig. 5) all have the same shape. Moreover, they correspond to the same value of V/H, which is 100. Other sets of curves having the same V/H have the same property: the rate of decrease of the maximum relative error is similar.

We next present experimental results to demonstrate the scalability of VIDE.

7.5. Experimental results: scalability

In this section, we investigate the scalability of VIDE. To do so, we need to examine how the number of messages transmitted per node increases with network size in order to achieve a given level of accuracy, for a given number of hidden variables. We consider scalability both for synthetic and real-life time-series data.


Fig. 6. Scalability of VIDE—experiments with synthetic data. Here, H and V represent the number of hidden variables and the number of visible variables, respectively. V is also the size of the network. [Plot: number of messages transmitted per node vs. accuracy = −(maximum relative error), for H = 10 with V = 100, 200, 500.]

Scalability: synthetic data: The scalability of VIDE with synthetic data is demonstrated in Fig. 6. In this figure, the x-axis denotes the desired level of accuracy. Since accuracy increases as the maximum relative error decreases, we define accuracy as the negative of the maximum relative error. On the y-axis, we plot the number of messages that must be transmitted per node, for various network sizes. A sketch of this accuracy measure is given below.
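As a minimal sketch of the accuracy measure (our own illustration, assuming that the maximum relative error compares the variational parameters produced by VIDE against those of the centralized solution; the function and argument names are ours, not the paper's):

import numpy as np

def max_relative_error(mu_distributed, mu_centralized):
    """Largest relative deviation of the VIDE parameters from the
    centralized variational parameters."""
    return float(np.max(np.abs(mu_distributed - mu_centralized)
                        / np.abs(mu_centralized)))

def accuracy(mu_distributed, mu_centralized):
    """Accuracy, defined as the negative of the maximum relative error."""
    return -max_relative_error(mu_distributed, mu_centralized)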

Remark. It is seen from Fig. 6 that, for reasonably high levels of accuracy, the plots for the three network sizes are very close to each other. This indicates that, for a given accuracy level, the number of messages transmitted per node increases only slightly as the network becomes larger. This demonstrates that VIDE is highly scalable with synthetic data.

Next, we investigate the scalability of VIDE with real-life time-series data.

Scalability: real-life time-series data: For the real-life time-series experiments, we select the same data sets, Adiac and 50words, as described earlier. The scalability of VIDE with real-life data is demonstrated in Fig. 7. In this figure, the x-axis denotes the desired level of accuracy, again defined as the negative of the maximum relative error, and the y-axis shows the number of messages that must be transmitted per node, for various network sizes. For V = 100, we use the Adiac data set; for V = 200, we use the 50words data set.

Remark. It is seen from Fig. 7 that, even when the network size is doubled, the number of messages transmitted per node for a given level of accuracy increases very slightly. This indicates that VIDE is scalable even with real-life data.

Fig. 7. Scalability of VIDE—experiments with real-life time-series data. Here, H and V represent the number of hidden variables and the number of visible variables, respectively. V is also the size of the network. [Plot: number of messages transmitted per node vs. accuracy = −(maximum relative error), for H = 10 with V = 100 and V = 200.]

To summarize, the experimental results show that the communication requirements of VIDE are modest; even with a limited communication allowance, VIDE achieves sufficiently accurate results. Moreover, VIDE is a scalable technique, because the communication requirement per node increases at a very slow rate as the network becomes larger. Our experimental study thus establishes VIDE as a deterministic approximation technique for performing distributed inferencing in a communication-efficient and scalable manner, and empirically supports the claim of local behavior. The next section concludes the paper.

8. Conclusion and future work

DDM problems are often challenging because of communication bandwidth limitations. Much contemporary research in this field aims at finding probabilistic approximation algorithms for solving these problems. This paper takes a different approach and explores the applicability of deterministic approximations to DDM problems. In particular, it demonstrates the applicability of a deterministic approximation technique, the variational method, to the problem of distributed inferencing in a graphical model. It extends the variational formulation of the probabilistic inferencing problem to the heterogeneously distributed scenario. Based on this formulation, it presents a deterministic approximation algorithm, VIDE, for distributed inferencing. The theoretical analysis and experimental results presented in this paper indicate that VIDE is a communication-efficient solution to the problem of probabilistic inferencing in heterogeneously distributed environments.

Natural extensions of this work would involve exploring whether other DDM problems can be formulated using the variational optimization framework, and whether the variational method leads to communication-efficient techniques for solving these problems. We would like to investigate these issues in the future.


Acknowledgments

The authors gratefully acknowledge the partial support provided by the National Science Foundation CAREER Award IIS-0093353. The authors would also like to thank Dr. Eamonn Keogh for making useful time-series data sets available at http://www.cs.ucr.edu/∼eamonn/time_series_data/.


Sourav Mukherjee is a graduate student pursuing a PhD in the Department of Computer Science at the University of Maryland, Baltimore County, USA. He received his MS in Computer Science from the University of Maryland, Baltimore County, USA, in 2007, and his BE in Computer Science and Engineering from Jadavpur University, India, in 2005. His research interests include data mining, distributed data mining, and machine learning. He regularly reviews research articles for several leading data mining conferences. In recognition of his academic achievements, he was awarded membership in the Honors Society of Phi Kappa Phi in 2007.

Hillol Kargupta is an associate professor in the Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County. He received the PhD degree in Computer Science from the University of Illinois at Urbana-Champaign in 1996. He is also a co-founder of Agnik, LLC, a ubiquitous data intelligence company. His research interests include mobile and distributed data mining and computation in the biological process of gene expression. He won a US National Science Foundation CAREER Award in 2001 for his research on ubiquitous and distributed data mining. He, along with his coauthors, received the best paper award at the 2003 IEEE International Conference on Data Mining for a paper on privacy-preserving data mining. He won the 2000 TRW Foundation Award and the 1997 Los Alamos Award for Outstanding Technical Achievement. His research has been funded by the US National Science Foundation, US Air Force, Department of Homeland Security, NASA, and various other organizations. He has published more than 80 peer-reviewed articles in journals, conferences, and books. He has coedited two books: Advances in Distributed and Parallel Knowledge Discovery, AAAI/MIT Press, and Data Mining: Next Generation Challenges and Future Directions, AAAI/MIT Press. He is an associate editor of the IEEE Transactions on Knowledge and Data Engineering and the IEEE Transactions on Systems, Man, and Cybernetics, Part B. He regularly serves on the organizing and program committees of many data mining conferences. More information about him can be found at http://www.cs.umbc.edu/∼hillol.