
J. Stat. Mech. (2009) P07006

Journal of Statistical Mechanics: Theory and Experiment (an IOP and SISSA journal)

Finding hypergraph communities: a Bayesian approach and variational solution

Alexei Vazquez

The Simons Center for Systems Biology, Institute for Advanced Study, Einstein Drive, Princeton, NJ 08540, USA
E-mail: [email protected]

Received 4 January 2009
Accepted 5 May 2009
Published 1 July 2009

Online at stacks.iop.org/JSTAT/2009/P07006
doi:10.1088/1742-5468/2009/07/P07006

Abstract. Data clustering, including problems such as finding network communities, can be put into a systematic framework by means of a Bayesian approach. Here we address the Bayesian formulation of the problem of finding hypergraph communities. We start by introducing a hypergraph generative model with a built-in group structure. Using a variational calculation we derive a variational Bayes algorithm, a generalized version of the expectation maximization algorithm with a built-in penalization for model complexity or bias. We demonstrate the good performance of the variational Bayes algorithm using test examples, including finding network communities. A MATLAB code implementing this algorithm is provided as supplementary material.

Keywords: data mining (theory), random graphs, networks

© 2009 IOP Publishing Ltd and SISSA


Contents

1. Introduction
2. Previous work
   2.1. Bayesian approach and variational solution
   2.2. Statistical model with a population structure
   2.3. Prior distributions
   2.4. Mean-field approximation
3. Finding hypergraph and graph communities
   3.1. VB algorithm implementation
   3.2. Test example: zoo problem
   3.3. Test example: finding network modules
4. Conclusions

Acknowledgments

References

1. Introduction

Mixture models provide an intuitive statistical representation of data sets structured in groups, clusters or classes [1], decomposing a complex data set into the superposition of simpler data sets. The inverse problem consists in determining the group decomposition and the statistical parameters characterizing each group. For a fixed number of groups the expectation maximization (EM) algorithm provides a recursive solution to the inverse problem [2]. The estimation of the right number of groups has, however, been a great challenge. Corrections such as the Akaike information criterion (AIC) [3] and the Bayesian information criterion (BIC) [4] have been derived, penalizing model complexity and overfitting. Yet, the number of groups estimated from these criteria is in general unsatisfactory [5].

In contrast, a Bayesian approach does not attempt to estimate the 'optimal' number of groups, but instead averages over models with different numbers of groups [6]. The Bayesian approach is becoming a popular technique for solving problems in data analysis, model selection and hypothesis testing [7]–[9]. Many of the original ideas come from the early work of Jeffreys [6], but only recently have they started to be used widely [7]–[9]. The application of Bayesian approaches to real problems can be, however, quite challenging. In most cases the solution is explored via Monte Carlo sampling [10, 8] or variational methods [8, 11, 12]. The application of variational methods to Bayesian problems results in the variational Bayes (VB) algorithm [8, 11]. The VB algorithm is a set of self-consistent equations analogous to the EM algorithm. They can be solved recursively, obtaining an approximate solution to the inverse inference problem. These methods have been applied, for example, to Gaussian mixture models for real-valued data [1, 13], Dirichlet mixture models for categorical data [14] and the problem of finding graph modules [15].

Several real systems can be represented by a hypergraph. A hypergraph is an intuitive extension of the concept of a graph or network where the nodes represent the system elements and the edges (also called hyperedges) are sets of any number of elements. For example, a publication database can be mapped into a hypergraph, with the vertices representing authors and the edges representing papers. Other examples are protein complexes, providing associations among one, two or more proteins. These same examples could be mapped into graphs as well, but only at the cost of distorting the actual nature of the system. Nevertheless, from the network approaches to these and other systems we have learned that the vertices are organized in network modules or communities [16]–[20], [15], and the same is expected for the hypergraph representation.

Inspired by a statistical model on graphs [20] we have recently introduced a statistical model on hypergraphs [21], as a means to infer hypergraph communities. Here we further study this statistical model using a Bayesian approach and a variational solution. In section 2 we review previous work on the Bayesian approach and its variational solution. Then in section 3 we study the problem of finding hypergraph communities. After introducing a generative model for hypergraphs, we obtain a VB algorithm. Because of its Bayesian root, the VB algorithm has a built-in correction for model complexity or bias and, therefore, does not require the use of additional complexity criteria. The performance of the VB algorithm is tested in some examples, and we obtain satisfactory results whenever there is a significant distinction between the groups.

2. Previous work

2.1. Bayesian approach and variational solution

The Bayesian approach is a systematic methodology for interpreting complex data sets and for evaluating model hypotheses. Its main ingredients or steps are: given a data set D, (i) introduce a statistical model with model parameters φ, (ii) write down the likelihood of observing the data given the proposed model and parameters, P(D|φ), (iii) determine the prior distribution for the model parameters based on our current knowledge, P(φ), and, finally, (iv) invert the statistical model of the data given the likelihood and prior distribution to obtain the posterior distribution of the model parameters given the model and data, P(φ|D). The latter step is based on Bayes' rule

$$P(\phi|D) = \frac{1}{Z}\, P(D|\phi)\, P(\phi) \qquad (1)$$

where

$$Z = P(D) = \int d\phi\, P(D|\phi)\, P(\phi). \qquad (2)$$

Having obtained the distribution of the model parameters, at least formally, we can determine other quantities. For example, the average of a quantity A(φ) is given by

$$\langle A(\phi) \rangle = \int d\phi\, P(\phi|D)\, A(\phi). \qquad (3)$$


In practice calculating (2) or (3) is a formidable task. A very powerful approximation scheme is the variational method [8, 11]. The main idea of the variational method is to approximate the generally difficult-to-handle distribution P(φ|D) by a distribution Q(φ|D) of a more tractable form. In the following we omit the dependence of Q on D and just write Q(φ). Given Q(φ) we can obtain a bound for F = −ln Z using Jensen's inequality

$$F = -\ln Z = -\ln \int d\phi\, Q(\phi)\, \frac{P(D|\phi)\,P(\phi)}{Q(\phi)} \le -\int d\phi\, Q(\phi) \ln \frac{P(D|\phi)\,P(\phi)}{Q(\phi)}. \qquad (4)$$

The latter equation can be rewritten as [8]

$$F \le U - TS \qquad (5)$$

where T = 1,

$$U = -\int d\phi\, Q(\phi) \ln P(D|\phi) \qquad (6)$$

is minus the average log-likelihood and

$$S = -\int d\phi\, Q(\phi) \ln \frac{Q(\phi)}{P(\phi)} \qquad (7)$$

is minus the Kullback–Leibler divergence of Q(φ) relative to the prior distribution P(φ) [22]. Equation (5) resembles the usual free energy in statistical mechanics, F = U − TS, where U, S and T are the internal energy, entropy and temperature of the system, the temperature being expressed in units of the Boltzmann constant kB. Minus the average log-likelihood plays the role of the internal energy, minus the Kullback–Leibler divergence of Q(φ) plays the role of the entropy, and the temperature equals 1.

Equation (5) emphasizes the two components determining the best choice of variational distribution Q(φ): fit to the data and model bias. How well the data are fitted is quantified by the internal energy U (6). To achieve the best fit, or internal energy ground state, Q(φ) should be concentrated around the regions of the parameter space where P(D|φ) is maximum. The best choice in this respect is the maximum likelihood estimate (MLE)

$$Q_{\mathrm{MLE}}(\phi) = \delta(\phi - \phi^*) \qquad (8)$$

where

$$\phi^* = \arg\max_{\phi}\, \ln P(D|\phi). \qquad (9)$$

In the opposite extreme, when no data are presented to us, the best distribution is the one maximizing the entropy term S, i.e. minimizing the Kullback–Leibler divergence relative to the prior distribution. This maximum entropy (ME) solution is the prior distribution itself,

$$Q_{\mathrm{ME}}(\phi) = P(\phi). \qquad (10)$$

In general, the drive to better fit the data is opposed by the tendency to obtain the least biased model. The variational solution therefore lies between the two extremes of biased models fitting the data very well and completely unbiased models giving a bad fit to the data. It is obtained by minimizing the right-hand side of (5) with respect to Q(φ) over a restricted class of functions. This variational solution Q(φ) represents the closest distribution to P(φ|D) within the class of functions considered.
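The sense in which the variational solution is 'closest' to the posterior can be made explicit. Since $P(D|\phi)P(\phi) = Z\,P(\phi|D)$, the right-hand side of (4) can be rewritten (a standard identity, stated here for completeness) as

$$-\int d\phi\, Q(\phi) \ln \frac{P(D|\phi)\,P(\phi)}{Q(\phi)} = F + \int d\phi\, Q(\phi) \ln \frac{Q(\phi)}{P(\phi|D)} \ge F,$$

so the gap between the bound and F is exactly the Kullback–Leibler divergence of Q(φ) relative to the posterior P(φ|D), and minimizing the bound minimizes this divergence.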


2.2. Statistical model with a population structure

In this section we present the generalities of statistical models with a first-level population structure. Similar models have been studied in [14, 15]. Our working hypothesis is that there is a hidden population structure, characterized by the subdivision of the population into groups. We assume that we are given a data set D which, in some way to be determined, reflects the population structure. The problem consists in inferring this hidden structure and the associated model parameters from the data. To tackle this problem we introduce a statistical model with a built-in population structure as a generative model of the data. The population structure and the model parameters are then inferred by solving the inverse problem. More precisely:

(i) We consider a population composed of n elements, individuals or samples divided into K groups.

(ii) The sample assignment to groups is generated by a multinomial model with probabilities πk, k = 1, . . . , K. Denoting by gi the group to which the ith sample belongs, we obtain

$$P(g|\pi) = \prod_{i=1}^{n} \pi_{g_i}. \qquad (11)$$

(iii) Given the group assignments gi, and depending on the data set, we write down the likelihood P(D|g, θ) of observing the data, parametrized by the parameter set θ.

(iv) Putting all this together we obtain the posterior distribution

$$P(\phi|D) = \frac{1}{Z}\, P(D|g,\theta)\, P(g|\pi)\, P(\theta)\, P(\pi)\, P(K) \qquad (12)$$

where φ = (g, θ, π, K) and P(θ), P(π) and P(K) are the prior distributions of θ, π and K.

The form of the prior distributions, except for P(K), is the subject of the next section. The distribution P(K) is irrelevant for problems with large data sets. The difference between the log-likelihoods of models with different values of K is in general of the order of the data set size and, as a consequence, the contribution of ln P(K) is negligible. Thus, in the following sections we simply neglect the contribution given by P(K). Finally, we specify the likelihood P(D|g, θ) when addressing specific problems.

2.3. Prior distributions

The choice of the prior distribution P(φ) is probably one of the least obvious topics in Bayesian analysis. Currently the predominant choice is the use of conjugate priors. The form of conjugate priors is indicated by the likelihood, making the prior selection less ambiguous. For example, the binomial likelihood $P(n|p) \propto p^n (1-p)^{N-n}$ suggests a beta distribution for P(p|n). Furthermore, on choosing a beta distribution as a prior, $P(p) \propto p^{\tilde\alpha - 1}(1-p)^{\tilde\beta - 1}$, the posterior distribution remains a beta distribution, but with exponents $\alpha = \tilde\alpha + n$ and $\beta = \tilde\beta + N - n$. In this sense, the beta distribution is the conjugate prior of the binomial likelihood.
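As a quick illustration of this conjugate update (with hypothetical numbers, not taken from the paper):

```python
from scipy import stats

# n = 7 successes in N = 10 trials, with a Beta prior; conjugacy gives
# a Beta(alpha~ + n, beta~ + N - n) posterior in closed form.
alpha_prior, beta_prior = 1.0, 1.0
n, N = 7, 10
posterior = stats.beta(alpha_prior + n, beta_prior + N - n)
print(posterior.mean())   # (alpha~ + n) / (alpha~ + beta~ + N) = 8/12
```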


Yet, the fact that the form of conjugate priors is suggested by the likelihood does not demonstrate that they are the correct choice of priors. Moreover, even if we accept their use, it is not clear what is the correct choice for the prior distribution parameters, e.g. α̃ and β̃. Different methods have been proposed for determining these parameters. In general they are based on a posteriori analyses, e.g. calculations making use of the data in some way or another. Such methods violate, however, the concept of a prior distribution, defined as the distribution of the model parameters in the absence of the data.

An alternative approach is that of Jaynes [23]. According to Jaynes, in the absence of any data, the priors should be solely determined on the basis of the symmetries and constraints of the problem under consideration. In this work we make use of Jaynes' approach to determine the prior distribution. In our case we need to determine the prior distributions for the probabilities θ and π, characterizing binomial and multinomial models respectively. Thus, let us consider the multinomial model with K states

$$P(n|\pi) = \frac{\left(\sum_{k=1}^{K} n_k\right)!}{\prod_{k=1}^{K} n_k!} \prod_{k=1}^{K} \pi_k^{n_k} \qquad (13)$$

where $n_k$ is the number of times state k was observed and $\pi_k$ is the probability of observing state k in one trial, $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$. The case K = 2 corresponds to the binomial model and was already addressed by Jaynes [23]. Here we use a similar approach to derive the prior distribution for the multinomial model.

The probabilities πk may differ depending on our beliefs, e.g. the belief that all states are equally probable. Different investigators may hold different beliefs, resulting in different choices of πk. The main assumption is that the prior distribution should be independent of our specific belief and, therefore, should be invariant under a belief transformation.

Belief transformation: Let us represent by $S_k$ the state k, and let $P(S_k|E)$ and $P(S_k|E')$ be the probabilities of observing state $S_k$ in one trial according to beliefs E and E', respectively. From Bayes' rule it follows that

$$P(S_k|E') = \frac{P(E'|S_k, E)\, P(S_k|E)}{\sum_{j=1}^{K} P(E'|S_j, E)\, P(S_j|E)} \qquad (14)$$

for k = 1, . . . , K. The latter equation can be rewritten as

$$\pi'_k = \frac{a_k}{A}\, \pi_k \qquad (15)$$

for k = 1, . . . , K − 1 and $\pi'_K = 1 - \sum_{k=1}^{K-1} \pi'_k$, where $\pi_k = P(S_k|E)$, $\pi'_k = P(S_k|E')$,

$$a_k = \frac{P(E'|S_k, E)}{P(E'|S_K, E)} \qquad (16)$$

and

$$A = 1 + \sum_{k=1}^{K-1} (a_k - 1)\, \pi_k. \qquad (17)$$

Equation (15) provides the rules of transformation of the probabilities πk from one system of belief to another.


Invariance under the above transformation leads to the following functional equation for the prior distribution P(π):

$$P(\pi')\, d\pi' = P(\pi)\, d\pi. \qquad (18)$$

To solve this equation we first need to compute the determinant of the transformation Jacobian. The Jacobian of the transformation (15) has the matrix elements

$$J_{ij} = \frac{\partial \pi'_i}{\partial \pi_j} = \frac{a_i \delta_{ij}}{A} - \frac{a_i (a_j - 1)\, \pi_i}{A^2}, \qquad (19)$$

i, j = 1, . . . , K − 1. This matrix can be decomposed into the product J = BC, where B ($B_{ij} = a_i \delta_{ij}/A$) is a diagonal matrix and C ($C_{ij} = \delta_{ij} - (a_j - 1)\pi_i/A$) has two eigenvalues, $\lambda_1 = A^{-1}$ and a (K − 2)-fold degenerate eigenvalue $\lambda_2 = 1$. Putting all of this together we obtain

$$|J| = |B|\, \lambda_1 \lambda_2^{K-2} = \frac{1}{A^K} \prod_{k=1}^{K} a_k. \qquad (20)$$

The solution of (18), with $d\pi' = |J|\, d\pi$, is given by

$$P(\pi) = \mathrm{const} \times \prod_{k=1}^{K} \pi_k^{-1}. \qquad (21)$$

Note that for K = 2, with $\pi_1 = p$ and $\pi_2 = 1 - p$, we recover the result of Jaynes for the binomial model [23],

$$P(p) \propto p^{-1}(1-p)^{-1}. \qquad (22)$$
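Indeed, one can check directly that (21) satisfies the invariance condition (18): under (15) every component transforms as $\pi'_k = a_k \pi_k / A$ (with $a_K = 1$), so, using the Jacobian (20),

$$P(\pi')\, d\pi' = \mathrm{const} \times \prod_{k=1}^{K} \frac{A}{a_k \pi_k} \times \frac{1}{A^K} \prod_{k=1}^{K} a_k\, d\pi = \mathrm{const} \times \prod_{k=1}^{K} \pi_k^{-1}\, d\pi = P(\pi)\, d\pi.$$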

The prior distributions (21) and (22) are improper, i.e. their integral over the parameter space is not finite. At first this may seem an unsuitable property for a prior distribution. Nevertheless, the improper nature of these prior distributions simply indicates that the symmetries in our problem are not sufficient to fully determine them. Data are required to obtain a proper distribution. The best example for an intuitive understanding of these arguments is the prior distribution of a location parameter. In the absence of any data and under the assumption of translational invariance, it is clear that every value on the real line is an equally probable value for the location parameter, resulting in an improper prior. From the operational point of view, the posterior distribution may be proper even when the prior is not. Indeed, the integral $\int d\phi\, P(\phi)$ may diverge, while $\int d\phi\, P(\phi|D) \propto \int d\phi\, P(D|\phi)\, P(\phi)$ may be finite. On the other hand, the posterior distribution can be improper when the inference problem has not been correctly formulated or there are not sufficient data to determine the model parameters.

To avoid dealing with improper distributions, we can renormalize improper priors as some limit of a proper distribution. Since conjugate priors facilitate analytical calculations they are a good starting point. In particular, for the multinomial probabilities π we use the renormalized invariant priors

$$P(\pi) = \frac{1}{\mathrm{Beta}(\tilde\gamma)} \prod_{k=1}^{K} \pi_k^{\tilde\gamma_k - 1} = D(\pi; \tilde\gamma) \qquad (23)$$


with $\tilde\gamma \to 0$, where $D(\pi; \tilde\gamma)$ is the generalized beta distribution (Dirichlet distribution) and

$$\mathrm{Beta}(\tilde\gamma) = \frac{\prod_{k=1}^{K} \Gamma(\tilde\gamma_k)}{\Gamma\left(\sum_{k=1}^{K} \tilde\gamma_k\right)} \qquad (24)$$

is the generalized beta function.

2.4. Mean-field approximation

In this section we specify the form of the variational function Q(φ). To allow for an analytical solution we neglect correlations between the group assignments and the remaining model parameters. We denote by pik the probability that sample i belongs to group k. Furthermore, given that θ and π always appear in different factors in (12), their joint distribution factorizes. Within the mean-field approximation for the group assignments and the latter factorization, the variational function can be written as

$$Q(\phi) = \prod_{i} p_{i g_i}\, R_1(\theta)\, R_2(\pi). \qquad (25)$$

Summarizing, in the case studies below we solve the generative model (12), making use of the renormalized invariant priors (23) and the MF variational function (25). This approach rests on the assumptions that the population is divided into groups, that the group assignments are generated by a multinomial model, that the priors are renormalized invariant distributions, and that the variational solution is treated within a MF approximation with respect to the group assignments.

3. Finding hypergraph and graph communities

A hypergraph is an intuitive extension of the concept of a graph or network where the nodes represent the system elements and the edges (also called hyperedges) are sets of any number of elements (figure 1(a)). This mathematical construction is very useful for representing a population of elements and their attributes. For example, consider the animal population in figure 2(a) together with their attributes: habitat, nutrition behavior, etc. In this case the hypergraph nodes represent animals. Furthermore, we can use an edge to represent the association between all animals with a given attribute: edge 1, all non-airborne animals; edge 2, all airborne animals; and so on (figure 2(b)).

The hypergraph representing the data may be characterized by some structure in the form of communities or modules, groups of vertices having similar connectivity patterns [16]–[20], [15]. This structure can be encoded into a generative hypergraph model providing the probability of observing a hypergraph given the communities. Specifically, we consider a statistical model previously introduced in the context of graphs [20] and later extended to hypergraphs [21]:

Data: Consider a hypergraph with a vertex set representing n samples and m edges characterizing the relationships among them. The hypergraph is specified by its adjacency matrix a, where aij = 1 if element i belongs to edge j and aij = 0 otherwise.


Figure 1. Hypergraph. (a) A hypergraph with three edges. Each edge is represented by a circle and is composed of the nodes within the circle. (b) Bipartite graph representation of the hypergraph in (a), the squares representing the hypergraph edges. (c), (d) Nearest-neighbor mapping of a graph onto a hypergraph. The graph in (c) is mapped onto the hypergraph in (d), where each hyperedge represents a set of nearest neighbors of a node in graph (c), as indicated by the enumeration.

Likelihood: The adjacency matrix elements are generated by a binomial model with group-dependent probabilities θkj, k = 1, . . . , K and j = 1, . . . , m, resulting in

$$P(a|g,\theta) = \prod_{ij} \theta_{g_i j}^{a_{ij}} \left(1 - \theta_{g_i j}\right)^{1 - a_{ij}}. \qquad (26)$$
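For concreteness, here is a minimal sketch of this generative process in Python (the paper's supplementary code is MATLAB; the function name and parameters here are illustrative only):

```python
import numpy as np

def sample_hypergraph(n, m, K, pi, theta, seed=0):
    # Sample group assignments from the multinomial model (11) and an
    # n x m incidence matrix from the Bernoulli likelihood (26).
    rng = np.random.default_rng(seed)
    g = rng.choice(K, size=n, p=pi)                    # g[i]: group of vertex i
    a = (rng.random((n, m)) < theta[g]).astype(int)    # theta: K x m probabilities
    return a, g
```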

Priors: As priors we use the renormalized invariant prior of the binomial model (case K = 2 in (23)). Taking into account that we have a binomial model for each pair of group and edge, we obtain

$$P(\theta) = \prod_{kj} B(\theta_{kj}; \tilde\alpha_{kj}, \tilde\beta_{kj}) \qquad (27)$$

with $\tilde\alpha_{kj} \to 0$ and $\tilde\beta_{kj} \to 0$.

Substituting the likelihood (26), the priors (23) and (27), and the MF variational function (25) into (5), and integrating over φ (summing over gi and integrating over θkj and πk), we obtain the MF variational upper bound


Figure 2. Stratification of an animal population. (a) A list of animals is given together with certain attributes characterizing them. The complete data set is available from [24]. Except for one attribute (legs), one and zero indicate possession or not, respectively, of the corresponding attribute. The problem consists in determining the optimal stratification of the animal population based on the provided attributes. (b) Hypergraph representing the zoo data. Each line corresponds to an edge, whose elements are specified within the right column. (c) Stratification as obtained in [21]. (d) Stratification by the VB algorithm.

$$\begin{aligned} F \le{}& -\sum_{jk}\Big(\sum_i p_{ik} a_{ij} + \tilde\alpha_{kj} - 1\Big)\langle \ln\theta_{kj}\rangle - \sum_{jk}\Big(\sum_i p_{ik}(1 - a_{ij}) + \tilde\beta_{kj} - 1\Big)\langle \ln(1 - \theta_{kj})\rangle \\ & - \sum_{k}\Big(\sum_i p_{ik} + \tilde\gamma_k - 1\Big)\langle \ln\pi_k\rangle + \sum_{ik} p_{ik}\ln p_{ik} + \int d\theta\, R_1(\theta)\ln R_1(\theta) + \int d\pi\, R_2(\pi)\ln R_2(\pi) + \mathrm{const}. \qquad (28) \end{aligned}$$


Minimizing (28) with respect to pik, R1(θ) and R2(π) we obtain

$$p_{ik} = \frac{\mathrm{e}^{\langle \ln\pi_k\rangle + \sum_j [a_{ij}\langle \ln\theta_{kj}\rangle + (1 - a_{ij})\langle \ln(1 - \theta_{kj})\rangle]}}{\sum_s \mathrm{e}^{\langle \ln\pi_s\rangle + \sum_j [a_{ij}\langle \ln\theta_{sj}\rangle + (1 - a_{ij})\langle \ln(1 - \theta_{sj})\rangle]}} \qquad (29)$$

$$R_1(\theta) = \prod_{kj} B(\theta_{kj}; \alpha_{kj}, \beta_{kj}), \qquad (30)$$

$$\alpha_{kj} = \tilde\alpha_{kj} + \sum_i p_{ik} a_{ij}, \qquad \beta_{kj} = \tilde\beta_{kj} + \sum_i p_{ik} (1 - a_{ij}) \qquad (31)$$

$$R_2(\pi) = D(\pi; \gamma), \qquad \gamma_k = \tilde\gamma_k + \sum_i p_{ik} \qquad (32)$$

$$F^* = \mathrm{const} + \sum_{ik} p_{ik} \ln p_{ik} - \sum_{kj} \ln \mathrm{Beta}(\alpha_{kj}, \beta_{kj}) - \ln \mathrm{Beta}(\gamma). \qquad (33)$$

These equations represent the VB algorithm for the statistical model on hypergraphs.
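The paper distributes a MATLAB implementation as supplementary material; the following is an independent Python sketch of the same updates (29)–(33), using the standard Beta and Dirichlet expectations $\langle\ln\theta_{kj}\rangle = \psi(\alpha_{kj}) - \psi(\alpha_{kj} + \beta_{kj})$ and $\langle\ln\pi_k\rangle = \psi(\gamma_k) - \psi(\sum_s \gamma_s)$, with $\psi$ the digamma function. Function and variable names are ours, not taken from the supplementary code:

```python
import numpy as np
from scipy.special import digamma, betaln, gammaln

def vb_hypergraph(a, K=20, alpha0=1e-6, beta0=1e-6, gamma0=1e-6,
                  tol=1e-6, max_iter=500, seed=0):
    # One VB run for the hypergraph model; a is the n x m 0/1 matrix.
    # Returns the membership probabilities p[i, k] and the bound F*.
    a = np.asarray(a, dtype=float)
    n, m = a.shape
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(np.ones(K), size=n)        # random initial conditions
    F_old = np.inf
    for _ in range(max_iter):
        alpha = alpha0 + p.T @ a                 # (31), K x m
        beta = beta0 + p.T @ (1.0 - a)
        gamma = gamma0 + p.sum(axis=0)           # (32), length K
        ln_theta = digamma(alpha) - digamma(alpha + beta)
        ln_1mth = digamma(beta) - digamma(alpha + beta)
        ln_pi = digamma(gamma) - digamma(gamma.sum())
        logp = ln_pi + a @ ln_theta.T + (1.0 - a) @ ln_1mth.T   # (29), n x K
        logp -= logp.max(axis=1, keepdims=True)  # log-domain for stability
        p = np.exp(logp)
        p /= p.sum(axis=1, keepdims=True)
        F = (np.sum(p * np.log(p + 1e-300))      # (33), up to a constant
             - np.sum(betaln(alpha, beta))
             - (np.sum(gammaln(gamma)) - gammaln(gamma.sum())))
        if np.isfinite(F_old) and abs(F - F_old) <= tol * abs(F_old):
            break                                # relative-change stop rule
        F_old = F
    return p, F
```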

3.1. VB algorithm implementation

The implementation of the VB algorithm for the statistical model on hypergraphs proceeds as follows. Set a sufficiently large value for K, larger than our expectation for the actual value of K; we use K = 20 in the following test examples. Set the parameters α̃kj, β̃kj and γ̃k; we set α̃kj = β̃kj = γ̃k = 10−6. Set random initial conditions for pik. Starting from these initial conditions, iterate equations (29)–(33) until the solution converges to some predefined accuracy; we require a relative change of F* smaller than 10−6. In practice, compute αkj, βkj, ⟨ln θkj⟩, ⟨ln(1 − θkj)⟩, γk, ⟨ln πk⟩, pik and F* in that order. To explore different potential local minima, use different initial conditions and select the solution with the lowest F*. Since this algorithm penalizes groups with few members, for sufficiently large K some groups turn out empty. If this is not the case, increase K until at least one group is empty. A MATLAB code implementing this algorithm is provided as supplementary material; a driver loop in the spirit of this recipe is sketched below.
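A possible driver around the `vb_hypergraph` sketch above (again with illustrative names; the restart count and seeds are arbitrary):

```python
import numpy as np

def fit(a, K=20, restarts=10, seed=0):
    # Repeat from several random initial conditions and keep the lowest F*.
    best_p, best_F = None, np.inf
    for r in range(restarts):
        p, F = vb_hypergraph(a, K=K, seed=seed + r)
        if F < best_F:
            best_p, best_F = p, F
    groups = best_p.argmax(axis=1)
    n_occupied = np.unique(groups).size    # if == K, rerun with a larger K
    return groups, n_occupied, best_F
```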

3.2. Test example: zoo problem

Consider the animal population in figure 2(a) together with their attributes: habitat, nutrition behavior, etc. Figure 2(b) shows the mapping of this data set onto a hypergraph. The hypergraph vertices represent animals and the edges represent the association between all animals with a given attribute: edge 1, all non-airborne animals; edge 2, all airborne animals; and so on.

The animal population stratification was already addressed in [21], yielding the solution in figure 2(c). Although the starting statistical model is the same, the solution in [21] was found by fixing the number of groups and estimating the group assignment using the EM algorithm (essentially a maximum likelihood estimate). Then, in an attempt to focus on the solution with the best consensus, solutions for different numbers of groups were obtained and the most representative solution was selected.

Here we address the same problem using a Bayesian approach and the variational solution. We start from the same statistical model on hypergraphs but now obtain a solution using the VB algorithm (29)–(33). 10 000 initial conditions were sampled to keep the analysis as close as possible to that in [21]. The solution found by the VB algorithm (figure 2(d)) is quite similar to that previously found in [21] (figure 2(c)). The main differences are the splitting of the terrestrial mammals, and the exclusion of the platypus and the tortoise from the amphibia–reptiles group and of the scorpion from the terrestrial arthropods. More importantly, in both cases the main groups represent terrestrial mammals, aquatic mammals, birds, fishes, amphibia–reptiles, terrestrial arthropods and aquatic arthropods. The VB algorithm (29)–(33) nevertheless represents a significant improvement over the approach followed in [21]: it finds the consensus solution in one run, because the balance between better fit and less bias is built in.

3.3. Test example: finding network modules

The work by Newman and Leicht [20] provides a hint as to how to apply the hypergraph clustering VB algorithm to the problem of finding graph modules or communities. A graph is made from a set of vertices and a set of edges, the latter being pairs of connected vertices. The idea of Newman and Leicht is a 'guilty by association' principle: vertices within the same graph module tend to have connections to the same other vertices. This problem can be translated to a hypergraph problem, where the vertices are the graph vertices and the hyperedges are the sets of nearest neighbors [21] (figures 1(c) and (d)). More precisely, with each vertex we associate a hyperedge, given by the set of its nearest neighbors. Therefore, there are m = n hyperedges, one for every vertex in the original graph. The hypergraph adjacency matrix has the matrix element aij = 1 if vertex i belongs to hyperedge j, i.e. if vertex i belongs to the nearest-neighbor set of vertex j, and aij = 0 otherwise. On labeling the nearest-neighbor sets with the same label as the vertices, the hypergraph adjacency matrix coincides with the adjacency matrix of the original graph. Thus, there is an exact mapping from the statistical model proposed by Newman and Leicht [20] to the statistical model on hypergraphs.
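In code, the mapping is trivial, and benchmark graphs of the kind used below are easy to generate (a sketch with assumed parameter names; the figure 3 settings are n = 100 and two planted groups):

```python
import numpy as np

def graph_to_hypergraph(adj):
    # Hyperedge j is the nearest-neighbor set of vertex j, so the hypergraph
    # incidence matrix is the graph adjacency matrix itself.
    return np.asarray(adj, dtype=float)

def random_two_community_graph(n=100, p1=0.9, p2=0.1, seed=0):
    # Planted two-group random graph: intra-community edges with probability
    # p1, inter-community edges with probability p2.
    rng = np.random.default_rng(seed)
    g = (np.arange(n) >= n // 2).astype(int)
    prob = np.where(g[:, None] == g[None, :], p1, p2)
    a = (rng.random((n, n)) < prob).astype(float)
    a = np.triu(a, 1)
    return a + a.T, g        # symmetric adjacency, no self-loops
```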

Having specified this mapping we use the VB algorithm (29)–(33) to find the graph modules, sampling one initial condition. To illustrate its performance we consider as a case study a graph composed of two communities, with probabilities p1 and p2 that two vertices within the same or different communities are connected, respectively. To quantify the goodness of the group assignment we consider the mutual information of the original, $p^O$ ($p^O_{ik} = \delta_{g_i k}$), and estimated, $p^*$, group assignments,

$$I(p^O, p^*) = \sum_{kk'} \rho_{kk'} \ln \frac{\rho_{kk'}}{\rho^O_k\, \rho^*_{k'}} \qquad (34)$$

where

$$\rho_{kk'} = \frac{1}{n} \sum_i p^O_{ik}\, p^*_{ik'} \qquad (35)$$

$$\rho^O_k = \frac{1}{n} \sum_i p^O_{ik} \qquad (36)$$

$$\rho^*_k = \frac{1}{n} \sum_i p^*_{ik}. \qquad (37)$$

Note that $I(p^O, p^*)$ takes its maximum value when $p^* = p^O$, denoted by $I_0 = I(p^O, p^O)$.
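Equations (34)–(37) translate directly into a few lines of code (our helper, not from the paper; $p^O$ is built from the planted labels as a one-hot matrix):

```python
import numpy as np

def assignment_mutual_information(pO, pstar):
    # rho (35) and its marginals (36), (37); then the sum (34) over the
    # entries with non-zero joint weight.
    n = pO.shape[0]
    rho = pO.T @ pstar / n
    rO = pO.sum(axis=0) / n
    rs = pstar.sum(axis=0) / n
    mask = rho > 0
    outer = rO[:, None] * rs[None, :]
    return float(np.sum(rho[mask] * np.log(rho[mask] / outer[mask])))

# Example: pO = np.eye(K)[g] for planted labels g; I0 is the value at pstar = pO.
```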


Figure 3. Finding graph modules; hypergraph model. Mutual information $I = I(p^O, p^*)$ of the original $p^O$ and estimated $p^*$ group assignments, relative to the maximum value $I_0$ attained when $p^* = p^O$. The original data comprise random graphs with n = 100 vertices divided into K = 2 groups, with intra- and inter-community connection probabilities p1 and p2, respectively. (a), (b) The mutual information of the original groups and the group assignment estimated using the VB algorithm (29)–(33), sampling one initial condition, as a function of the inter-community connection probability p2. The dashed–dotted, solid and dashed lines correspond to the worst, average and best cases over 100 test examples. In (a) we deal with dense communities (p1 = 0.9) and the algorithm performs well (I/I0 ≈ 1) for small values of the inter-community connection probability p2. In (b) we deal with sparse communities (p1 = 0.1) and the algorithm performs well for large values of the inter-community connection probability p2. In (a), the squares and triangles were obtained as above but for graphs with 1000 and 10 000 vertices and a single graph realization. (c), (d) The same analysis for 100 vertices but using the Hofman and Wiggins VB algorithm [15] (http://vbmod.sourceforge.net/), sampling 25 initial conditions and using hyperparameter sets (c) (2, 1, 1, 2), (d) (1, 2, 2, 1) lines and (2, 1, 1, 2) symbols.

As already anticipated by Newman and Leicht [20], the nearest-neighbor approach can resolve both dense communities with sparser inter-community connections (p1 ≫ p2) and sparse communities with denser inter-community connections (p1 ≪ p2). Figures 3(a) and (b) show that the VB algorithm performs quite well in these two regimes. The advantage of the VB algorithm is that we did not need to specify the exact number of groups, just an upper bound (K = 20). These results hold for graph sizes of 100, 1000 and 10 000 vertices, indicating that the number of groups predicted by the algorithm is independent of the graph size, at least for the graphs analyzed here.

Figure 4. Computational complexity. Scaling of the algorithm running time with the number of graph vertices n, when applied to one realization of the random graphs described in figure 3. The solid line represents the $n^2$ scaling suggested by equations (29)–(33). These running times were obtained using a 1.86 GHz desktop computer.

Another important question is how to determine the computational complexity of the algorithm. Equations (29)–(33) suggest a running time scaling as O(MK), where $M = \sum_{ij} a_{ij}$ and K is, as before, the postulated maximum number of groups. In particular, for application to graphs, M is given by the number of edges. Figure 4 reports the scaling of the algorithm running time, for graphs generated as described above, with increasing number of vertices n. The solid line emphasizes the expected $M \sim n^2$ scaling.

To compare the performance of the algorithm with previous work we consider the VB algorithm reported by Hofman and Wiggins [15], who obtained a different VB algorithm based on a statistical model with distinct intra- and inter-community connection probabilities. Their algorithm requires as input the prior distribution of the intra- and inter-community edge densities, vc and vd respectively [15]. They used conjugate priors, which in this case are given by the beta distributions B(vc; c̃+0, c̃−0) and B(vd; d̃+0, d̃−0), respectively, where (c̃+0, c̃−0, d̃+0, d̃−0) is a set of 'hyperparameters'. These hyperparameters can be tuned to control the symmetry of the beta distribution. For example, when c̃+0 > c̃−0 the maximum of B(vc; c̃+0, c̃−0) is found between 1/2 and 1, but it is between 0 and 1/2 when c̃+0 < c̃−0. Thus, a good choice of hyperparameters could be (2, 1, 1, 2) in the assortative case (p1 > p2) and (1, 2, 2, 1) in the disassortative case (p1 < p2) [25]. Using the Hofman and Wiggins algorithm (http://vbmod.sourceforge.net/) with the hyperparameter set (2, 1, 1, 2) for p1 = 0.9 and (1, 2, 2, 1) for p1 = 0.1, we have inferred the group assignments for the same graphs as were considered above (figures 3(c) and (d)). The results are similar to those obtained using the VB algorithm developed here. We noticed, however, that the Hofman and Wiggins algorithm can fail when applied to small data sets if the hyperparameters are not properly chosen. For instance, for p1 = 0.1 the Hofman and Wiggins algorithm with hyperparameter set (2, 1, 1, 2) results in only one community, independently of the value of p2 (figure 3(d), symbols). In contrast, the method introduced here does not require fine-tuning of the priors: the latter are completely specified by the transformation-invariant priors determined in section 2.3. Finally, the Hofman and Wiggins algorithm has the same computational complexity, O(MK) [15]. Yet, given the available implementations of both VB algorithms, the Hofman and Wiggins algorithm is faster (figure 4).

4. Conclusions

Taking inspiration from mixture models [1], in particular Dirichlet mixture models [14], we work on the Bayesian formulation of the problem of finding hypergraph communities. Our starting point is a statistical model on hypergraphs [21], previously introduced in the context of graphs [20]. Using a MF approximation as the variational function, we resolve the population structure by solving the inverse problem, i.e. determining the hypergraph communities and model parameters from the data. The outcome is a variational Bayes (VB) algorithm, a self-consistent set of equations for determining the group assignments and the model parameters. The VB algorithm is based on recursive equations similar to those of the EM algorithm, but with an intrinsic penalization for model bias.

The VB algorithm for the statistical model on hypergraphs can be used to find graph modules as well. Starting from an idea of Newman and Leicht [20], we show that the problem of finding graph modules can be mapped onto the problem of finding hypergraph modules, where the hypergraph edges represent nearest-neighbor sets in the original graph. The resulting VB algorithm represents a significant improvement over the maximum likelihood approaches followed in [20] and [21], by including a self-consistent correction for model complexity and bias.

Depending on the starting statistical model, we can arrive at different versions of the VB algorithm. Hofman and Wiggins [15] have obtained another version based on a statistical model with distinct intra- and inter-community connection probabilities. The latter approach differs in the definition of what constitutes a group, community or module. We use the definition of Newman and Leicht [20] based on topological similarity, i.e. two vertices are topologically identical if they are connected to the same other vertices in the graph. Thus, we obtain groups of vertices whose patterns of connectivity are similar. On the other hand, the definition used by Hofman and Wiggins [15] is based on the existence of two edge densities, characterizing the tendency to have an edge between intra- and inter-group pairs of vertices. Depending on the problem and the question we are asking, we may adopt one or the other definition, and use the corresponding clustering method.

Acknowledgments

IAS membership was partially funded by the Helen and Martin Chooljian Founders' Circle Members. Thanks go to C Wiggins for helpful discussions regarding the implementation of the Hofman and Wiggins VB algorithm.


References

[1] McLachlan G and Peel D, 2000 Finite Mixture Models (New York: Wiley)
[2] Dempster A P, Laird N M and Rubin D B, Maximum likelihood from incomplete data via the EM algorithm, 1977 J. R. Stat. Soc. B 39 1
[3] Akaike H, A new look at the statistical model identification, 1974 IEEE Trans. Autom. Control 19 716
[4] Schwarz G, Estimating the dimension of a model, 1978 Ann. Stat. 6 461
[5] Abraham A, Das S and Roy S, Swarm intelligence algorithms for data clustering, 2007 Soft Computing for Knowledge Discovery and Data Mining (Berlin: Springer) p 279
[6] Jeffreys H, 1939 Theory of Probability (Oxford: Oxford University Press)
[7] Spirtes P, Glymour C and Scheines R, 2000 Causation, Prediction, and Search (Cambridge, MA: MIT Press)
[8] MacKay D J C, 2003 Information Theory, Inference, and Learning Algorithms (Cambridge: Cambridge University Press)
[9] Robert C P, 2007 The Bayesian Choice (New York: Springer)
[10] Chen M-H, Shao Q-M and Ibrahim J G, 2000 Monte Carlo Methods in Bayesian Computation (New York: Springer)
[11] Beal M J, Variational algorithms for approximate Bayesian inference, 2003 PhD Thesis Gatsby Computational Neuroscience Unit, University College London
[12] Yedidia J S, Freeman W T and Weiss Y, Constructing free-energy approximations and generalized belief propagation algorithms, 2005 IEEE Trans. Inf. Theory 51 2282
[13] Rasmussen C E, The infinite Gaussian mixture model, 2000 Adv. Neural Inf. Proc. Syst. 12 554
[14] Blei D M, Ng A Y and Jordan M I, Latent Dirichlet allocation, 2003 J. Mach. Learn. Res. 3 993
[15] Hofman J M and Wiggins C H, A Bayesian approach to network modularity, 2008 Phys. Rev. Lett. 100 258701
[16] Newman M E, Modularity and community structure in networks, 2006 Proc. Nat. Acad. Sci. 103 8577
[17] Arenas A, Díaz-Guilera A and Perez-Vicente C J, Synchronization reveals topological scales in complex networks, 2006 Phys. Rev. Lett. 96 114102
[18] Palla G, Derenyi I, Farkas I and Vicsek T, Uncovering the overlapping community structure of complex networks in nature and society, 2005 Nature 435 814
[19] Rosvall M and Bergstrom C T, An information-theoretic framework for resolving community structure in complex networks, 2007 Proc. Nat. Acad. Sci. 104 7327
[20] Newman M E J and Leicht E A, Mixture models and exploratory analysis in networks, 2007 Proc. Nat. Acad. Sci. 104 9564
[21] Vazquez A, Population stratification using a statistical model on hypergraphs, 2008 Phys. Rev. E 77 066106
[22] Kullback S, 1959 Information Theory and Statistics (New York: Wiley)
[23] Jaynes E T, Prior probabilities, 1968 IEEE Trans. Syst. Sci. Cybern. SSC-4 227
[24] Asuncion A and Newman D J, UCI machine learning repository, http://www.ics.uci.edu/mlearn/mlrepository.html
[25] Wiggins C, 2009 personal communication
