
Page 1:

Inferring communities in large networks;

the blessing of transitivity.

Karl Rohe @stat.wisc.edu

Joint work with Tai Qin. Research supported by DMS-1309998.


Page 2:

Historically, the overriding academic achievement of our field (stat) has been to develop a framework for making quantitatively rigorous inference.

[Diagram: world → (experiment / data collection) → raw data → (data analysis) → summary statistics, parameter estimates, visualizations → (statistical inference) → world]

Page 3:

Inference depends on a statistical model

[Diagram, as on the previous slide: world → summary statistics, parameter estimates, visualizations → (statistical inference) → world]

“All models are wrong. Some models are useful.”

Page 4:

This talk will discuss network data

Graph = (node set, edge set)

Networks or graphs are useful as a way of simplifying a complex system of interactions.

Facebook: edges could represent posting / commenting / liking

Biology: edges could represent something causal

Page 5:

Leskovec et al. 2008

[Figure 5 of Leskovec et al. (reproduced twice on the slide): several small social networks that are common test sets for community detection algorithms, each shown with its network community profile plot of Φ (conductance) against k (number of nodes in the cluster). (a)–(b) Zachary’s karate club network. (c)–(d) A network of dolphins. (e)–(f) A network of monks. (g)–(h) A network of researchers researching networks.]

Partitions, clusters, modularities, communities represent latent structure

• Edges simplify the underlying dyadic relationships.

• Communities suggest some latent structure in generating mechanism.

Algorithmic aim: put “similar” nodes in the same set and “different” nodes in different sets.

Page 6:

Local clustering algorithms were first proposed by Spielman and Teng.

• Spielman and Teng, 2008. “A local clustering algorithm for massive graphs and its application to nearly-linear time graph partitioning”

• really fast

• empirical success

• current theory:

• perturbation bounds for graph conductance

• bounds for running time.

Page 7:

Local Graph Partitioning using PageRank Vectors

Reid Andersen, University of California, San Diego

Fan Chung, University of California, San Diego

Kevin Lang, Yahoo! Research

Abstract

A local graph partitioning algorithm finds a cut near a specified starting vertex, with a running time that depends largely on the size of the small side of the cut, rather than the size of the input graph. In this paper, we present an algorithm for local graph partitioning using personalized PageRank vectors. We develop an improved algorithm for computing approximate PageRank vectors, and derive a mixing result for PageRank vectors similar to that for random walks. Using this mixing result, we derive an analogue of the Cheeger inequality for PageRank, which shows that a sweep over a single PageRank vector can find a cut with conductance φ, provided there exists a cut with conductance at most f(φ), where f(φ) is Ω(φ²/log m), and where m is the number of edges in the graph. By extending this result to approximate PageRank vectors, we develop an algorithm for local graph partitioning that can be used to find a cut with conductance at most φ, whose small side has volume at least 2^b, in time O(2^b log³ m/φ²). Using this local graph partitioning algorithm as a subroutine, we obtain an algorithm that finds a cut with conductance φ and approximately optimal balance in time O(m log⁴ m/φ³).

1 Introduction

One of the central problems in algorithmic design is the problem of finding a cut with a small conductance. There is a large literature of research papers on this topic, with applications in numerous areas.

Spectral partitioning, where an eigenvector is used to produce a cut, is one of the few approaches to this problem that can be analyzed theoretically. The Cheeger inequality [4] shows that the cut obtained by spectral partitioning has conductance within a quadratic factor of the optimum. Spectral partitioning can be applied recursively, with the resulting cuts combined in various ways, to solve more complicated problems; for example, recursive spectral algorithms have been used to find k-way partitions, spectral clusterings, and separators in planar graphs [2, 8, 13, 14]. There is no known way to lower bound the size of the small side of the cut produced by spectral partitioning, and this adversely affects the running time of recursive spectral partitioning.

Local spectral techniques provide a faster alternative to recursive spectral partitioning by avoiding the problem of unbalanced cuts. Spielman and Teng introduced a local partitioning algorithm called Nibble, which finds relatively small cuts near a specified starting vertex, in time proportional to the volume of the small side of the cut. The small cuts found by Nibble can be combined to form balanced cuts and multiway partitions in almost linear time, and the Nibble algorithm is an essential subroutine in algorithms for graph sparsification and solving linear systems [15]. The analysis of the Nibble algorithm is based on a mixing result by Lovasz and Simonovits [9, 10], which shows that cuts with small conductance can be found by simulating a random walk and performing sweeps over the resulting sequence of walk vectors.

arXiv:0809.3232v1 [cs.DS] 18 Sep 2008

A Local Clustering Algorithm for Massive Graphs and its Application to Nearly-Linear Time Graph Partitioning

Daniel A. Spielman, Department of Computer Science and Program in Applied Mathematics, Yale University

Shang-Hua Teng, Department of Computer Science, Boston University

September 18, 2008

Abstract

We study the design of local algorithms for massive graphs. A local algorithm is one that finds a solution containing or near a given vertex without looking at the whole graph. We present a local clustering algorithm. Our algorithm finds a good cluster—a subset of vertices whose internal connections are significantly richer than its external connections—near a given vertex. The running time of our algorithm, when it finds a non-empty local cluster, is nearly linear in the size of the cluster it outputs.

Our clustering algorithm could be a useful primitive for handling massive graphs, such as social networks and web-graphs. As an application of this clustering algorithm, we present a partitioning algorithm that finds an approximate sparsest cut with nearly optimal balance. Our algorithm takes time nearly linear in the number of edges of the graph.

Using the partitioning algorithm of this paper, we have designed a nearly-linear time algorithm for constructing spectral sparsifiers of graphs, which we in turn use in a nearly-linear time algorithm for solving linear systems in symmetric, diagonally-dominant matrices. The linear system solver also leads to a nearly linear-time algorithm for approximating the second-smallest eigenvalue and corresponding eigenvector of the Laplacian matrix of a graph. These other results are presented in two companion papers.

This paper is the first in a sequence of three papers expanding on material that appeared first under the title “Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems” [ST03]. The second paper, “Spectral Sparsification of Graphs” [ST08b], contains further results on partitioning graphs, and applies them to producing spectral sparsifiers of graphs. The third paper, “Nearly-Linear Time Algorithms for Preconditioning and Solving Symmetric, Diagonally Dominant Linear Systems” [ST08a], contains the results on solving linear equations and approximating eigenvalues and eigenvectors.

A local graph partitioning algorithm using heat kernel pagerank

Fan Chung

University of California at San Diego, La Jolla CA 92093, USA, [email protected], http://www.math.ucsd.edu/~fan/

Abstract. We give an improved local partitioning algorithm using heat kernel pagerank, a modified version of PageRank. For a subset S with Cheeger ratio (or conductance) h, we show that there are at least a quarter of the vertices in S that can serve as seeds for heat kernel pagerank which lead to local cuts with Cheeger ratio at most O(√h), improving the previous bound by a factor of log |S|.

1 Introduction

With the emergence of massive information networks, many previous algorithms are often no longer feasible. A basic setup for a generic algorithm usually includes a graph as a part of its input. This, however, is no longer possible for dealing with massive graphs of prohibitively large size. Instead, the (host) graph, such as the WWW-graph or various social networks, is usually meticulously crawled, organized and stored in some appropriate database. The local algorithms that we study here involve only “local access” of the database of the host graph. For example, getting a neighbor of a specified vertex is considered to be a type of local access. Of course, it is desirable to minimize the number of local accesses needed, hopefully independent of n, the number of vertices in the host graph (which may as well be regarded as “infinity”). In this paper, we consider a local algorithm that improves the performance bound of previous local partitioning algorithms.

Graph partitioning problems have long been studied and used for a wide range of applications, typically along the line of divide-and-conquer approaches. Since the exact solution for graph partitioning is known to be NP-complete [12], various approximation algorithms have been utilized. One of the best known partition algorithms is the spectral algorithm. The vertices are ordered by using an eigenvector and only cuts which are initial segments in such an ordering are considered. The advantage of such a “one-sweep” algorithm is to reduce the number of cuts under consideration from an exponential number in n to a linear number. Still, there is a performance guarantee within a quadratic order by using a relation between eigenvalues and the Cheeger constant, called the Cheeger inequality. However, for a large graph (say, with a hundred million vertices), the task of computing an eigenvector is often too costly and not competitive.

A local partitioning algorithm finds a partition that separates a subset of nodes of specified size near specified seeds. In addition, the running time of a local algorithm is required to be proportional to the size of the separated part but independent of the total size of the graph. In [21], Spielman and Teng first gave a local partitioning algorithm using random walks. The analysis of their algorithm is based on a mixing result of Lovasz and Simonovits in their work on approximating the volume of convex bodies. The same mixing result was also proved earlier and independently by Mihail [19]. In a previous paper [1], a local partitioning algorithm was given using PageRank, a concept first introduced by Brin and Page [3] in 1998 which has been widely used for Web search algorithms. PageRank is a quantitative ordering of vertices based on random walks on the Webgraph. The notion of PageRank, which can be carried out for any graph, is basically an efficient way of organizing random walks in a graph. As seen in the detailed definition given later, PageRank can be expressed as a geometric sum of random walks starting from the seed (or an initial probability distribution), with its speed of propagation controlled by a jumping constant. The usual question in random walks is to determine how many steps are required to get close to the stationary distribution. In the use of PageRank, the

Detecting Sharp Drops in PageRank and a Simplified Local Partitioning Algorithm

Reid Andersen and Fan Chung

University of California at San Diego, La Jolla CA 92093, USA
[email protected], http://www.math.ucsd.edu/~fan/
[email protected], http://www.math.ucsd.edu/~randerse/

Abstract. We show that whenever there is a sharp drop in the numerical rank defined by a personalized PageRank vector, the location of the drop reveals a cut with small conductance. We then show that for any cut in the graph, and for many starting vertices within that cut, an approximate personalized PageRank vector will have a sharp drop sufficient to produce a cut with conductance nearly as small as the original cut. Using this technique, we produce a nearly linear time local partitioning algorithm whose analysis is simpler than previous algorithms.

1 Introduction

When we are dealing with computational problems arising in complex networks with prohibitively large sizes, it is often desirable to perform computations whose cost can be bounded by a function of the size of their output, which may be quite small in comparison with the size of the whole graph. Such algorithms we call local algorithms (see [1]). For example, a local graph partitioning algorithm finds a cut near a specified starting vertex, with a running time that depends on the size of the small side of the cut, rather than the size of the input graph.

The first local graph partitioning algorithm was developed by Spielman and Teng [8], and produces a cut by computing a sequence of truncated random walk vectors. A more recent local partitioning algorithm achieves a better running time and approximation ratio by computing a single personalized PageRank vector [1]. Because a PageRank vector is defined recursively (as we will describe in the next section), a sweep over a single approximate PageRank vector can produce cuts with provably small conductance. Although this use of PageRank simplified the process of finding cuts, the analysis required to extend the basic cut-finding method into an efficient local partitioning algorithm remained complicated.

In this paper, we consider the following consequence of the personalized PageRank equation,

p = αv + (1 − α)pW,

arXiv:0811.3779v1 [cs.DS] 23 Nov 2008

Finding Sparse Cuts Locally Using Evolving Sets

Reid Andersen and Yuval Peres

November 23, 2008

Abstract

A local graph partitioning algorithm finds a set of vertices with small conductance (i.e. a sparse cut) by adaptively exploring part of a large graph G, starting from a specified vertex. For the algorithm to be local, its complexity must be bounded in terms of the size of the set that it outputs, with at most a weak dependence on the number n of vertices in G. Previous local partitioning algorithms find sparse cuts using random walks and personalized PageRank. In this paper, we introduce a randomized local partitioning algorithm that finds a sparse cut by simulating the volume-biased evolving set process, which is a Markov chain on sets of vertices. We prove that for any set of vertices A that has conductance at most φ, for at least half of the starting vertices in A our algorithm will output (with probability at least half) a set of conductance O(φ^{1/2} log^{1/2} n). We prove that for a given run of the algorithm, the expected ratio between its computational complexity and the volume of the set that it outputs is O(φ^{−1/2} polylog(n)). In comparison, the best previous local partitioning algorithm, due to Andersen, Chung, and Lang, has the same approximation guarantee, but a larger ratio of O(φ^{−1} polylog(n)) between the complexity and output volume. Using our local partitioning algorithm as a subroutine, we construct a fast algorithm for finding balanced cuts. Given a fixed value of φ, the resulting algorithm has complexity (m + nφ^{−1/2}) · O(polylog(n)) and returns a cut with conductance O(φ^{1/2} log^{1/2} n) and volume at least v_φ/2, where v_φ is the largest volume of any set with conductance at most φ.

Local Partitioning for Directed Graphs Using PageRank

Reid Andersen¹, Fan Chung², and Kevin Lang³

¹ Microsoft Research, Redmond WA 98052, [email protected]

² University of California, San Diego, La Jolla CA 92093, [email protected]

³ Yahoo! Research, Santa Clara CA 95054, [email protected]

Abstract. A local partitioning algorithm finds a set with small conductance near a specified seed vertex. In this paper, we present a generalization of a local partitioning algorithm for undirected graphs to strongly connected directed graphs. In particular, we prove that by computing a personalized PageRank vector in a directed graph, starting from a single seed vertex within a set S that has conductance at most φ, and by performing a sweep over that vector, we can obtain a set of vertices S′ with conductance Φ_M(S′) = O(√(φ log |S|)). Here, the conductance function Φ_M is defined in terms of the stationary distribution of a random walk in the directed graph. In addition, we describe how this algorithm may be applied to the PageRank Markov chain of an arbitrary directed graph, which provides a way to partition directed graphs that are not strongly connected.

1 Introduction

In directed networks like the world wide web, it is critical to develop algorithms that utilize the additional information conveyed by the direction of the links. Algorithms for web crawling, web mining, and search ranking all depend heavily on the directedness of the graph. For the problem of graph partitioning, it is extremely challenging to develop algorithms that effectively utilize the directed links.

Spectral algorithms for graph partitioning have natural obstacles for generalizations to directed graphs. Nonsymmetric matrices do not have a spectral decomposition, meaning there does not necessarily exist an orthonormal basis of eigenvectors. The stationary distribution for random walks on directed graphs is no longer determined by the degree sequences. In the earlier work of Fill [7] and Mihail [12], several generalizations for directed graphs were examined for regular graphs. Lovasz and Simonovits [11] established a bound for the mixing rate of an asymmetric ergodic Markov chain in terms of its conductance. When applied to the Markov chain of a random walk in a strongly connected directed graph, their results can be used to identify a set of states of the Markov chain with small conductance. Algorithms for finding sparse cuts, based on linear and semidefinite programming and metric embeddings, have also been generalized to

A. Bonato and F.R.K. Chung (Eds.): WAW 2007, LNCS 4863, pp. 166–178, 2007. © Springer-Verlag Berlin Heidelberg 2007

Page 8:

This talk aims to provide a statistical framework for local clustering by

(1) showing that sparse and transitive Stochastic Blockmodels (aka planted partition models) naturally lead to local clustering.

(2) illustrating how the blessing of transitivity makes small clusters easy to estimate (both statistically and algorithmically).

Page 9:

• K = number of blocks

• s = population of each block. So, all blocks have equal population.

• r = probability of an out-of-block connection

• p = probability of an in-block connection

So, n = Ks.

Consider the planted partition model (a simplified Stochastic Blockmodel).

Assume r < p.
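To make the model concrete, here is a minimal simulation sketch in Python (not from the talk; the function name and numpy interface are my own, and r < p is the only assumption carried over):

```python
import numpy as np

def planted_partition(K, s, p, r, rng=None):
    """Sample an adjacency matrix from the planted partition model:
    K blocks of s nodes each (n = K*s), in-block edge probability p,
    out-of-block edge probability r (assumed r < p)."""
    rng = np.random.default_rng(rng)
    n = K * s
    z = np.repeat(np.arange(K), s)                 # block label of each node
    P = np.where(z[:, None] == z[None, :], p, r)   # dyad-wise edge probabilities
    U = rng.random((n, n))
    A = np.triu((U < P).astype(int), k=1)          # sample each dyad once
    return A + A.T, z                              # symmetric, no self-loops
```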

Page 10:

[Image: heatmap of an adjacency matrix with nodes ordered by block; both axes run from 0.0 to 1.0.]

Adjacency matrix from the Stochastic Blockmodel

We want to estimate the partition of the nodes Z.

Page 11:

Extensive literature studies the estimation of Z

• Started in IEEE community

• McSherry. 2001. “Spectral partitioning of random graphs.”

• Dasgupta, Hopcroft, and McSherry. 2004. “Spectral analysis of random graphs with skewed degree distributions.”

• Great expansion in literature over past 4 years . . .

Page 12:

A nonparametric view of network models and Newman–Girvan and other modularities

Peter J. Bickel and Aiyou Chen

University of California, Berkeley, CA 94720; Alcatel-Lucent Bell Labs, Murray Hill, NJ 07974

Edited by Stephen E. Fienberg, Carnegie Mellon University, Pittsburgh, PA, and approved October 13, 2009 (received for review July 2, 2009)

Prompted by the increasing interest in networks in many fields, we present an attempt at unifying points of view and analyses of these objects coming from the social sciences, statistics, probability and physics communities. We apply our approach to the Newman–Girvan modularity, widely used for “community” detection, among others. Our analysis is asymptotic but we show by simulation and application to real examples that the theory is a reasonable guide to practice.

modularity | profile likelihood | ergodic model | spectral clustering

The social sciences have investigated the structure of small networks since the 1970s, and have come up with elaborate modeling strategies, both deterministic, see Doreian et al. (1) for a view, and stochastic, see Airoldi et al. (2) for a view and recent work. During the same period, starting with the work of Erdös and Rényi (3), a rich literature has developed on the probabilistic properties of stochastic models for graphs. A major contribution to this work is Bollobás et al. (4). On the whole, the goals of the analyses of ref. 4, such as emergence of the giant component, are not aimed at the statistical goals of the social science literature we have cited.

Recently, there has been a surge of interest, particularly in the physics and computer science communities, in the properties of networks of many kinds, including the Internet, mobile networks, the World Wide Web, citation networks, email networks, food webs, and social and biochemical networks. Identification of “community structure” has received particular attention: the vertices in networks are often found to cluster into small communities, where vertices within a community share the same densities of connecting with vertices in their own community as well as different ones with other communities. The ability to detect such groups can be of significant practical importance. For instance, groups within the worldwide Web may correspond to sets of web pages on related topics; groups within mobile networks may correspond to sets of friends or colleagues; groups in computer networks may correspond to users that are sharing files with peer-to-peer traffic, or collections of compromised computers controlled by remote hackers, e.g. botnets (5). A recent algorithm proposed by Newman and Girvan (6), that maximizes a so-called “Newman–Girvan” modularity function, has received particular attention because of its success in many applications in social and biological networks (7).

Our first goal is, by starting with a model somewhat less general than that of ref. 4, to construct a nonparametric statistical framework, which we will then use in the analysis, both of modularities and parametric statistical models. Our analysis is asymptotic, letting the number of vertices go to ∞. We view, as usual, asymptotics as being appropriate insofar as they are a guide to what happens for finite n. Our models can, on the one hand, be viewed as special cases of those proposed by ref. 4, and on the other, as encompassing most of the parametric and semiparametric models discussed in Airoldi et al. (2) from a statistical point of view and in Chung and Lu (8) for a probabilistic one. An advantage of our framework is the possibility of analyzing the properties of the Newman–Girvan modularity, and the reasons for its success and occasional failures. Our approach suggests an alternative modularity which is, in principle, “fail-safe” for rich enough models. Moreover, our point of view has the virtue of enabling us to think in terms of “strength of relations” between individuals not necessarily clustering them into communities beforehand.

We begin, using results of Aldous and Hoover (9), by introducing what we view as the analogues of arbitrary infinite population models on infinite unlabeled graphs which are “ergodic” and from which a subgraph with n vertices can be viewed as a piece. This development of Aldous and Hoover can be viewed as a generalization of deFinetti’s famous characterization of exchangeable sequences as mixtures of i.i.d. ones. Thus, our approach can also be viewed as a first step in the generalization of the classical construction of complex statistical models out of i.i.d. ones using covariates, information about labels and relationships.

It turns out that natural classes of parametric models which approximate the nonparametric models we introduce are the “blockmodels” introduced by Holland, Laskey and Leinhardt (ref. 10); see also refs. 2 and 11, which are generalizations of the Erdös–Rényi model. These can be described as follows.

In a possibly (at least conceptually) infinite population (of vertices) there are K unknown subcommunities. Unlabeled individuals (vertices) relate to each other through edges which for this paper we assume are undirected. This situation leads to the following set of probability models for undirected graphs or equivalently the corresponding adjacency matrices {A_ij : i, j ≥ 1}, where A_ij = 1 or 0 according as there is or is not an edge between i and j.

1. Individuals independently belong to community j with probability π_j, 1 ≤ j ≤ K, Σ_{j=1}^{K} π_j = 1.

2. A symmetric K × K matrix {P_kl : 1 ≤ k, l ≤ K} of probabilities is given such that P_ab is the probability that a specific individual i relates to individual j given that i ∈ a, j ∈ b. The membership relations between individuals are established independently. Thus 1 − Σ_{1≤a,b≤K} π_a π_b P_ab is the probability that there is no edge between i and j.

The Erdös–Rényi model corresponds to K = 1. We proceed to define Newman–Girvan modularity and an alternative statistically motivated modularity. We give necessary and sufficient conditions for consistency based on the parameters of the block model, properties of the modularities, and average degree of the graph. By consistency we mean that the modularities can identify the members of the block model communities perfectly. We also give examples of inconsistency when the conditions fail. We then study the validity of the asymptotics in a limited simulation and apply our approach to a classical small example, the Karate Club, and a large set of Private Branch Exchange (PBX) data. We conclude with a discussion and some open problems.

21068–21073 | PNAS | December 15, 2009 | vol. 106 | no. 50

Biometrika (2012), pp. 1–12. doi: 10.1093/biomet/asr053. © 2012 Biometrika Trust. Printed in Great Britain.

Stochastic blockmodels with a growing number of classes

BY D. S. CHOI
School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts 02138, U.S.A. [email protected]

P. J. WOLFE
Department of Statistical Science, University College London, London WC1E 6BT, U.K. [email protected]

AND E. M. AIROLDI
Department of Statistics, Harvard University, Cambridge, Massachusetts 02138, U.S.A. [email protected]

SUMMARY

We present asymptotic and finite-sample results on the use of stochastic blockmodels for the analysis of network data. We show that the fraction of misclassified network nodes converges in probability to zero under maximum likelihood fitting when the number of classes is allowed to grow as the root of the network size and the average network degree grows at least poly-logarithmically in this size. We also establish finite-sample confidence bounds on maximum-likelihood blockmodel parameter estimates from data comprising independent Bernoulli random variates; these results hold uniformly over class assignment. We provide simulations verifying the conditions sufficient for our results, and conclude by fitting a logit parameterization of a stochastic blockmodel with covariates to a network data example comprising self-reported school friendships, resulting in block estimates that reveal residual structure.

Some key words: Likelihood-based inference; Social network analysis; Sparse random graph; Stochastic blockmodel.

1. INTRODUCTION

The global structure of social, biological, and information networks is sometimes envisioned as the aggregate of many local interactions whose effects propagate in ways that are not yet well understood. There is increasing opportunity to collect data on an appropriate scale for such systems, but their analysis remains challenging (Goldenberg et al., 2009). Here we analyse a statistical model for network data known as the single-membership stochastic blockmodel. Its salient feature is that it partitions the N nodes of a network into K distinct classes whose members all interact similarly with the network. Blockmodels were first associated with the deterministic concept of structural equivalence in social network analysis (Lorrain & White, 1971), where two nodes were considered interchangeable if their connections were equivalent in a formal sense. This concept was adapted to stochastic settings and gave rise to the stochastic blockmodel in the work by Holland et al. (1983) and Fienberg et al. (1985). The model and extensions thereof have since been applied in a variety of disciplines

Biometrika Advance Access published April 17, 2012


JMLR: Workshop and Conference Proceedings vol (2012) 35.1–35.23 25th Annual Conference on Learning Theory

Spectral Clustering of Graphs with General Degrees in the Extended Planted Partition Model

Kamalika Chaudhuri [email protected]

Fan Chung [email protected]

Alexander Tsiatas [email protected]

Department of Computer Science and Engineering

University of California, San Diego

La Jolla, CA 92093, USA

Editor: Shie Mannor, Nathan Srebro, Robert C. Williamson

Abstract

In this paper, we examine a spectral clustering algorithm for similarity graphs drawn from a simple random graph model, where nodes are allowed to have varying degrees, and we provide theoretical bounds on its performance. The random graph model we study is the Extended Planted Partition (EPP) model, a variant of the classical planted partition model.

The standard approach to spectral clustering of graphs is to compute the bottom k singular vectors or eigenvectors of a suitable graph Laplacian, project the nodes of the graph onto these vectors, and then use an iterative clustering algorithm on the projected nodes. However a challenge with applying this approach to graphs generated from the EPP model is that unnormalized Laplacians do not work, and normalized Laplacians do not concentrate well when the graph has a number of low degree nodes.

We resolve this issue by introducing the notion of a degree-corrected graph Laplacian. For graphs with many low degree nodes, degree correction has a regularizing effect on the Laplacian. Our spectral clustering algorithm projects the nodes in the graph onto the bottom k right singular vectors of the degree-corrected random-walk Laplacian, and clusters the nodes in this subspace. We show guarantees on the performance of this algorithm, demonstrating that it outputs the correct partition under a wide range of parameter values. Unlike some previous work, our algorithm does not require access to any generative parameters of the model.

Keywords: Spectral clustering, unsupervised learning, normalized Laplacian

1. Introduction

Spectral clustering of similarity graphs is a fundamental tool in exploratory data analysis, which has enjoyed much empirical success (Shi and Malik, 2000; Ng et al., 2002; von Luxburg, 2007) in machine-learning. In this paper, we examine a spectral clustering algorithm for similarity graphs drawn from a simple random graph model, where nodes are allowed to have varying degrees, and we provide theoretical bounds on its performance. Such clustering problems arise in the context of partitioning social network graphs to reveal hidden communities, or partitioning communication networks to reveal groups of nodes that frequently communicate.

The random graph model we study is the Extended Planted Partition (EPP) model, a variant of the classical planted partition model. A graph G = (V, E) generated from this model has a hidden partition V_1, . . . , V_k, as well as a number d_u associated with each node u. If two nodes u and v

© 2012 K. Chaudhuri, F. Chung & A. Tsiatas.

arXiv:1105.3288v3 [math.ST] 30 Sep 2012

Electronic Journal of Statistics, Vol. 0 (0000). ISSN: 1935-7524. DOI: 10.1214/154957804100000000.

Consistency of maximum-likelihood and variational estimators in the Stochastic Block Model

Alain Celisse
Laboratoire de Mathematique, UMR 8524 CNRS – Universite Lille 1, F-59 655 Villeneuve d’Ascq Cedex, France. e-mail: [email protected]

and

Jean-Jacques Daudin
UMR 518 AgroParisTech/INRA MIA, 16 rue Claude Bernard, F-75 231 Paris Cedex 5, France. e-mail: [email protected]

and

Laurent Pierre
Universite Paris X, 200 avenue de la Republique, F-92 001 Nanterre Cedex, France. e-mail: [email protected]

Abstract: The stochastic block model (SBM) is a probabilistic model designed to describe heterogeneous directed and undirected graphs. In this paper, we address the asymptotic inference in SBM by use of maximum-likelihood and variational approaches. The identifiability of SBM is proved while asymptotic properties of maximum-likelihood and variational estimators are derived. In particular, the consistency of these estimators is settled for the probability of an edge between two vertices (and for the group proportions at the price of an additional assumption), which is to the best of our knowledge the first result of this type for variational estimators in random graphs.

AMS 2000 subject classifications: Primary 62G05, 62G20; secondary 62E17, 62H30. Keywords and phrases: Random graphs, Stochastic Block Model, maximum likelihood estimators, variational estimators, consistency, concentration inequalities.


A consistent adjacency spectral embedding for stochastic blockmodel graphs

Daniel L. Sussman, Minh Tang, Donniell E. Fishkind, Carey E. Priebe

Johns Hopkins University, Applied Math and Statistics Department

April 30, 2012

Abstract

We present a method to estimate block membership of nodes in a random graph generated by a stochastic blockmodel. We use an embedding procedure motivated by the random dot product graph model, a particular example of the latent position model. The embedding associates each node with a vector; these vectors are clustered via minimization of a square error criterion. We prove that this method is consistent for assigning nodes to blocks, as only a negligible number of nodes will be mis-assigned. We prove consistency of the method for directed and undirected graphs. The consistent block assignment makes possible consistent parameter estimation for a stochastic blockmodel. We extend the result in the setting where the number of blocks grows slowly with the number of nodes. Our method is also computationally feasible even for very large graphs. We compare our method to Laplacian spectral clustering through analysis of simulated data and a graph derived from Wikipedia documents.

1 Background and Overview

Network analysis is rapidly becoming a key tool in the analysis of modern datasets in fields ranging from neuroscience to sociology to biochemistry. In each of these fields, there are objects, such as neurons, people, or genes, and there are relationships between objects, such as synapses, friendships, or protein interactions. The formation of these relationships can depend on attributes of the individual objects as well as higher order properties of the network as a whole. Objects with similar attributes can form communities with similar connective structure, while unique properties of individuals can fine tune the shape of these relationships. Graphs encode the relationships between objects as edges between nodes in the graph.

Clustering objects based on a graph enables identification of communities and objects of interest as well as illumination of overall network structure. Finding optimal clusters is difficult and will depend on the particular setting and

arXiv:1108.2228v3 [stat.ML] 27 Apr 2012

Classification and estimation in the Stochastic Block Model based on the empirical degrees

Antoine Channarond, Jean-Jacques Daudin and Stéphane Robin

Département de Mathématiques et Informatique Appliquées, UMR 518 AgroParisTech/INRA, 16 rue Claude Bernard, 75231 Paris Cedex 5, France

channarond,daudin,[email protected]

Abstract

The Stochastic Block Model (Holland et al., 1983) is a mixture model for heterogeneous network data. Unlike the usual statistical framework, new nodes give additional information about the previous ones in this model. Thereby the distribution of the degrees concentrates in points conditionally on the node class. We show under a mild assumption that classification, estimation and model selection can actually be achieved with no more than the empirical degree data. We provide an algorithm able to process very large networks and consistent estimators based on it. In particular, we prove a bound of the probability of misclassification of at least one node, including when the number of classes grows.

1 Introduction

Strong attention has recently been paid to network models in many domains such as social sciences, biology or computer science. Networks are used to represent pairwise interactions between entities. For example, sociologists are interested in observing friendships, calls and collaboration between people, companies or countries. Genomicists wonder which gene regulates which other. But the most famous examples are undoubtedly the Internet, where data traffic involves millions of routers or computers, and the World Wide Web, containing millions of pages connected by hyperlinks. A lot of other examples of real-world networks are empirically treated in Albert and Barabási (2002), and the book by Faust and Wasserman (1994) gives a general introduction to mathematical modelling of networks, and especially to graph theory.

One of the main features expected from graph models is inhomogeneity. Some articles, e.g. Bollobás et al. (2007) or Van Der Hofstad (2009), address this question. In the Erdős-Rényi model introduced by Erdős and Rényi (1959) and Gilbert (1959), all nodes play the same role, while most real-world networks are definitely not homogeneous.

arXiv:1110.6517v1 [math.ST] 29 Oct 2011

Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel

Tai Qin
Department of Statistics, University of Wisconsin-Madison, Madison, WI
[email protected]

Karl Rohe
Department of Statistics, University of Wisconsin-Madison, Madison, WI
[email protected]

Abstract

Spectral clustering is a fast and popular algorithm for finding clusters in networks. Recently, Chaudhuri et al. [1] and Amini et al. [2] proposed inspired variations on the algorithm that artificially inflate the node degrees for improved statistical performance. The current paper extends the previous statistical estimation results to the more canonical spectral clustering algorithm in a way that removes any assumption on the minimum degree and provides guidance on the choice of the tuning parameter. Moreover, our results show how the “star shape” in the eigenvectors—a common feature of empirical networks—can be explained by the Degree-Corrected Stochastic Blockmodel and the Extended Planted Partition model, two statistical models that allow for highly heterogeneous degrees. Throughout, the paper characterizes and justifies several of the variations of the spectral clustering algorithm in terms of these models.

1 Introduction

Our lives are embedded in networks—social, biological, communication, etc.—and many researchers wish to analyze these networks to gain a deeper understanding of the underlying mechanisms. Some types of underlying mechanisms generate communities (aka clusters or modularities) in the network. As machine learners, our aim is not merely to devise algorithms for community detection, but also to study the algorithm’s estimation properties, to understand if and when we can make justifiable inferences from the estimated communities to the underlying mechanisms. Spectral clustering is a fast and popular technique for finding communities in networks. Several previous authors have studied the estimation properties of spectral clustering under various statistical network models (McSherry [3], Dasgupta et al. [4], Coja-Oghlan and Lanka [5], Ames and Vavasis [6], Rohe et al. [7], Sussman et al. [8] and Chaudhuri et al. [1]). Recently, Chaudhuri et al. [1] and Amini et al. [2] proposed two inspired ways of artificially inflating the node degrees in ways that provide statistical regularization to spectral clustering.

This paper examines the statistical estimation performance of regularized spectral clustering under the Degree-Corrected Stochastic Blockmodel (DC-SBM), an extension of the Stochastic Blockmodel (SBM) that allows for heterogeneous degrees (Holland and Leinhardt [9], Karrer and Newman [10]). The SBM and the DC-SBM are closely related to the planted partition model and the extended planted partition model, respectively. We extend the previous results in the following ways: (a) In contrast to previous studies, this paper studies the regularization step with a canonical version of spectral clustering that uses k-means. The results do not require any assumptions on the minimum expected node degree; instead, there is a threshold demonstrating that higher degree nodes are easier to cluster. This threshold is a function of the leverage scores that have proven essential in other contexts, for both graph algorithms and network data analysis (see Mahoney [11] and references therein). These are the first results that relate leverage scores to the statistical performance


Page 13:

Current theory suggests that three quantities regulate the difficulty of estimating Z.

1. Edge sparsity. More edges are better.

2. Size of the smallest cluster. Bigger is better.

3. Difference between in-block and out-of-block probabilities (when they are equal the model is unidentifiable!)

Page 14:

(1) Sparse and transitive models (stochastic block or planted partition) naturally lead to local clustering with large p, vanishing r.

(2) This blessing of transitivity makes small clusters easy to estimate (both statistically and algorithmically), even under “semi-parametric” models.

A statistical framework for local clustering

Page 15:

Empirical networks are sparse and transitive.

• Empirically, the average node degree in most networks is between 10 and 100, even in “massive graphs.”

• Moreover, most networks have many more triangles than we would expect under an Erdős–Rényi random graph.

Page 16:

Transitivity: Friends-of-friends are likely to be friends.

[Diagram: a “two-star” (edges i–u and i–v, with the edge u–v in question) next to a “triangle” (i–u, i–v, and u–v all present).]

Another popular measure of transitivity is the clustering coefficient (Watts and Strogatz 1998).
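A minimal numpy sketch of the global clustering coefficient (the transitivity ratio, closely related to Watts and Strogatz’s average of local coefficients; the function name is my own):

```python
import numpy as np

def clustering_coefficient(A):
    """Global clustering coefficient: the fraction of two-stars u-i-v
    that close into triangles, i.e. 3 * (# triangles) / (# two-stars)."""
    A = np.asarray(A)
    A2 = A @ A
    triangles = np.trace(A2 @ A) / 6.0            # each triangle counted 6 times
    two_stars = (A2.sum() - np.trace(A2)) / 2.0   # sum_i deg(i) * (deg(i) - 1) / 2
    return 3.0 * triangles / two_stars if two_stars > 0 else 0.0
```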

Page 17:

Under the planted partition model with r < p:
a) If p goes to zero, then you remove transitivity.
b) If p is bounded from below, then block size cannot grow faster than the expected degree.

Page 18:

p* = P(A_uv = 1 | A_iu = A_iv = 1)    (1)

Proposition 1. Under the four parameter Stochastic Blockmodel with r ≤ p,
a) if p → 0, then p* = P(A_uv = 1 | A_iu = A_iv = 1) → 0.
b) if p is bounded from below, then s = O(λ_n), where s is the population of each block and λ_n is the expected node degree.

Definition 1. Suppose A ∈ {0,1}^{(n+s)×(n+s)} is an adjacency matrix and S* is a set of nodes with |S*| = s. For j ∈ S*^c, define

d*_j = Σ_{ℓ ∈ S*^c} A_{ℓj}.

If
(1) i ∈ S* and j ∈ S*^c implies P(A_ij = 1) ≤ d*_j / n,
(2) i, j ∈ S* implies P(A_ij = 1) ≥ p_in,
(3) {A_ij : ∀j and ∀i ∈ S*} are mutually independent,
then A follows the local degree-corrected Stochastic Blockmodel with parameters S*, p_in.

Similar results to part a) hold under the more general Exchangeable Random Graph Model.

Under the planted partition model with r < p:
a) If p goes to zero, then you remove transitivity.
b) If p is bounded from below, then block size cannot grow faster than the expected degree.
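Part a) is easy to check by Monte Carlo. A sketch that estimates p* = P(A_uv = 1 | A_iu = A_iv = 1) as the fraction of two-stars that close into triangles, reusing the planted_partition sampler sketched earlier (both function names are mine):

```python
import numpy as np

def estimate_p_star(K, s, p, r, reps=200, seed=0):
    """Monte Carlo estimate of p* = P(A_uv = 1 | A_iu = A_iv = 1)."""
    rng = np.random.default_rng(seed)
    closed = total = 0.0
    for _ in range(reps):
        A, _ = planted_partition(K, s, p, r, rng)
        A2 = A @ A                       # (A2)_uv = # common neighbors of u, v
        off = ~np.eye(A.shape[0], dtype=bool)
        total += A2[off].sum()           # two-stars
        closed += (A2 * A)[off].sum()    # two-stars whose endpoints are linked
    return closed / total

# With p fixed the estimate stays bounded away from zero;
# sending p -> 0 (keeping r < p) drives it to zero, as in part a).
```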

Page 19:

The blessing of transitivity

• In the planted partition model with bounded expected degree, transitivity implies:

(i) p > ε > 0, (ii) r = O(1/n), and (iii) s = O(1).

Page 20:

Previous results suggest estimation could be very difficult.

• Previous results require (1) growing degrees and (2) growing blocks

• We want (1) bounded degree and thus (2) bounded blocks

• Still possible because r → 0 and p is fixed.

Page 21:

Blessing of transitivity doesn’t make “bad” edges disappear. It makes “bad” triangles disappear.

(1) Out-of-block edges can be common. O(n)

(2) Out-of-block triangles are unlikely. O(1)

(3) In-block triangles are common. O(n)

Note: We look at triangles for computational purposes. Could look for 4-cliques, then (2) becomes O(1/n) and (3) gets a new constant.

In sparse and transitive planted partition model
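A back-of-the-envelope count makes these rates concrete (a sketch under the assumptions the slide implies: r = c/n for a fixed constant c, with p and the block population s held fixed, so n grows by adding blocks):

```latex
\begin{align*}
\mathbb{E}[\#\text{ out-of-block edges}]        &\asymp n^{2} r = c\,n = O(n)\\
\mathbb{E}[\#\text{ triangles on three blocks}] &\asymp n^{3} r^{3} = c^{3} = O(1)\\
\mathbb{E}[\#\text{ triangles on two blocks}]   &\asymp n^{2} s\, r^{2} p \asymp s\,p\,c^{2} = O(1)\\
\mathbb{E}[\#\text{ in-block triangles}]        &\asymp \frac{n}{s}\binom{s}{3} p^{3} = O(n)
\end{align*}
```

Swapping triangles for 4-cliques costs each cross-block count another factor of r = c/n, which is the O(1/n) in the note above.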

Page 22:

Related research

• R, Qin, Fan 2012, “Highest dimensional Stochastic Blockmodel with a regularized estimator”

• used the restricted MLE

• Verzelen and Arias-Castro, 2013. “Community Detection in Sparse Random Networks”

• Hypothesis testing

• H0: massive, sparse Erdős–Rényi

• Ha: hidden block, growing faster than log(N).

• Number of triangles is a powerful test statistic.

Page 23:

Empirical networks contain small communities.

• Leskovec, Lang, Dasgupta, Mahoney 2008. In large empirical networks, communities with smallest conductance are no larger than 100 nodes.

• Transitivity plays a key role and we should use triangles to discover the local clusters.

[Figure 13 of Leskovec et al. (best viewed in color): network community profile plots of Φ (conductance) against k (number of nodes in the cluster), with (red) and without (green) 1-whiskers, for six networks: (a) LiveJournal01, (b) Epinions, (c) Cit-hep-th, (d) Web-Google, (e) Atp-DBLP, (f) Gnutella-31. Whiskers were removed as described in the text; with whiskers, results are plotted for the full network, and without, for the largest bi-connected component.]

arXiv:1309.7440v1 [cs.DS] 28 Sep 2013

Decompositions of Triangle-Dense Graphs

Rishi Gupta, Stanford University, [email protected]
Tim Roughgarden, Stanford University, [email protected]
C. Seshadhri, Sandia National Labs, CA, [email protected]

Abstract

High triangle density — the graph property stating that most two-hop paths belong to a triangle — is a common signature of social networks. This paper studies triangle-dense graphs from a structural perspective. We prove constructively that most of the content of a triangle-dense graph is contained in a disjoint union of radius 2 dense subgraphs. This result quantifies the extent to which triangle-dense graphs resemble unions of cliques. We also show that our algorithm recovers planted clusterings in stable k-median instances.

1 Introduction

Can the special structure possessed by social networks be exploited algorithmically? Answering this question requires a formal definition of “social network structure.” Extensive work on this topic has generated countless proposals but little consensus (see e.g. [CF06]). The most oft-mentioned (and arguably most validated) statistical properties of social networks include heavy-tailed degree distributions [BA99, BKM+00, FFF99], a high density of triangles [WS98, SCW+10, UKBM11] and other dense subgraphs or “communities” [For10, GN02, New03, New06, LLDM08], and low diameter and the small world property [Kle00a, Kle00b, Kle01, New01].

Much of the recent mathematical work on social networks has focused on the important goal of developing generative models that produce random networks with many of the above statistical properties. Well-known examples of such models include preferential attachment [BA99] and related copying models [KRR+00], Kronecker graphs [CZF04, LCK+10], and the Chung-Lu random graph model [CL02b, CL02a]. A generative model articulates a hypothesis about what “real-world” social networks look like, and is directly useful for generating synthetic data. Once a particular generative model of social networks is adopted, a natural goal is to design algorithms tailored to perform well on the instances generated by the model. It can also be used as a proxy to study the effect of random processes (like edge deletions) on a network. Examples of such results include [AJB00, LAS+08, MS10].


Page 24:

(1) Sparse and transitive models (stochastic block or planted partition) naturally lead to local clustering with large p, vanishing r.

(2) This blessing of transitivity makes small clusters easy to estimate (both statistically and algorithmically), even under “semi-parametric” models.

A statistical framework for local clustering

Page 25:

Weaving together the two threads.

• Started off talking about local clustering algorithms.

• Then, we forgot that and used SBM + sparsity + transitivity to get a model with small blocks.

• Last section combines these two threads.

• Propose a local clustering algorithm that looks for triangles.

• Study the estimation performance of this algorithm under a local SBM. This is a semi-parametric network model.

Page 26:

There are global and local versions of our algorithm

Global: Define $L_{\tau ij} = [D_\tau^{-1/2} A D_\tau^{-1/2}]_{ij}$, where $[D_\tau]_{ii} = \deg(i) + \tau$. Compute $[L_\tau L_\tau]_{ij}\, L_{\tau ij}$, then run single-linkage hierarchical clustering (i.e., find the maximum spanning tree).

Page 27:

There are global and local versions of our algorithm

The “regularized graph Laplacian” was proposed by Chaudhuri, Chung, and Tsiatas (2012) and Amini, Chen, Bickel, and Levina (2012).

τ = average degree is a reasonable choice (Qin and Rohe, 2013).

Global: Compute $[L_\tau L_\tau]_{ij}\, L_{\tau ij}$, where $L_{\tau ij} = [D_\tau^{-1/2} A D_\tau^{-1/2}]_{ij}$ and $[D_\tau]_{ii} = \deg(i) + \tau$.
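As a concrete illustration, here is a minimal NumPy sketch of this construction (a sketch, not the authors' code; the function name is ours and dense arrays are for exposition only):

import numpy as np

def regularized_laplacian(A, tau=None):
    # L_tau = D_tau^{-1/2} A D_tau^{-1/2}, with [D_tau]_ii = deg(i) + tau
    deg = A.sum(axis=1).astype(float)
    if tau is None:
        tau = deg.mean()  # tau = average degree, the default suggested above
    d = 1.0 / np.sqrt(deg + tau)
    return d[:, None] * A * d[None, :]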

Page 28:

Our local algorithm searches for edges in several triangles...

Page 29:

S

Initialize S = {Jennifer}

Page 30:

S

Examine edges crossing boundary of S.

Page 31:

S

If an edge is in a triangle, add that node.

Page 32:

If an edge is in a triangle, add that node.

S

Page 33:

Iterate until no crossing edge is in a triangle.

S

Page 34:

Modification 1

• Tuning parameter: the required number of triangles.

• It regulates the trade-off between the size and the density of the cluster.

Page 35:

Find the entire path of solutions, for every seed node:

This is a similarity matrix. Use it to perform single linkage hierarchical clustering.

Cut the dendrogram at the required number of triangles. Returns ALL local clusters.

$O^3 \in \mathbb{R}^{n \times n}$ with $O^3_{ij} = [AA]_{ij} A_{ij}$: the number of triangles that contain edge $(i,j)$.

Modification 1
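A sketch of this global procedure, assuming a symmetric 0/1 adjacency matrix A as a NumPy array (the function names are ours; SciPy's single-linkage routine expects distances, so the similarity is flipped):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def triangle_similarity(A):
    # O3[i, j] = [A A]_{ij} A_{ij}: the number of triangles through edge (i, j)
    return (A @ A) * A

def all_local_clusters(A, cut):
    # single-linkage on a similarity matrix = maximum spanning tree
    O3 = triangle_similarity(A).astype(float)
    dist = O3.max() - O3          # flip the similarity into a distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="single")
    # merging at distance <= max - cut means similarity >= cut triangles
    return fcluster(Z, t=O3.max() - cut, criterion="distance")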

Page 36:

Modification 2.

• Edges from high-degree nodes need to be down-weighted.

• Replace the adjacency matrix with a row and column normalized version.

Page 37:

The regularized graph Laplacian

τ = average degree is a reasonable choice (Qin and Rohe, 2013).

Proposed for spectral clustering by Chaudhuri, Chung, and Tsiatas (2012) and Amini, Chen, Bickel, and Levina (2012).

$[D_\tau]_{ii} = \deg(i) + \tau$

Page 38:

Modification 2: Replace $O^3_{ij} = [AA]_{ij} A_{ij}$ with $O^{3,L}_{ij} = [L_\tau L_\tau]_{ij}\, L_{\tau ij}$.
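In matrix form this is a one-line change (the naming is ours):

def triangle_similarity_L(L):
    # O3L = (L @ L) * L; the elementwise product keeps only actual edges (i, j)
    return (L @ L) * L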

Page 39:

The local version of the algorithm only searches over edges connecting to the final cluster.

• To find one branch of the dendrogram . . .

Algorithm 1 LocalTrans(L, i, cut)

1. Initialize the set S to contain node i.
2. For each edge $(i, \ell)$ on the boundary of S ($i \in S$ and $\ell \notin S$), calculate $O^{3,L}_{i\ell} = L_{i\ell} \sum_k L_{ik} L_{k\ell}$.
3. If there exists any edge(s) $(i, \ell)$ on the boundary of S with $O^{3,L}_{i\ell} \ge \mathrm{cut}$, then add the corresponding node(s) $\ell$ to S and return to step 2.
4. Return S.
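A direct translation of Algorithm 1, assuming L is a symmetric NumPy array (a sketch under those assumptions, not the authors' implementation):

import numpy as np

def local_trans(L, i, cut):
    # grow S outward from the seed i, adding a boundary node l whenever
    # the crossing edge sits in at least `cut` (weighted) triangles
    S = {i}
    added = True
    while added:
        added = False
        for u in list(S):
            for l in np.nonzero(L[u])[0]:        # edges (u, l) leaving u
                if l in S:
                    continue
                o3 = L[u, l] * (L[u] @ L[:, l])  # O3_{ul} = L_ul * sum_k L_uk L_kl
                if o3 >= cut:
                    S.add(int(l))
                    added = True
    return S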

Page 40:

A local model to study a local algorithm.

Since the algorithm only searches locally, we should have a “local model.” WAVE HANDS. SHOW MODEL IN TWO PARTS.

Page 41:

Local and semi-parametric Stochastic Blockmodel

• The network model contains three parts: (i) the seed node, (ii) its local block, (iii) the rest of the network.

• Assume (iii) is sparse and is independent of edges from {(i) and (ii)} to (iii).

• No parametric / degree distribution / maximum degree assumptions on (iii).

• Asymptotically, (iii) grows and (ii) stays fixed.

• Node degrees in (ii) cannot grow too fast.

[Diagram: the seed node (i) sits inside its local block (ii), which attaches to the rest of the network, the (iii) “hairball.”]
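To make the model concrete, a toy sampler (the Erdős-Rényi hairball is our illustrative stand-in; the model itself places no parametric assumption on part (iii)):

import numpy as np

def sample_local_sbm(s, n, p_in, p_out, p_ambient, seed=0):
    # nodes 0..s-1: the local block, with node 0 as the seed node (i);
    # nodes s..n-1: the 'hairball' (iii), required only to be sparse
    rng = np.random.default_rng(seed)
    A = (rng.random((n, n)) < p_ambient).astype(int)    # (iii) the rest
    A[:s, :s] = rng.random((s, s)) < p_in               # (ii) the local block
    A[:s, s:] = rng.random((s, n - s)) < p_out          # block-to-hairball edges
    A[s:, :s] = A[:s, s:].T
    A = np.triu(A, 1)                                   # drop self-loops
    return A + A.T                                      # symmetrize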

Page 42:

Local models

• First version is not degree corrected

• good performance with unweighted triangle counting (i.e. use A)

• This algorithm has not worked with real data.

• Second version is degree corrected.

• use $L_\tau$.

• This algorithm has worked with real data.

Page 43:

Theorem
If: 1) correct seeding, 2) the ambient graph is sparse, then the local algorithm returns the correct clusters whp.

Page 44:

Theorem
If: 1) correct seeding, 2) the ambient graph is sparse, then the local algorithm returns the correct clusters whp.

Does not require: 1) growing degrees, 2) growing s, 3) a specified model on the hairball, 4) bounded degree in the hairball.

Page 45:

Theorem


Theorem 2. Under the local Stochastic Blockmodel, if
$$\sum_{i,j \in S_*^c} A_{ij} \le n\Delta,$$
then

(1) cut = 1: for all $i \in S_*$, $\mathrm{LocalTrans}(A, i, \mathrm{cut}=1) = S_*$ with probability greater than
$$1 - \Big( \tfrac{1}{2}\, s^2 (1 - p_{in}^2)^{s-2} + O\big(p_{out}^2\, n s (s + \Delta)\big) \Big).$$

(2) cut = 2: for all $i \in S_*$, $\mathrm{LocalTrans}(A, i, \mathrm{cut}=2) = S_*$ with probability greater than
$$1 - \Big( s^3 (1 - p_{in}^2)^{s-3} + O\big(p_{out}^3\, n s (s + \Delta)^2\big) \Big).$$

Probability does not converge to one when s is fixed.
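For a rough feel of the cut = 2 bound, the leading term $s^3(1-p_{in}^2)^{s-3}$ can be evaluated directly (a back-of-the-envelope check that ignores the $O(\cdot)$ term):

def leading_term(s, p_in):
    # dominant failure probability in Theorem 2, cut = 2
    return s**3 * (1 - p_in**2) ** (s - 3)

print(leading_term(10, 0.9))  # ~ 0.009: a tight 10-node block is recovered whp
print(leading_term(10, 0.5))  # ~ 134: vacuous; looser blocks need larger s or p_in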

If: 1) correct seeding, 2) the ambient graph is sparse, then the local algorithm returns the correct clusters whp.

Does not require: 1) growing degrees, 2) growing s, 3) a specified model on the hairball, 4) bounded degree in the hairball.

Page 46:

A similar theorem holds for the degree-corrected model, using $L_\tau$.


Theorem 1. Let $A$ come from the local degree-corrected Stochastic Blockmodel. Define $\Delta$ such that
$$\sum_{i,j \in S_*^c} A_{ij} \le n\Delta.$$
Set $\mathrm{cut} = [2(s-1)p_{in} + 2\Delta + \tau]^{-3}$. If $n$ is sufficiently large and $s \ge 3$, then for any $i \in S_*$,
$$\mathrm{LocalTrans}(L_\tau, i, \mathrm{cut}) = S_*$$
with probability at least
$$1 - \Big( \tfrac{1}{2}\, s^2 (1 - p_{in}^2)^{s-2} + s \exp\big(-\tfrac{1}{4}(s\, p_{in} + \Delta)\big) + O(n^{3\epsilon - 1}) \Big).$$

Probability does not converge to one when s is fixed.

Page 47:

The algorithm finds small communities in large networks.

• Two online social networks: epinions and slashdot.

• Both 70,000+ nodes. Done on my wimpy laptop, in R!

• Data from http://snap.stanford.edu

• Next slides show the induced subgraphs of several local clusters
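An end-to-end sketch of such a run (the file name, seed, and cut are illustrative; at 70,000+ nodes one would use scipy.sparse rather than the dense arrays sketched above):

import numpy as np

edges = np.loadtxt("soc-Epinions1.txt", dtype=int)  # edge list from snap.stanford.edu
n = int(edges.max()) + 1
A = np.zeros((n, n), dtype=np.int8)                 # dense only for exposition
A[edges[:, 0], edges[:, 1]] = 1
A = np.maximum(A, A.T)                              # treat the graph as undirected
L = regularized_laplacian(A)                        # defined above
print(sorted(local_trans(L, 0, cut=1e-6)))          # one local cluster around node 0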

Page 48:

[Figure: induced subgraphs of several local clusters found in epinions.com; nodes within each cluster are numbered.]

Page 49:

[Figure: the induced subgraph of one local cluster in epinions.]

Page 50:

[Figure: induced subgraphs of several local clusters found in slashdot.org; nodes within each cluster are numbered.]

Page 51:

[Figure: the induced subgraph of one local cluster in slashdot.]

Page 52:

Recap

1. Local clustering is justified for several reasons.

a) new types of questions with massive networks, Dunbar's number and other empirical evidence.

b) computation, visualization, interpretation, diagnostics

2. Since we want to make inferences, we need a model with local clusters. Requires caution!

a) Forgetting the motivation of local clusters...

b) Sparse and transitive SBMs have small blocks.

3. The blessing of transitivity allows fast (local) algorithms and local inference with drastically reduced "global assumptions."

Page 53:

Thank you!