
Page 1:

Spectral Clustering Algorithms and High-dim Stochastic

Blockmodels for Undirected and Directed Graphs

Bin Yu, Statistics and EECS, UC Berkeley

based on Rohe, Chatterjee, and Yu (AoS, 2011) and Rohe and Yu (in prep, 2011)

IMA Workshop on High Dim Phenomenon, September 26, 2011

1

Page 2:

Outline

1. Spectral Clustering for Undirected Graphs: review

2. High-dim Analysis of Spectral Clustering under Stochastic Blockmodel

3. Di-Sim: Spectral Clustering for Directed Graphs

4. The Directed Stochastic Blockmodel and High-dim Analysis of Di-Sim under this model

5. Preliminary analysis of Enron data

2

Page 3:

Networks represent interacting actors

[Figure: network community profile plots for several small social networks, plotting conductance Φ against k (the number of nodes in the cluster).]

Figure 5: Depiction of several small social networks that are common test sets for community detection algorithms and their network community profile plots. (5(a)–5(b)) Zachary's karate club network. (5(c)–5(d)) A network of dolphins. (5(e)–5(f)) A network of monks. (5(g)–5(h)) A network of researchers researching networks.

Leskovec et al. 2008

The Spread of Obesity in a Large Social Network Over 32 Years

[Figure 1 from the article: Largest Connected Subcomponent of the Social Network in the Framingham Heart Study in the Year 2000. Each circle (node) represents one person in the data set. There are 2200 persons in this subcomponent of the social network. Circles with red borders denote women, and circles with blue borders denote men. The size of each circle is proportional to the person's body-mass index. The interior color of the circles indicates the person's obesity status: yellow denotes an obese person (body-mass index ≥ 30) and green denotes a nonobese person. The colors of the ties between the nodes indicate the relationship between them: purple denotes a friendship or marital tie and orange denotes a familial tie.]

Christakis and Fowler 2007. Read Russ Lyons' Critique.

• Defined by nodes and edges
• Appear in diverse applications
• Modern networks are massive
• Clusters/communities are important and have been a focus of research. We want to find them.


3

Page 4:

Community detection is a difficult problem with an extensive literature.

The graph G = (V, E) contains a vertex set and an edge set.

One formulation: a function f_G measures the quality of a partition. With

C_k = { (C_1, ..., C_k) : for all i ≠ j, C_i ∩ C_j = ∅ and ∪_i C_i = V },

find

argmax_{(C_1,...,C_k) ∈ C_k} f_G(C_1, ..., C_k).

Often this problem is NP hard.

4

Page 5:

Spectral clustering

• Finds connected components in an electrical network [Fiedler, 1973]
• Spectral graph theory [Chung, 1997]
• Convex relaxation of graph cut problem [Shi and Malik, 2000; Ng et al., 2002]
• Intimate connection to electrical network theory and random walks on graphs [Klein and Randic, 1993; Meila and Shi, 2001]
• Two good reviews [Luxburg, 2007; Spielman, 2009]

5

Page 6:

Notation

Symmetric adjacency matrix W ∈ R^{n×n}:

W_ij = W_ji = 1 if (i, j) ∈ E, and 0 otherwise.

Degree matrix (diagonal) D ∈ R^{n×n}:

D_ii = Σ_j W_ij.

6

Page 7:

Different Laplacians

Symmetric graph Laplacian: L_0 = D − W

Normalized symmetric graph Laplacian: L_N = I − D^{-1/2} W D^{-1/2}

Random walk Laplacian: L_rw = I − D^{-1} W

[Our (normalized)] symmetric graph Laplacian: L = D^{-1/2} W D^{-1/2}

7
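As a concrete reference, here is a minimal R sketch (ours, not from the slides) that builds these four matrices from a symmetric 0/1 adjacency matrix W; it assumes every node has at least one edge so that D is invertible.

laplacians <- function(W) {
  d <- rowSums(W)                                  # degrees D_ii
  D_inv_sqrt <- diag(1 / sqrt(d))                  # D^{-1/2}
  L0  <- diag(d) - W                               # symmetric graph Laplacian D - W
  L   <- D_inv_sqrt %*% W %*% D_inv_sqrt           # our Laplacian D^{-1/2} W D^{-1/2}
  LN  <- diag(nrow(W)) - L                         # normalized symmetric Laplacian I - L
  Lrw <- diag(nrow(W)) - diag(1 / d) %*% W         # random walk Laplacian I - D^{-1} W
  list(L0 = L0, L = L, LN = LN, Lrw = Lrw)
}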

Page 8:

Some properties of their eigensystems

8

1. (1, ..., 1) is always an eigenvector of L_0 with eigenvalue 0.

2. The multiplicity of 0 as an eigenvalue of L_0 is equal to the number of connected components of G.

3. L_N and L share the same eigenvectors, and their eigenvalues add up to 1.
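These properties are easy to check numerically. A hedged R sketch, reusing the laplacians() helper from the previous sketch on a toy graph with two connected components:

W <- matrix(0, 6, 6)
W[1, 2] <- W[2, 3] <- W[1, 3] <- 1                 # triangle on nodes 1-3
W[4, 5] <- W[5, 6] <- W[4, 6] <- 1                 # triangle on nodes 4-6
W <- W + t(W)                                      # symmetrize
lap <- laplacians(W)
eigen(lap$L0)$values                               # the two smallest eigenvalues are 0 (two components)
lap$L0 %*% rep(1, 6)                               # (1,...,1) is in the null space of L0
eigen(lap$LN)$values + rev(eigen(lap$L)$values)    # every entry equals 1: eigenvalues of LN and L add up to 1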

Page 9:

The principal eigenvector of L

For any u ∈ R^n, with v = D^{-1/2} u,

u^T (I − L) u = u^T (I − D^{-1/2} W D^{-1/2}) u = u^T D^{-1/2} (D − W) D^{-1/2} u = v^T (D − W) v
  = Σ_i v_i^2 D_ii − Σ_{i,j} v_i v_j W_ij
  = (1/2) [ Σ_i v_i^2 D_ii − 2 Σ_{i,j} v_i v_j W_ij + Σ_j v_j^2 D_jj ]
  = (1/2) Σ_{ij} W_ij (v_i − v_j)^2
  = (1/2) Σ_{ij} W_ij (u_i D_ii^{-1/2} − u_j D_jj^{-1/2})^2 ≥ 0.

Thus I − L is positive semi-definite. If u is such that u_i = D_ii^{1/2}, then u^T (I − L) u = 0. As a result, u is the eigenvector with the smallest eigenvalue for I − L. Also, (1) all eigenvalues of L are less than or equal to one; (2) u is an eigenvector of L with eigenvalue 1.

9
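A quick numerical check of this claim, reusing W and the laplacians() helper from the earlier sketches:

u <- sqrt(rowSums(W))                              # u_i = D_ii^{1/2}
range(laplacians(W)$L %*% u - u)                   # approximately 0: L u = u, i.e. eigenvalue 1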

Page 10:

• The second eigenvalue of the normalized Laplacian (and of our Laplacian) is closely connected to the conductance of a graph via Cheeger's inequality.

• This eigenvalue also essentially controls the mixing speed of a random walk on the graph.

10

Connecting Laplacians with graph partition or cut

Page 11:

The spectral clustering algorithm

Input: Adjacency matrix W ∈ {0,1}^{n×n}.

1. Find the eigenvectors X_1, ..., X_k ∈ R^n corresponding to the k largest eigenvalues of the Laplacian L. L is symmetric, so choose these eigenvectors to be orthogonal.

2. Form the eigenvector matrix X = [X_1, ..., X_k] ∈ R^{n×k} by putting the eigenvectors into the columns.

3. Treating each of the n rows in X as a point in R^k, run k-means. This creates k non-overlapping sets A_1, ..., A_k whose union is {1, ..., n}.

Output: A_1, ..., A_k. This means that node i is assigned to cluster g if the ith row of X is assigned to A_g in step 3. (A sketch in code follows below.)

11
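A minimal R sketch of the algorithm above; the function name is ours, and it assumes W is a symmetric 0/1 adjacency matrix with no isolated nodes.

spectral_cluster <- function(W, k) {
  d <- rowSums(W)
  L <- diag(1 / sqrt(d)) %*% W %*% diag(1 / sqrt(d))   # L = D^{-1/2} W D^{-1/2}
  X <- eigen(L, symmetric = TRUE)$vectors[, 1:k]       # steps 1-2: top-k orthogonal eigenvectors
  kmeans(X, centers = k, nstart = 20)$cluster          # step 3: k-means on the n rows of X
}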

Page 12:

µ ∈ R^{n×k} has k distinct rows

R(n, k) = { ST : S ∈ R^{n×k} is a partition matrix and T ∈ R^{k×k} }

• Contains matrices with k columns and n rows, but only k unique rows.
• The unique rows correspond to the k centers in the k-means objective function.

k-means:  min_{M ∈ R(n,k)} ‖X − M‖_F^2 = min_{{m_1,...,m_k} ⊂ R^k} Σ_i min_g ‖x_i − m_g‖_2^2,

µ = argmin_{M ∈ R(n,k)} ‖X − M‖_F^2  (illustrated in code below).

12
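To make the connection concrete, a hedged R illustration (assuming X is an n x k eigenvector matrix, as in step 2 of the algorithm, and k is the number of clusters): the k-means fit is the n x k matrix whose ith row is the center assigned to row i of X, so it has only k unique rows.

km <- kmeans(X, centers = k, nstart = 20)
mu <- km$centers[km$cluster, ]                     # n x k matrix with only k unique rows
sum((X - mu)^2)                                    # the k-means objective; equals km$tot.withinss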

Page 13:

A pictorial representation of the algorithm

[Figure: the adjacency matrix W ⟹ its first two eigenvectors plotted as points, and the same points with the k-means centers marked.]

13

Page 14:

Two types of community detection algorithms

Parametric model fitting: given a network model where community memberships have a parametric representation, estimate these parameters, the "true" clusters.
Recent examples: Nowicki and Snijders [2001]; Handcock et al. [2007]; Airoldi et al. [2008].

Algorithmic and data driven: 1. motivated by observations, heuristics, and insights into the appearance of clusters; 2. hopes to find plausible clusters.
Recent examples: Spectral Clustering; Girvan–Newman.

14

Page 15:

• Empirical performance validation through down-stream analysis.

• CS approach: could show that particular algorithms yield approximate solutions that are close to the true empirical optimum for a particular set of network data.

Why should you trust an algorithm?

15

Page 16:

• Stat approach: how does algorithm perform on tractable stochastic models with clearly defined population clusters

• Stochastic Block Model (SBM):

• each node belongs to a block

• edges are independent Bernoulli variables

• probability of an edge between two nodes depends only on the block membership of those two nodes.

Why should you trust an algorithm?

16

Page 17:

• Previous consistency results fix the number of blocks.

• Bickel and Chen, Snijders and Nowicki, ...

• What about “high-dimensional” setting, where the number of blocks grows with the number of nodes?

• The "best" communities in a wide variety of empirical networks are much smaller than the number of nodes and have sizes of at most a few hundred. Leskovec et al. 2008

• Dunbar’s number from biology

Previous results are not sufficient.

17

Page 18:

Outline

1. Spectral Clustering for Undirected Graphs: review

2. High-dim Analysis of Spectral Clustering under Stochastic Blockmodel

3. Di-Sim: Spectral Clustering for Directed Graphs

4. The Directed Stochastic Blockmodel and High-dim Analysis of Di-Sim under this model

5. Preliminary analysis of Enron data

18

Page 19:

Our results assess the performance of a useful algorithm, Spectral Clustering, under a stochastic parametric model.

Spectral Clustering — the Computer Science approach: algorithmic and data driven.

The Stochastic Block Model — the Statistical approach: a parametric model.

19

Page 20:

Our results are high-dimensional.

• First results analyzing spectral clustering under the stochastic block model while allowing the number of clusters/groups to grow.

• We only get a bound on the number of "misclustered" nodes, not consistency as shown for other algorithms.

• Convergence of eigenvectors: a novel result in its own right

20

Page 21:

Results under a High-dim Stochastic Block Model

• Edges: independent, Bernoulli.
• Divide nodes into k blocks.
• Let each block have an equal proportion of the nodes.
• For simplicity, assume the probability of an edge between two nodes is p if both nodes are in the same block and r otherwise.

Clustering: estimate the block membership for each node. (A simulation sketch follows below.)

21
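A hedged R sketch (ours, not from the slides) that simulates this simple blockmodel; it assumes n is divisible by k and excludes self-loops.

simulate_sbm <- function(n, k, p, r) {
  z <- rep(1:k, each = n / k)                      # equal-sized blocks
  P <- ifelse(outer(z, z, "=="), p, r)             # edge probability p within, r between blocks
  W <- matrix(rbinom(n * n, 1, P), n, n)           # independent Bernoulli edges
  W[lower.tri(W)] <- t(W)[lower.tri(W)]            # symmetrize for an undirected graph
  diag(W) <- 0                                     # no self-loops (our choice)
  list(W = W, z = z)
}
# Hypothetical usage with the spectral_cluster() sketch from earlier:
# sim <- simulate_sbm(600, 3, .5, .1)
# table(sim$z, spectral_cluster(sim$W, 3))         # estimated vs. true block memberships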

Page 22:

More notation

𝓛 = (E D)^{-1/2} E(W) (E D)^{-1/2}: the population version of the graph Laplacian.

𝒳 ∈ R^{n×k}: columns are unit-length eigenvectors of 𝓛. It can be shown that 𝒳 has k unique rows under the SBM.

X ∈ R^{n×k}: columns are unit-length eigenvectors of L.

µ ∈ R^{n×k}: the centers from the k-means applied on the rows of X. Only k unique rows.

O ∈ R^{k×k}: the orthonormal rotation that minimizes ‖XO − 𝒳‖_F (don't worry about this).

22

Page 23:

Results bound the number of “misclustered” nodes

• To show that spectral clustering performs well on the Stochastic Block Model, bound the number of “misclustered” nodes.

• We need a definition of “misclustered.”

23

Page 24:

What is “misclustered”

𝒳 and µ each have n rows, one for each node. Each has exactly k unique rows. These k unique rows identify the membership in the k true clusters and the k estimated clusters.

24

Page 25:

What is "misclustered"

𝒳_i, µ_i ∈ R^k denote the ith row of the corresponding matrix.

If 𝒳_i = 𝒳_j, then the corresponding nodes are in the same block (true cluster).

If µ_i = µ_j, then spectral clustering estimates that the corresponding nodes are in the same block.

25

Page 26:

What is “misclustered”

Choose c such that for any i and j that do not belong to the same block, ‖𝒳_i − 𝒳_j‖_2 > c.

Define the set of misclustered nodes:

B = { i : ‖µ_i O − 𝒳_i‖_2 > c/2 }.

Then ‖µ_i O − 𝒳_i‖_2 < c/2 ⟹ ‖µ_i O − 𝒳_i‖_2 < ‖µ_i O − 𝒳_j‖_2.

26

Page 27:

Results for the high-dim SBM

Under the simple Stochastic Block Model described above, for p ≠ r (both > 0), under conditions on the eigenvalues of 𝓛, and with the minimum expected degree of the nodes of order O(n), the number of misclustered nodes is bounded:

|B| = o( k^3 (log n)^2 )  a.s.

Ignoring rounding problems, k = n^α for α < 1/3 ⟹ |B|/n = o(1).

27

Page 28:

Main proof ideas: concentration + Davis–Kahan

Key: concentration happens for LL, but not for L:

‖LL − 𝓛𝓛‖_F → 0.

Note that Lx = λx ⟹ LLx = λ^2 x.

The Davis–Kahan theorem bounds the difference between the eigenvectors:

‖XO − 𝒳‖_F → 0.

28

Page 29:

Proof for the k-means step

|B| ≤ Σ_{i∈B} 1 ≤ (2/c^2) Σ_{i∈B} ‖µ_i O − 𝒳_i‖_2^2 ≤ (2/c^2) ‖µO − 𝒳‖_F^2 ≤ (8/c^2) ‖XO − 𝒳‖_F^2 → 0,

using

‖µO − 𝒳‖_F ≤ ‖µO − XO‖_F + ‖XO − 𝒳‖_F ≤ 2 ‖XO − 𝒳‖_F.

Page 30:

Outline

1. Spectral Clustering for Undirected Graphs: review

2. High-dim Analysis of Spectral Clustering under Stochastic Blockmodel

3. Di-Sim: Spectral Clustering for Directed Graphs

4. The Directed Stochastic Blockmodel and High-dim Analysis of Di-Sim under this model

5. Preliminary analysis of Enron data

30

Page 31:

• As a result of the Enron investigation, a large portion of the corporation’s emails between Nov. 1998 and June 2002 became public.

• These emails create a directed network on 154 employees.

• A_ij is the number of emails from i to j.

Enron data

31

Page 32:

Spectral clustering for directed graphs

• Relationships often have a direction

• This makes the graph Laplacian asymmetric

• Standard spectral clustering would produce complex eigenvectors.

• The next slides propose and examine a spectral clustering algorithm that replaces the eigen-decomposition with singular value decomposition.

32

Page 33:

Notation

The adjacency matrix for a directed graph is asymmetric: W ∈ {0, 1}^{n×n} with

W_ij = 1 if i → j, and 0 otherwise.

Alternatively, the adjacency matrix can be weighted: W_ij > 0 if i → j.

33

Page 34:

Directed graph Laplacian

P_ii = Σ_k W_ki = Σ_k 1{k → i}

O_ii = Σ_k W_ik = Σ_k 1{i → k}

L_ij = W_ij / sqrt(O_ii P_jj) = 1{i → j} / sqrt(O_ii P_jj).

34

Page 35:

Directed graph Laplacian

35

If the adjacency matrix is unweighted:

number of parents or in-links:  P_ii = Σ_k W_ki = Σ_k 1{k → i}

number of offspring or out-links:  O_ii = Σ_k W_ik = Σ_k 1{i → k}

L_ij = W_ij / sqrt(O_ii P_jj) = 1{i → j} / sqrt(O_ii P_jj).
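A minimal R sketch of this directed Laplacian (our helper; it assumes W has no empty rows or columns, so the scalings are well defined):

directed_laplacian <- function(W) {
  O <- rowSums(W)                                  # O_ii: out-degrees (offspring)
  P <- colSums(W)                                  # P_jj: in-degrees (parents)
  diag(1 / sqrt(O)) %*% W %*% diag(1 / sqrt(P))    # L_ij = W_ij / sqrt(O_ii P_jj)
}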

Page 36:

36

Di-Sim (pronounced "Dice 'em"): a new directed spectral clustering algorithm

Input: Adjacency matrix W ∈ {0, 1}^{n×n}, number of clusters k.

1. Compute the singular value decomposition of L = U Σ V^T. Remove the columns of U and V that correspond to the n − k smallest singular values in Σ. Call the resulting matrices U_k ∈ R^{n×k} and V_k ∈ R^{n×k}.

2. Cluster the nodes based on similar parents by treating each row of V_k as a point in R^k. Cluster these points with k-means. Because each row of V_k corresponds to a node in the graph, the resulting clusters are clusters of the nodes.

3. Cluster the nodes based on similar children by performing step two on the matrix U_k.

Output: the clusters from Steps 2 and 3. (A sketch in code follows below.)
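A hedged R sketch of Di-Sim as stated above, reusing the directed_laplacian() helper from the earlier sketch; the function name is ours.

di_sim <- function(W, k) {
  L <- directed_laplacian(W)
  s <- svd(L, nu = k, nv = k)                      # keep the k leading singular vectors
  list(
    by_parents   = kmeans(s$v, centers = k, nstart = 20)$cluster,   # step 2: rows of V_k
    by_offspring = kmeans(s$u, centers = k, nstart = 20)$cluster    # step 3: rows of U_k
  )
}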

Page 37:

Di-Sim gives distinct clusterings

• The directed spectral clustering algorithm creates two sets of clusters.

• To conceptualize these clusters, the next slides explain SVD.

37

Page 38:

SVD

SVD: L = U Σ V^T

U and V contain the eigenvectors of the symmetric, positive semi-definite matrices L L^T and L^T L respectively. If L is symmetric, U contains the eigenvectors of L and U = V.

38

Page 39:

SVD

SVD: L = U Σ V^T

U and V contain the eigenvectors of the symmetric, positive semi-definite matrices L L^T and L^T L respectively. If L is symmetric, U contains the eigenvectors of L and U = V.

39

Page 40:

Interpreting L^T L and L L^T

(L^T L)_ab = (1 / sqrt(P_aa P_bb)) Σ_x 1{x → a and x → b} / O_xx

(L L^T)_ab = (1 / sqrt(O_aa O_bb)) Σ_x 1{a → x and b → x} / P_xx

x is a common parent (in-link) of a and b.

40

Page 41:

Interpreting L^T L and L L^T

(L^T L)_ab = (1 / sqrt(P_aa P_bb)) Σ_x 1{x → a and x → b} / O_xx

(L L^T)_ab = (1 / sqrt(O_aa O_bb)) Σ_x 1{a → x and b → x} / P_xx

a and b have the common offspring (out-link) x.

41

Page 42:

The graph Laplacian

Parents with many offspring (out-links) or offspring with many parents (in-links) are down-weighted:

(L^T L)_ab = (1 / sqrt(P_aa P_bb)) Σ_x 1{x → a and x → b} / O_xx

(L L^T)_ab = (1 / sqrt(O_aa O_bb)) Σ_x 1{a → x and b → x} / P_xx

42

Page 43:

Di-Sim uses the eigenvectors of two similarity matrices

Two symmetric notions of similarity in directed graphs:

• Weighted count of common parents (in-links): a ← x → b.
• Weighted count of common offspring (out-links): a → x ← b.

43

Page 44:

Outline

1. Spectral Clustering for Undirected Graphs: review

2. High-dim Analysis of Spectral Clustering under Stochastic Blockmodel

3. Di-Sim: Spectral Clustering for Directed Graphs

4. The Directed Stochastic Blockmodel and High-dim Analysis of Di-Sim under this model

5. Preliminary analysis of Enron data

44

Page 45:

Two similarity measures, two clusterings

• Di-Sim estimates two sets of clusters based on two symmetric similarity measures.

• To analyze the estimation properties of Di-Sim, we use the Directed Stochastic Blockmodel (D-SBM)

• it is parameterized by two partitions.

45

Page 46:

A Directed Stochastic Blockmodel (D-SBM)

(The first generalization of the Stochastic Blockmodel to directed edges uses a single partition of the nodes and allows reciprocal edges to be dependent; Wang and Wong, 1987.)

Edges are independent, Bernoulli variables with E(W) = S B R^T, where

B ∈ [0, 1]^{k×k} contains probabilities;

S, R ∈ {0, 1}^{n×k} are fixed matrices that partition the nodes. Each row contains exactly one 1. (Every column contains at least one 1.)

Page 47:

Example

47

E(W) = S B R^T, with

S = [1 0; 1 0; 1 0; 0 1; 0 1],  B = [.5 .1; .06 .7],  R^T = [1 1 0 0 0; 0 0 1 1 1],

so that

E(W) =
.5   .5   .1  .1  .1
.5   .5   .1  .1  .1
.5   .5   .1  .1  .1
.06  .06  .7  .7  .7
.06  .06  .7  .7  .7

Page 48:

Example

48

E(W) = S B R^T, with

S = [1 0; 1 0; 1 0; 0 1; 0 1],  B = [.5 .1; .06 .7],  R^T = [1 1 0 0 0; 0 0 1 1 1],

so that

E(W) =
.5   .5   .1  .1  .1
.5   .5   .1  .1  .1
.5   .5   .1  .1  .1
.06  .06  .7  .7  .7
.06  .06  .7  .7  .7

If the ith and jth rows of S are equal, then i and j send edges to several of the same nodes.

If the ith and jth rows of R are equal, then i and j receive edges from several of the same nodes.

Page 49:

Simulation with the Directed Stochastic Blockmodel: 3 blocks in both S and R; 40 nodes in each block; memberships in S and R are random and independent of each other; diagonal of B = .4, off-diagonal = .1.

Simulation of L^T L and L L^T. (A sketch in code follows below.)

49
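A sketch in R reproducing roughly this setup (assumptions on our part: memberships are drawn uniformly at random, so block sizes are only approximately 40, and self-loops are not excluded):

set.seed(1)                                        # arbitrary seed
n <- 120; k <- 3
B <- matrix(.1, k, k); diag(B) <- .4               # off-diagonal .1, diagonal .4
s_mem <- sample(1:k, n, replace = TRUE)            # memberships defining S
r_mem <- sample(1:k, n, replace = TRUE)            # memberships defining R, independent of S
S <- diag(k)[s_mem, ]; R <- diag(k)[r_mem, ]       # n x k indicator matrices
W <- matrix(rbinom(n * n, 1, S %*% B %*% t(R)), n, n)   # E(W) = S B R^T, Bernoulli edges
L <- directed_laplacian(W)                         # helper sketched earlier
ord_r <- order(r_mem); ord_s <- order(s_mem)
image((t(L) %*% L)[ord_r, ord_r])                  # "common parents", ordered by R
image((L %*% t(L))[ord_s, ord_s])                  # "common offspring", ordered by S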

Page 50:

[Images: L^T L ("common parents"), with rows and columns ordered by R, and L L^T ("common offspring"), ordered by S.]

50

Page 51:

L^T L and L L^T appear approximately low rank. They contain information relating to R and S respectively.

• SVD can discover the block structures in the Directed Stochastic Blockmodel.

• The left and right singular vectors estimate the partitions in S and R respectively.

• In this simulation, the left singular vectors reveal absolutely nothing about the partition in R. Similarly for the right vectors and S.

51

Page 52:

The two sets of clusters in Di-Sim estimate the two partitions in the Directed Stochastic Blockmodel.

Similar theoretical results hold: the proportion of misclustered nodes goes to zero with high probability under appropriate conditions.

52

Page 53:

Outline

1. Spectral Clustering for Undirected Graphs: review

2. High-dim Analysis of Spectral Clustering under Stochastic Blockmodel

3. Di-Sim: Spectral Clustering for Directed Graphs

4. The Directed Stochastic Blockmodel and High-dim Analysis of Di-Sim under this model

5. Preliminary analysis of Enron data

53

Page 54:

A very preliminary look at Enron data via Di-Sim

• Compute the left and right singular vectors of the directed graph Laplacian.

• Look at the 2nd, 3rd, 4th, and 5th vectors.

• This creates 2 matrices with 154 rows and 4 columns.

• Two rows in the matrix of left singular vectors are close if those two employees send messages to several of the same people.

[Plot of the singular values (s$d) against index.]

54

Page 55:

How different are the left and right singular vectors in the Enron data?

Left and right singular vectors U, V ∈ R^{154×4}.

To compare the rows of U to the rows of V, make sure they are "aligned":

O* = argmin_{O : O^T O = I} ‖U − V O‖_F.

[Histogram of the Euclidean distances between the rows of U and the rows of V O, computed as sqrt(rowSums((uo − v)^2)).]

55
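The alignment is an orthogonal Procrustes problem. A hedged R sketch (ours): it aligns U onto V, which matches the rowSums((uo − v)^2) expression shown on the slide; the rotation aligning V onto U is simply its transpose.

procrustes_align <- function(U, V) {
  s <- svd(t(U) %*% V)                             # t(U) %*% V = P diag(d) Q^T
  s$u %*% t(s$v)                                   # O = P Q^T minimizes ||U O - V||_F over orthonormal O
}
# Hypothetical usage, assuming U and V hold the 154 x 4 singular vectors:
# O <- procrustes_align(U, V); uo <- U %*% O
# hist(sqrt(rowSums((uo - V)^2)))                  # the histogram of distances on the slide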

Page 56:

[The same histogram of distances as on the previous slide, with one clear outlier.]

The outlier is Bill Williams. From the New York Times:"This is going to be a word-of-mouth kind of thing," Williams says on the tape. "We want you guys to get a little creative and come up with a reason to go down." After agreeing to take the plant down, the Nevada official questioned the reason. "O.K., so we're just coming down for some maintenance, like a forced outage type of thing?" Rich asks. "And that's cool?"

"Hopefully," Williams says, before both men laugh.

The next day, Jan. 17, 2001, as the plant was taken out of service, California called a power emergency, and rolling blackouts affected up to a half-million consumers, according to daily logs of the western power grid.

Tapes reveal Enron took a role in crisis. By Timothy Egan. Published: Saturday, February 5, 2005.

56

Page 57:

Un- vs directed analyses: Williams

57

← A + A^T

Page 58:

Differences in un- and directed analyses:

two women

58

Page 59:

Differences: another bad guy?

59

Page 60:

He is Enron’s governmental affairs executive Dasovich

60

government's argument on this issue: "All Enron public statements that I have read, including, I think, statements made to financial analysts, state that our exposure in California is not 'material.'"

Page 61:

Jeff Dasovich's email

To: Jeff Dasovich [Enron governmental affairs executive]
From: Jeff Dasovich
Date: Sept. 14, 2000
Subject: Observations on the Hearings this Week

...Observation – The pressure to finger somebody for "price gouging" [in California] is increasing. The administration is hell bent on finding a "fall guy." The price spikes pose real political risks for [California Gov. Gray] Davis and he and his folks need and want an easy way out. His press release following the hearing renewed the call for "refunds" ... The utilities repeatedly called on FERC to do a "real" investigation, with hearings, testimony, data discovery – the works...

Implication – It seems prudent for Enron to understand better its risks of getting fingered. In the best case, the clamoring for a "refund" subsides. In which case, the only cost to Enron is the internal cost incurred to understand better the risks of getting fingered. In the medium case, investigations find that Enron (like others) "played by the rules," but the rules stunk, and Enron profited at the expense of California consumers.

Page 62:

Jeff Dasovich Today

62

Page 63:

Parting Messages

• Useful to analyze practical algorithms under different generative models for insights.

• Good theoretical property for Spectral Clustering under SBM

• Insight from analysis motivates Di-Sim: a new spectral clustering EDA method for directed graphs

• Di-Sim applied to Enron data gives "interesting" results

63

Page 64:

• J. Leskovec, K.J. Lang, A. Dasgupta, and M.W. Mahoney. Statistical properties of community structure in large social and information networks. In Proc 17th International Conference on WWW. ACM, 2008.

• N.A. Christakis and J.H. Fowler. The spread of obesity in a large social network over 32 years. New England Journal of Medicine, 357(4):370, 2007. ( Read Russ Lyons’ Critique)

• M. Girvan and MEJ Newman. Community structure in social and biological networks. Proc. of the National Academy of Sciences, 99(12):7821, 2002.

• J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

• M. Meilă and J. Shi. A random walks view of spectral segmentation. AI and Statistics, 2001.

• DJ Klein and M. Randić. Resistance distance. Journal of Mathematical Chemistry, 12(1):81–95, 1993.

References

64

Page 65:

• M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(2):298–305, 1973.

• E.M. Airoldi, D.M. Blei, S.E. Fienberg, and E.P. Xing. Mixed membership stochastic blockmodels. The Journal of Machine Learning Research, 9:1981–2014, 2008.

• M.S. Handcock, A.E. Raftery, and J.M. Tantrum. Model-based clustering for social networks. Journal of the Royal Statistical Society-Series A, 170(2):301–354, 2007.

• K. Nowicki and T.A.B. Snijders. Estimation and Prediction for Stochastic Blockstructures. Journal of the American Statistical Association, 96(455), 2001.

• T.A.B. Snijders and K. Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75–100, 1997.

• P.J. Bickel and A. Chen. A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068, 2009.

• Y.J. Wang and G.Y. Wong. Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397):8–19, 1987. ISSN 0162-1459.

References

65

Page 66:

Other theoretical analyses of spectral clustering

• Under manifold assumptions: Belkin and Niyogi, 2003, 2005; Lafon, 2004; Giné and Koltchinskii, 2005; Coifman and Lafon, 2006; ...

• Under mixture of Gaussian assumptions: Shi, Belkin and Yu (2009)

66

Page 67:

The two sets of clusters in Di-Sim estimate the two partitions in the Directed Stochastic Blockmodel

[Excerpt from Rohe, Chatterjee and Yu (AoS, 2011):]

Recall that 𝒟_ii^(n) is the expected degree for node i. So, τ_n is the minimum expected degree, divided by the maximum possible degree. It measures how quickly the number of edges accumulates.

THEOREM 2.1. Define the sequence of random matrices W^(n) ∈ {0,1}^{n×n} to be from a sequence of latent space models with population matrices 𝒲^(n) ∈ [0,1]^{n×n}. With W^(n), define the observed graph Laplacian L^(n) as in (1.2). Let 𝓛^(n) be the population version of L^(n) as defined in (1.3). Define τ_n as in (2.1). If there exists N > 0, such that τ_n^2 log n > 2 for all n > N, then

‖L^(n) L^(n) − 𝓛^(n) 𝓛^(n)‖_F = o( log n / (τ_n^2 n^{1/2}) )  a.s.

Appendix A contains a nonasymptotic bound on ‖L^(n) L^(n) − 𝓛^(n) 𝓛^(n)‖_F as well as the proof of Theorem 2.1. The main condition in this theorem is the lower bound on τ_n. This sufficient condition is used to produce Gaussian tail bounds for each of the D_ii and other similar quantities.

For any symmetric matrix M, define λ(M) to be the eigenvalues of M and for any interval S ⊆ R, define

λ_S(M) = { λ(M) ∩ S }.

Further, define λ_1^(n) ≥ ··· ≥ λ_n^(n) to be the elements of λ(𝓛^(n) 𝓛^(n)) and λ̂_1^(n) ≥ ··· ≥ λ̂_n^(n) to be the elements of λ(L^(n) L^(n)). The eigenvalues of L^(n) L^(n) converge in the following sense,

max_i |λ̂_i^(n) − λ_i^(n)| ≤ ‖L^(n) L^(n) − 𝓛^(n) 𝓛^(n)‖_F = o( log n / (τ_n^2 n^{1/2}) )  a.s.   (2.2)

This follows from Theorem 2.1, Weyl's inequality [Bhatia (1987)], and the fact that the Frobenius norm is an upper bound of the spectral norm.

This shows that under certain conditions on τ_n, the eigenvalues of L^(n) L^(n) converge to the eigenvalues of 𝓛^(n) 𝓛^(n). In order to study spectral clustering, it is now necessary to show that the eigenvectors also converge. The Davis–Kahan theorem provides a bound for this.

PROPOSITION 2.1 (Davis–Kahan). Let S ⊆ R be an interval. Denote 𝒳 as an orthonormal matrix whose column space is equal to the eigenspace of 𝓛𝓛 corresponding to the eigenvalues in λ_S(𝓛𝓛) [more formally, the column space of 𝒳 is the image of the spectral projection of 𝓛𝓛 induced by λ_S(𝓛𝓛)]. Denote by X the analogous quantity for LL. Define the distance between S and the spectrum of 𝓛𝓛 outside of S as

δ = min{ |λ − s| ; λ eigenvalue of 𝓛𝓛, λ ∉ S, s ∈ S }.

Page 68:

The two sets of clusters in Di-Sim estimate the two partitions in the Directed Stochastic Blockmodel

[Excerpt continued:]

If 𝒳 and X are of the same dimension, then there is an orthonormal matrix O, that depends on 𝒳 and X, such that

(1/2) ‖X − 𝒳 O‖_F^2 ≤ ‖LL − 𝓛𝓛‖_F^2 / δ^2.

The original Davis–Kahan theorem bounds the "canonical angle," also known as the "principal angle," between the column spaces of 𝒳 and X. Appendix B explains how this can be converted into the bound stated above. To understand why the orthonormal matrix O is included, imagine the situation that L = 𝓛. In this case X is not necessarily equal to 𝒳. At a minimum, the columns of X could be a permuted version of those in 𝒳. If there are any eigenvalues with multiplicity greater than one, these problems could be slightly more involved. The matrix O removes these inconveniences and related inconveniences.

The bound in the Davis–Kahan theorem is sensitive to the value δ. This reflects that when there are eigenvalues of 𝓛𝓛 close to S, but not inside of S, then a small perturbation can move these eigenvalues inside of S and drastically alter the eigenvectors. The next theorem combines the previous results to show that the eigenvectors of L^(n) converge to the eigenvectors of 𝓛^(n). Because it is asymptotic in the number of nodes, it is important to allow S and δ to depend on n. For a sequence of open intervals S_n ⊆ R, define

δ_n = inf{ |λ − s| ; λ ∈ λ(𝓛^(n) 𝓛^(n)), λ ∉ S_n, s ∈ S_n },   (2.3)
δ'_n = inf{ |λ − s| ; λ ∈ λ_{S_n}(𝓛^(n) 𝓛^(n)), s ∉ S_n },   (2.4)
S'_n = { λ ; λ^2 ∈ S_n }.   (2.5)

The quantity δ'_n is added to measure how well S_n insulates the eigenvalues of interest. If δ'_n is too small, then some important empirical eigenvalues might fall outside of S_n. By restricting the rate at which δ_n and δ'_n converge to zero, the next theorem ensures the dimensions of X and 𝒳 agree for a large enough n. This is required in order to use the Davis–Kahan theorem.

THEOREM 2.2. Define W^(n) ∈ {0,1}^{n×n} to be a sequence of growing random adjacency matrices from the latent space model with population matrices 𝒲^(n). With W^(n), define the observed graph Laplacian L^(n) as in (1.2). Let 𝓛^(n) be the population version of L^(n) as defined in (1.3). Define τ_n as in (2.1). With a sequence of open intervals S_n ⊆ R, define δ_n, δ'_n and S'_n as in (2.3), (2.4) and (2.5).

Let k_n = |λ_{S'_n}(L^(n))|, the size of the set λ_{S'_n}(L^(n)). Define the matrix X_n ∈ R^{n×k_n} such that its orthonormal columns are the eigenvectors of symmetric matrix L^(n) corresponding to all the eigenvalues contained in λ_{S'_n}(L^(n)). For K_n = |λ_{S'_n}(𝓛^(n))|, define 𝒳_n ∈ R^{n×K_n} to be the analogous matrix for symmetric matrix 𝓛^(n) with eigenvalues in λ_{S'_n}(𝓛^(n)).

Assume that n^{−1/2} (log n)^2 = O(min{δ_n, δ'_n}). Also assume that there exists positive integer N such that for all n > N, it follows that τ_n^2 > 2/log n.

Eventually, k_n = K_n. Afterward, for some sequence of orthonormal rotations O_n,

‖X_n − 𝒳_n O_n‖_F = o( log n / (δ_n τ_n^2 n^{1/2}) )  a.s.

A proof of Theorem 2.2 is in Appendix C. There are two key assumptions in Theorem 2.2:

(1) n^{−1/2} (log n)^2 = O(min{δ_n, δ'_n}),
(2) τ_n^2 > 2/log n.

The first assumption ensures that the "eigengap," the gap between the eigenvalues of interest and the rest of the eigenvalues, does not converge to zero too quickly. The theorem is most interesting when S includes only the leading eigenvalues. This is because the eigenvectors with the largest eigenvalues have the potential to reveal clusters or other structures in the network. When these leading eigenvalues are well separated from the smaller eigenvalues, the eigengap is large. The second assumption ensures that the expected degree of each node grows sufficiently fast. If τ_n is constant, then the expected degree of each node grows linearly. The assumption τ_n^2 > 2/log n is almost as restrictive.

The usefulness of Theorem 2.2 depends on how well the eigenvectors of 𝓛^(n) represent the characteristics of interest in the network. For example, under the Stochastic Blockmodel with B full rank, if S_n is chosen so that S'_n contains all nonzero eigenvalues of 𝓛^(n), then the block structure can be determined from the columns of 𝒳_n. It can be shown that nodes i and j are in the same block if and only if the ith row of 𝒳_n equals the jth row. The next section examines how spectral clustering exploits this structure, using X_n to estimate the block structure in the Stochastic Blockmodel.

3. The Stochastic Blockmodel. The work of Leskovec et al. (2008) shows that the sizes of the best clusters are not very large in a diverse set of empirical networks, suggesting that the appropriate asymptotic framework should allow for the number of communities to grow with the number of nodes. This section shows that, under suitable conditions, spectral clustering can correctly partition most of the nodes in the Stochastic Blockmodel, even when the number of blocks grows with the number of nodes.

The Stochastic Blockmodel, introduced by Holland, Laskey and Leinhardt (1983), is a specific latent space model. Because it has well-defined communities in the model, community detection can be framed as a problem of statistical estimation. The important assumption of this model is that of stochastic equivalence within the blocks; if two nodes i and j are in the same block, rows i and j of 𝒲 are equal.

Page 69:

The two sets of clusters in Di-Sim estimate the two partitions in the Directed Stochastic Blockmodel

69

DIRECTED SPECTRAL CLUSTERING AND A DIRECTED STOCHASTIC BLOCKMODEL
KARL ROHE AND BIN YU

Under the Directed Stochastic Blockmodel with k_n blocks, define 𝓛 = E(O)^{−1/2} E(A) E(P)^{−1/2}. Define σ_1 ≥ σ_2 ≥ ··· ≥ σ_{k_n} > 0 as the k_n nonzero singular values of 𝓛. Define

τ_n = min_{i=1,...,n} min{ E(P_ii), E(O_ii) } / n.

Assume there exists N such that for all n > N, τ_n^2 > 2/log n. Define Pop = max_{j=1,...,k} (Z^T Z)_{jj}. If n^{−1/2} (log n)^2 = O(σ_{k_n}), then

|M| = o( Pop (log n)^2 / (n σ_{k_n}^2 τ_n^4) ).

Page 70:

70

Visualizing L^T L and L L^T for Enron

[Images of L^T L and L L^T for the Enron data, with rows and columns ordered by the singular vectors u2–u5 (left) and v2–v5 (right). In these images, the color gradient is on a log scale.]