Absorbing Random Walk Centrality: Theory and Algorithms. Harry Mavroforakis, Boston University; Michael Mathioudakis, Aalto University; Aristides Gionis, Aalto University. Helsinki, September 15th 2015


Page 1: Absorbing Random Walk Centrality

Absorbing Random Walk Centrality: Theory and Algorithms


Harry Mavroforakis, Boston University
Michael Mathioudakis, Aalto University
Aristides Gionis, Aalto University

Helsinki, September 15th 2015

Page 2: Absorbing Random Walk Centrality


submit query to Twitter

e.g. '#ferguson'

want to see messages from few, central users

many and different groups of people might be posting about it

Page 3: Absorbing Random Walk Centrality

one approach...

represent activity with a graph

users in the results (query nodes)

other users

connections

Page 4: Absorbing Random Walk Centrality


select k central query nodes

what is 'central'? many measures have been studied

here, we use random walks for robustness

absorbing random walk centrality

Page 5: Absorbing Random Walk Centrality

absorbing random walk centrality

model
start: from query node q w.p. s(q)
re-start: with probability α at each step
transitions: follow edges at random from one node to another
absorption: we can designate a set of absorbing nodes; no escape from absorbing nodes

centrality of node set S: expected time (number of steps) until absorbed by S
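As a concrete illustration of the model, the absorption time of a candidate set can be estimated by direct simulation. The paper computes it exactly by matrix inversion, so the Monte Carlo sketch below, with a made-up toy graph and illustrative names, is only meant to make the definitions above tangible; whether a restart counts as a step is a modeling detail the slide does not pin down, and here it does.

```python
import random

def absorption_time(adj, query, absorbing, alpha=0.15, walks=20000, max_steps=10000):
    """Monte Carlo estimate of the expected number of steps until a walk
    started from the query nodes (uniform start probabilities s) is
    absorbed by the node set `absorbing`. adj maps node -> neighbor list."""
    total = 0
    for _ in range(walks):
        node = random.choice(query)        # start from query node q w.p. s(q)
        steps = 0
        while node not in absorbing and steps < max_steps:
            if random.random() < alpha:    # re-start with probability alpha
                node = random.choice(query)
            else:                          # follow an edge at random
                node = random.choice(adj[node])
            steps += 1
        total += steps
    return total / walks

# toy path graph 0-1-2-3-4: query nodes at the ends, absorbing node in the middle
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(absorption_time(adj, query=[0, 4], absorbing={2}))   # roughly 5 on this toy
```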

Page 6: Absorbing Random Walk Centrality

problem

input: graph G, query nodes Q, α & candidate nodes (e.g. query nodes or all nodes)

select: k candidate nodes with minimum absorption time


Page 7: Absorbing Random Walk Centrality


select nodes that are central w.r.t. the query nodes

k  =  1  

Page 8: Absorbing Random Walk Centrality


select nodes that are central w.r.t. the query nodes

k  =  3  

Page 9: Absorbing Random Walk Centrality


select nodes that are central w.r.t. the query nodes

k ≥ |Q|

Page 10: Absorbing Random Walk Centrality

outline  

•  complexity
•  greedy algorithm
   – naive greedy
   – speed-up
•  heuristics & baselines
•  empirical evaluation


Page 11: Absorbing Random Walk Centrality

complexity  

the problem is NP-hard: reduction from vertex cover


Page 12: Absorbing Random Walk Centrality

approximation

centrality measure
monotonicity: absorption time decreases with more absorbing nodes
supermodularity: diminishing returns


Page 13: Absorbing Random Walk Centrality

approximation

centrality gain
mc: min centrality for k = 1
gain = mc − centrality, for k > 1

non-negative, non-decreasing, submodular

S ⊆ S ∪ {u} ⊆ S ∪ {u, v}

greedy algorithm: (1 − 1/e)-approximation guarantee for gain

Page 14: Absorbing Random Walk Centrality

greedy

S = empty
for i = 1..k
  for u in V − S
    calculate centrality of S ∪ {u}   (*)
  update S := S ∪ {best u}

bottleneck is in line (*): one matrix inversion (super-quadratic) for step (*)
use Sherman-Morrison to perform (*) in O(n²), with one (1) inversion for the first node

however... still O(kn³)
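A minimal NumPy sketch of this greedy loop, under stated assumptions: the restart is folded into the transition matrix, starting probabilities are uniform over the query nodes, and absorbing a node means zeroing its row. The names are made up, and for simplicity the first round performs one inversion per candidate, whereas the slide's speed-up needs only a single inversion overall; every later evaluation is a Sherman-Morrison rank-one update in O(n²).

```python
import numpy as np

def absorption_time(F, s):
    # ac_Q(S) = s^T F 1: expected number of steps until absorption
    return s @ F.sum(axis=1)

def make_absorbing(F, P, u):
    # Sherman-Morrison: zeroing row u of P adds e_u p_u^T to A = I - P, so
    # F' = F - (F e_u)(p_u^T F) / (1 + p_u^T F e_u), an O(n^2) update.
    a, b = F[:, u], P[u]
    return F - np.outer(a, b @ F) / (1.0 + b @ a)

def greedy(P, s, k, candidates):
    n = P.shape[0]
    S, F = [], None
    for _ in range(k):
        best = (np.inf, None, None)
        for u in candidates:
            if u in S:
                continue
            if F is None:                  # first round: direct inversion
                Pu = P.copy(); Pu[u] = 0.0
                Fu = np.linalg.inv(np.eye(n) - Pu)
            else:                          # later rounds: O(n^2) update of F
                Fu = make_absorbing(F, P, u)
            t = absorption_time(Fu, s)
            if t < best[0]:
                best = (t, u, Fu)
        S.append(best[1]); F = best[2]
    return S

# toy star graph: hub 0, leaves 1..4 are the query nodes
n, alpha = 5, 0.15
W = np.zeros((n, n))
for leaf in range(1, n):
    W[0, leaf] = 0.25                      # hub walks to a random leaf
    W[leaf, 0] = 1.0                       # each leaf walks to the hub
r = np.zeros(n); r[1:] = 0.25              # start/restart uniform over Q
P = (1 - alpha) * W + alpha * np.outer(np.ones(n), r)
print(greedy(P, r, 2, range(n)))           # the hub 0 is picked first
```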

Page 15: Absorbing Random Walk Centrality

heuristics & baselines

•  personalized PageRank with the same α
•  spectralQ & spectralD
   – spectral embedding of nodes
   – k-means on the embedding of query nodes
   – select candidates close to the centers
   – spectralQ selects more nodes from larger clusters
•  spectralC
   – similar to spectralQ but clustering on candidates
•  degree & distance centrality
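The first baseline can be sketched with NetworkX (which the paper's implementation also uses); the wrapper below and its names are illustrative assumptions, not the authors' code. Note that NetworkX's `alpha` is the damping (continuation) probability, so the restart probability α of the slides corresponds to `alpha = 1 − α`.

```python
import networkx as nx

def ppr_baseline(G, query, k, alpha=0.15, candidates=None):
    """Rank candidates by personalized PageRank with restart distribution
    uniform over the query nodes, and return the top-k."""
    personalization = {v: (1.0 / len(query) if v in query else 0.0) for v in G}
    # networkx's alpha is the continue probability, i.e. 1 - restart prob
    scores = nx.pagerank(G, alpha=1 - alpha, personalization=personalization)
    pool = candidates if candidates is not None else list(G)
    return sorted(pool, key=lambda v: scores[v], reverse=True)[:k]

G = nx.karate_club_graph()
print(ppr_baseline(G, query=[0, 33], k=3))
```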

Page 16: Absorbing Random Walk Centrality

evaluation


TABLE I: Dataset statistics

small dataset    |V|   |E|
karate            34    78
dolphins          62   159
lesmis            77   254
adjnoun          112   425
football         115   613

large dataset     |V|    |E|
kddCoauthors     2891   2891
livejournal      3645   4141
ca-GrQc          5242  14496
ca-HepTh         9877  25998
roadnet         10199  13932
oregon-1        11174  23409

Degree and distance centrality. Finally, we consider the standard degree and distance centrality measures.
Degree returns the k highest-degree nodes. Note that this baseline is oblivious to the query nodes Q.
Distance returns the k nodes with highest distance centrality with respect to the query nodes Q. The distance centrality of a node u is defined as dc(u) = (Σ_{v∈Q} d(u, v))^{−1}.
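The Distance baseline above can be sketched in a few lines with NetworkX; a hedged sketch in which the function names are illustrative and no tie-breaking or disconnected-pair handling is attempted.

```python
import networkx as nx

def distance_centrality(G, u, query):
    # dc(u) = ( sum_{v in Q} d(u, v) )^{-1}, with d a shortest-path distance
    return 1.0 / sum(nx.shortest_path_length(G, u, v) for v in query)

def distance_baseline(G, query, k, candidates=None):
    # return the k candidates with highest distance centrality w.r.t. Q
    pool = candidates if candidates is not None else list(G)
    return sorted(pool, key=lambda u: distance_centrality(G, u, query),
                  reverse=True)[:k]

G = nx.karate_club_graph()
print(distance_baseline(G, query=[0, 33], k=3))
```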

VII. EXPERIMENTAL EVALUATION

A. Datasets

We evaluate the algorithms described in Section VI on two sets of real graphs: one set of small graphs, which allows us to compare the performance of the fast heuristics against the greedy approach; and one set of larger graphs, to compare the performance of the heuristics against each other at larger scale. Note that the bottleneck of the computation lies in the evaluation of centrality. Even though the technique we describe in Section IV-A allows it to scale to datasets of tens of thousands of nodes on a single processor, it is still prohibitively expensive for massive graphs. Still, our experimentation allows us to discover the traits of the different algorithms and understand what performance to anticipate when they are employed on graphs of massive size.

The datasets are listed in Table I. Small graphs are obtained from Mark Newman's repository,1 larger graphs from SNAP.2 For kddCoauthors, livejournal, and roadnet we use samples of the original datasets. In the interest of repeatability, our code and datasets are made available anonymously.3

B. Evaluation Methodology

Each experiment in our evaluation framework is defined by a graph G, a set of query nodes Q, a set of candidate nodes D, and an algorithm to solve the problem. We evaluate all algorithms presented in Section VI. For the set of candidate nodes D, we consider two cases: we either set it to be equal to the set of query nodes, i.e. D = Q, or to the set of all nodes, i.e. D = V.

The query nodes Q are selected randomly, using the following process. First, we select a set S of s seed nodes, uniformly at random. Then, we select a ball B(v, r) of predetermined radius r = 2 around each seed v ∈ S.4 Finally, from all balls, we select a set of query nodes Q of predetermined size q, with q = 10 and q = 20, respectively, for the small and larger datasets. Selection is done uniformly at random.

1 http://www-personal.umich.edu/%7Emejn/netdata/
2 http://snap.stanford.edu/data/index.html
3 https://db.tt/gbCEfIhb
4 For the planar roadnet dataset we use r = 3.

Finally, the restart probability α is set to α = 0.15 and the starting probabilities s are uniform over Q.
Implementation. All algorithms are implemented in Python with extensive use of the NetworkX package,5 and run on an Intel Xeon 2.83 GHz processor, on a machine with 32 GB RAM.

C. Results

Figure 1 shows the centrality scores achieved by the different algorithms on the small graphs for varying k (note: lower is better). Due to space constraints, we show results on three out of the five small datasets we experimented with; the results on the other two datasets are similar. We present two settings: on the left, the candidates are all nodes (D = V), and on the right, the candidates are only the query nodes (D = Q). We observe that PPR tracks well (and sometimes surpasses) the quality of solutions returned by Greedy, while Degree and Distance often come close to that. The spectral algorithms do not perform as well.

Figure 2 is similar to Figure 1, but shows results on the larger datasets, not including Greedy. For the case where all nodes are candidates, PPR typically has the best performance, followed by Distance, while Degree is unreliable. The spectral algorithms typically perform worse than PPR.

For the case where only query nodes are candidates, all algorithms demonstrate similar performance, which is most often worse than in the previous setting. Both observations can be explained by the fact that the selection is heavily restricted by the requirement D = Q, so there is not much flexibility for the best-performing algorithms to produce a better solution.

In terms of running time on the larger graphs, Distance executes in a few minutes (with observed times between 15 seconds and 5 minutes), while all other heuristics execute within seconds (all observed times were less than 1 minute). Finally, even though Greedy executes within 1-2 seconds for the small datasets, it does not scale well to the larger datasets (its running time is orders of magnitude worse than the heuristics' and it is not included in those experiments).

Based on the above, we conclude that PPR offers the best trade-off of quality versus running time for datasets of at least moderate size (more than 10k nodes).

VIII. CONCLUSIONS

In this paper, we have addressed the problem of finding central nodes in a graph with respect to a set of query nodes Q. Our measure is based on absorbing random walks: we seek to compute k nodes that minimize the expected number of steps that a random walk starting from the query nodes will need before it is "absorbed" by them. We have shown that the problem is NP-hard and described an O(kn³) greedy algorithm to solve it approximately. Moreover, we experimented with heuristic algorithms to solve the problem on large graphs. Our results show that in practice personalized PageRank offers a good combination of quality and speed.

5 http://networkx.github.io/

data

cannot run greedy on these

Page 17: Absorbing Random Walk Centrality

input

graphs from the previous datasets

query nodes: planted spheres
k spheres (k = 1)
radius ρ (ρ = 2 or 3)
s special nodes inside spheres (s = 10 or 20)

α = 0.15

Page 18: Absorbing Random Walk Centrality

small  graphs  

dolphins

Page 19: Absorbing Random Walk Centrality

small  graphs  

adjnoun

Page 20: Absorbing Random Walk Centrality

small  graphs  

karate

Page 21: Absorbing Random Walk Centrality

large  graphs  

oregon

Page 22: Absorbing Random Walk Centrality

large  graphs  

livejournal

Page 23: Absorbing Random Walk Centrality

large  graphs  

roadnet

Page 24: Absorbing Random Walk Centrality

conclusions  

complex problem
greedy algorithm is expensive
personalized PageRank is a good alternative

future work
expansion strategies
comparison with more alternatives
parallel implementation

Page 25: Absorbing Random Walk Centrality

The End


Page 26: Absorbing Random Walk Centrality


where the inequality comes from the fact that a path in G_X passing through Z and being absorbed by X corresponds to a shorter path in G_Y being absorbed by Y.

B. Proposition 5

Proposition. Let C_{i−1} be a set of i − 1 absorbing nodes, P_{i−1} the corresponding transition matrix, and F_{i−1} = (I − P_{i−1})^{−1}. Let C_i = C_{i−1} ∪ {u}. Given F_{i−1}, the centrality score ac_Q(C_i) can be computed in time O(n²).

The proof makes use of the following lemma.

Lemma 1 (Sherman-Morrison Formula [7]). Let M be a square n × n invertible matrix and M^{−1} its inverse. Moreover, let a and b be any two column vectors of size n. Then the following equation holds:

(M + a b^T)^{−1} = M^{−1} − M^{−1} a b^T M^{−1} / (1 + b^T M^{−1} a).

Proof (Proposition 5): Without loss of generality, let the set of absorbing nodes be C_{i−1} = {1, 2, …, i−1}. For simplicity, assume no self-loops for non-absorbing nodes, i.e., P_{u,u} = 0 for u ∈ V − C_{i−1}. As in Section VI, the expected number of steps before absorption is given by the formula

ac_Q(C_{i−1}) = s_Q^T F_{i−1} 1,  with F_{i−1} = A_{i−1}^{−1} and A_{i−1} = I − P_{i−1}.

We proceed to show how to increase the set of absorbing nodes by one and calculate the new absorption time by updating F_{i−1} in O(n²). Without loss of generality, suppose we add node i to the absorbing nodes C_{i−1}, so that

C_i = C_{i−1} ∪ {i} = {1, 2, …, i−1, i}.

Let P_i be the transition matrix over G with absorbing nodes C_i. As before, the expected absorption time by the nodes C_i is given by

ac_Q(C_i) = s_Q^T F_i 1,  with F_i = A_i^{−1} and A_i = I − P_i.

Notice that

A_i − A_{i−1} = (I − P_i) − (I − P_{i−1}) = P_{i−1} − P_i
             = [ 0_{(i−1)×n} ; p_{i,1} … p_{i,n} ; 0_{(n−i)×n} ] = a b^T,

where p_{i,j} denotes the transition probability from node i to node j in the transition matrix P_{i−1}, and the column vectors a and b are defined as

a = [ 0 … 0 (i−1 times)  1  0 … 0 (n−i times) ]^T, and
b = [ p_{i,1} … p_{i,n} ]^T.

By a direct application of Lemma 1, it is easy to see that we can compute F_i from F_{i−1} with the following formula, at a cost of O(n²) operations:

F_i = F_{i−1} − (F_{i−1} a)(b^T F_{i−1}) / (1 + b^T (F_{i−1} a)).

We have thus shown that, given F_{i−1}, we can compute F_i, and therefore ac_Q(C_i) as well, in O(n²).
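The claimed O(n²) update is easy to sanity-check numerically. The sketch below (illustrative, on a random row-stochastic matrix, with made-up names) compares the Sherman-Morrison update of Proposition 5 against a direct inversion.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)       # row-stochastic transition matrix

Pprev = P.copy(); Pprev[0] = 0.0        # C_{i-1} = {0}: zero the absorbed row
Fprev = np.linalg.inv(np.eye(n) - Pprev)

u = 2                                   # new absorbing node
a = Fprev[:, u]                         # F_{i-1} e_u
b = Pprev[u]                            # p_u: row u, untouched so far
F_sm = Fprev - np.outer(a, b @ Fprev) / (1.0 + b @ a)   # O(n^2) update

Pnew = Pprev.copy(); Pnew[u] = 0.0
F_direct = np.linalg.inv(np.eye(n) - Pnew)
print(np.allclose(F_sm, F_direct))      # the rank-one update matches: True
```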

C. Proposition 6

Proposition. Let C be a set of absorbing nodes, P the corresponding transition matrix, and F = (I − P)^{−1}. Let C′ = C − {v} ∪ {u}, with v ∈ C and u ∉ C. Given F, the centrality score ac_Q(C′) can be computed in time O(n²).

Proof: The proof is similar to the proof of Proposition 5. Again we assume no self-loops for non-absorbing nodes, i.e., P_{u,u} = 0 for u ∈ V − C. Without loss of generality, let the two sets of absorbing nodes be

C = {1, 2, …, i−1, i}, and
C′ = {1, 2, …, i−1, i+1}.

Let P′ be the transition matrix with absorbing nodes C′. The absorbing centrality for the two sets of absorbing nodes C and C′ is expressed as a function of the following two matrices:

F = A^{−1}, with A = I − P, and
F′ = A′^{−1}, with A′ = I − P′.

Notice that

A′ − A = (I − P′) − (I − P) = P − P′
       = [ 0_{(i−1)×n} ; −p_{i,1} … −p_{i,n} ; p_{i+1,1} … p_{i+1,n} ; 0_{(n−i−1)×n} ]
       = a₂ b₂^T − a₁ b₁^T,

where p_{i,j} denotes the transition probability from node i to node j in a transition matrix in which neither node i nor node i+1 is absorbing, and the column vectors a₁, b₁, a₂, b₂ are defined as

a₁ = [ 0 … 0 (i−1 times)  1  0  0 … 0 (n−i−1 times) ]^T,
b₁ = [ p_{i,1} … p_{i,n} ]^T,
a₂ = [ 0 … 0 (i−1 times)  0  1  0 … 0 (n−i−1 times) ]^T,
b₂ = [ p_{i+1,1} … p_{i+1,n} ]^T.

By an argument similar to the one we made in the proof of Proposition 5, we can compute F′ from F in the following two steps, each costing O(n²) operations for the provided parenthesization:

Z  = F − (F a₂)(b₂^T F) / (1 + b₂^T (F a₂)),
F′ = Z + (Z a₁)(b₁^T Z) / (1 − b₁^T (Z a₁)).

We have thus shown that, given F, we can compute F′, and therefore ac_Q(C′) as well, in time O(n²).
