Absorbing Random Walk Centrality: Theory and Algorithms. Harry Mavroforakis, Boston University; Michael Mathioudakis, Aalto University; Aristides Gionis, Aalto University. Helsinki, September 15th 2015


Page 1: Absorbing Random Walk Centrality

Absorbing Random Walk Centrality: Theory and Algorithms


Harry Mavroforakis, Boston University
Michael Mathioudakis, Aalto University
Aristides Gionis, Aalto University

Helsinki, September 15th 2015

Page 2: Absorbing Random Walk Centrality


submit query to Twitter

e.g. '#ferguson'

want to see messages from few, central users

many and different groups of people might be posting about it

Page 3: Absorbing Random Walk Centrality

one approach...

represent activity with a graph

users in the results (query nodes)

other users

connections

Page 4: Absorbing Random Walk Centrality


select k central query nodes

what is 'central'? many measures have been studied

here, we use random walks for robustness

absorbing random walk centrality

Page 5: Absorbing Random Walk Centrality

absorbing random walk centrality

model
start: from query node q w.p. s(q)
re-start: with probability α at each step
transitions: follow edges at random from one node to another
absorption: we can designate a set of absorbing nodes; no escape from absorbing nodes

centrality of node set S: expected time (number of steps) until absorbed by S
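As a concrete illustration of the model, the absorption time of a candidate set can be estimated by direct simulation. The paper computes it exactly by matrix inversion, so the Monte Carlo sketch below, with a made-up toy graph and illustrative names, is only meant to make the definitions above tangible; whether a restart counts as a step is a modeling detail the slide does not pin down, and here it does.

```python
import random

def absorption_time(adj, query, absorbing, alpha=0.15, walks=20000, max_steps=10000):
    """Monte Carlo estimate of the expected number of steps until a walk
    started from the query nodes (uniform start probabilities s) is
    absorbed by the node set `absorbing`. adj maps node -> neighbor list."""
    total = 0
    for _ in range(walks):
        node = random.choice(query)        # start from query node q w.p. s(q)
        steps = 0
        while node not in absorbing and steps < max_steps:
            if random.random() < alpha:    # re-start with probability alpha
                node = random.choice(query)
            else:                          # follow an edge at random
                node = random.choice(adj[node])
            steps += 1
        total += steps
    return total / walks

# toy path graph 0-1-2-3-4: query nodes at the ends, absorbing node in the middle
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(absorption_time(adj, query=[0, 4], absorbing={2}))   # roughly 5 on this toy
```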

Page 6: Absorbing Random Walk Centrality

problem

input: graph G, query nodes Q, α & candidate nodes (e.g. query nodes or all nodes)

select: k candidate nodes with minimum absorption time


Page 7: Absorbing Random Walk Centrality


select nodes that are central w.r.t. the query nodes

k  =  1  

Page 8: Absorbing Random Walk Centrality


select nodes that are central w.r.t. the query nodes

k  =  3  

Page 9: Absorbing Random Walk Centrality


select nodes that are central w.r.t. the query nodes

k ≥ |Q|

Page 10: Absorbing Random Walk Centrality

outline  

•  complexity
•  greedy algorithm
   – naive greedy
   – speed-up
•  heuristics & baselines
•  empirical evaluation


Page 11: Absorbing Random Walk Centrality

complexity  

the problem is NP-hard: reduction from vertex cover


Page 12: Absorbing Random Walk Centrality

approximation

centrality measure
monotonicity: absorption time decreases with more absorbing nodes
supermodularity: diminishing returns


Page 13: Absorbing Random Walk Centrality

approximation

centrality gain
mc: min centrality for k = 1
gain = mc − centrality, for k > 1

non-negative, non-decreasing, submodular

S ⊆ S ∪ {u} ⊆ S ∪ {u, v}

greedy algorithm: (1 − 1/e)-approximation guarantee for gain

Page 14: Absorbing Random Walk Centrality

greedy

S = empty
for i = 1..k
  for u in V − S
    calculate centrality of S ∪ {u}   (*)
  update S := S ∪ {best u}

bottleneck is in line (*): one matrix inversion (super-quadratic) for step (*)
use Sherman-Morrison to perform (*) in O(n²), with one (1) inversion for the first node

however... still O(kn³)
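A minimal NumPy sketch of this greedy loop, under stated assumptions: the restart is folded into the transition matrix, starting probabilities are uniform over the query nodes, and absorbing a node means zeroing its row. The names are made up, and for simplicity the first round performs one inversion per candidate, whereas the slide's speed-up needs only a single inversion overall; every later evaluation is a Sherman-Morrison rank-one update in O(n²).

```python
import numpy as np

def absorption_time(F, s):
    # ac_Q(S) = s^T F 1: expected number of steps until absorption
    return s @ F.sum(axis=1)

def make_absorbing(F, P, u):
    # Sherman-Morrison: zeroing row u of P adds e_u p_u^T to A = I - P, so
    # F' = F - (F e_u)(p_u^T F) / (1 + p_u^T F e_u), an O(n^2) update.
    a, b = F[:, u], P[u]
    return F - np.outer(a, b @ F) / (1.0 + b @ a)

def greedy(P, s, k, candidates):
    n = P.shape[0]
    S, F = [], None
    for _ in range(k):
        best = (np.inf, None, None)
        for u in candidates:
            if u in S:
                continue
            if F is None:                  # first round: direct inversion
                Pu = P.copy(); Pu[u] = 0.0
                Fu = np.linalg.inv(np.eye(n) - Pu)
            else:                          # later rounds: O(n^2) update of F
                Fu = make_absorbing(F, P, u)
            t = absorption_time(Fu, s)
            if t < best[0]:
                best = (t, u, Fu)
        S.append(best[1]); F = best[2]
    return S

# toy star graph: hub 0, leaves 1..4 are the query nodes
n, alpha = 5, 0.15
W = np.zeros((n, n))
for leaf in range(1, n):
    W[0, leaf] = 0.25                      # hub walks to a random leaf
    W[leaf, 0] = 1.0                       # each leaf walks to the hub
r = np.zeros(n); r[1:] = 0.25              # start/restart uniform over Q
P = (1 - alpha) * W + alpha * np.outer(np.ones(n), r)
print(greedy(P, r, 2, range(n)))           # the hub 0 is picked first
```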

Page 15: Absorbing Random Walk Centrality

heuristics & baselines

•  personalized PageRank with the same α
•  spectralQ & spectralD
   – spectral embedding of nodes
   – k-means on the embedding of query nodes
   – select candidates close to the centers
   – spectralQ selects more nodes from larger clusters
•  spectralC
   – similar to spectralQ but clustering on candidates
•  degree & distance centrality
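The first baseline can be sketched with NetworkX (which the paper's implementation also uses); the wrapper below and its names are illustrative assumptions, not the authors' code. Note that NetworkX's `alpha` is the damping (continuation) probability, so the restart probability α of the slides corresponds to `alpha = 1 − α`.

```python
import networkx as nx

def ppr_baseline(G, query, k, alpha=0.15, candidates=None):
    """Rank candidates by personalized PageRank with restart distribution
    uniform over the query nodes, and return the top-k."""
    personalization = {v: (1.0 / len(query) if v in query else 0.0) for v in G}
    # networkx's alpha is the continue probability, i.e. 1 - restart prob
    scores = nx.pagerank(G, alpha=1 - alpha, personalization=personalization)
    pool = candidates if candidates is not None else list(G)
    return sorted(pool, key=lambda v: scores[v], reverse=True)[:k]

G = nx.karate_club_graph()
print(ppr_baseline(G, query=[0, 33], k=3))
```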

Page 16: Absorbing Random Walk Centrality

evaluation


TABLE I: Dataset statistics

small dataset    |V|   |E|
karate            34    78
dolphins          62   159
lesmis            77   254
adjnoun          112   425
football         115   613

large dataset     |V|    |E|
kddCoauthors     2891   2891
livejournal      3645   4141
ca-GrQc          5242  14496
ca-HepTh         9877  25998
roadnet         10199  13932
oregon-1        11174  23409

Degree and distance centrality. Finally, we consider the standard degree and distance centrality measures.
Degree returns the k highest-degree nodes. Note that this baseline is oblivious to the query nodes Q.
Distance returns the k nodes with highest distance centrality with respect to the query nodes Q. The distance centrality of a node u is defined as dc(u) = (Σ_{v∈Q} d(u, v))^{−1}.
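The Distance baseline above can be sketched in a few lines with NetworkX; a hedged sketch in which the function names are illustrative and no tie-breaking or disconnected-pair handling is attempted.

```python
import networkx as nx

def distance_centrality(G, u, query):
    # dc(u) = ( sum_{v in Q} d(u, v) )^{-1}, with d a shortest-path distance
    return 1.0 / sum(nx.shortest_path_length(G, u, v) for v in query)

def distance_baseline(G, query, k, candidates=None):
    # return the k candidates with highest distance centrality w.r.t. Q
    pool = candidates if candidates is not None else list(G)
    return sorted(pool, key=lambda u: distance_centrality(G, u, query),
                  reverse=True)[:k]

G = nx.karate_club_graph()
print(distance_baseline(G, query=[0, 33], k=3))
```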

VII. EXPERIMENTAL EVALUATION

A. Datasets

We evaluate the algorithms described in Section VI on two sets of real graphs: one set of small graphs, which allows us to compare the performance of the fast heuristics against the greedy approach; and one set of larger graphs, to compare the performance of the heuristics against each other at larger scale. Note that the bottleneck of the computation lies in the evaluation of centrality. Even though the technique we describe in Section IV-A allows it to scale to datasets of tens of thousands of nodes on a single processor, it is still prohibitively expensive for massive graphs. Still, our experimentation allows us to discover the traits of the different algorithms and understand what performance to anticipate when they are employed on graphs of massive size.

The datasets are listed in Table I. Small graphs are obtained from Mark Newman's repository,1 larger graphs from SNAP.2 For kddCoauthors, livejournal, and roadnet we use samples of the original datasets. In the interest of repeatability, our code and datasets are made available anonymously.3

B. Evaluation Methodology

Each experiment in our evaluation framework is defined by a graph G, a set of query nodes Q, a set of candidate nodes D, and an algorithm to solve the problem. We evaluate all algorithms presented in Section VI. For the set of candidate nodes D, we consider two cases: we either set it to be equal to the set of query nodes, i.e. D = Q, or to the set of all nodes, i.e. D = V.

The query nodes Q are selected randomly, using the following process. First, we select a set S of s seed nodes, uniformly at random. Then, we select a ball B(v, r) of predetermined radius r = 2 around each seed v ∈ S.4 Finally, from all balls, we select a set of query nodes Q of predetermined size q, with q = 10 and q = 20, respectively, for the small and larger datasets. Selection is done uniformly at random.

1 http://www-personal.umich.edu/%7Emejn/netdata/
2 http://snap.stanford.edu/data/index.html
3 https://db.tt/gbCEfIhb
4 For the planar roadnet dataset we use r = 3.

Finally, the restart probability α is set to α = 0.15 and the starting probabilities s are uniform over Q.
Implementation. All algorithms are implemented in Python with extensive use of the NetworkX package,5 and run on an Intel Xeon 2.83 GHz processor, on a machine with 32 GB RAM.

C. Results

Figure 1 shows the centrality scores achieved by the different algorithms on the small graphs for varying k (note: lower is better). Due to space constraints, we show results on three out of the five small datasets we experimented with; the results on the other two datasets are similar. We present two settings: on the left, the candidates are all nodes (D = V), and on the right, the candidates are only the query nodes (D = Q). We observe that PPR tracks well (and sometimes surpasses) the quality of solutions returned by Greedy, while Degree and Distance often come close to that. The spectral algorithms do not perform as well.

Figure 2 is similar to Figure 1, but shows results on the larger datasets, not including Greedy. For the case where all nodes are candidates, PPR typically has the best performance, followed by Distance, while Degree is unreliable. The spectral algorithms typically perform worse than PPR.

For the case where only query nodes are candidates, all algorithms demonstrate similar performance, which is most often worse than in the previous setting. Both observations can be explained by the fact that the selection is heavily restricted by the requirement D = Q, so there is not much flexibility for the best-performing algorithms to produce a better solution.

In terms of running time on the larger graphs, Distance executes in a few minutes (with observed times between 15 seconds and 5 minutes), while all other heuristics execute within seconds (all observed times were less than 1 minute). Finally, even though Greedy executes within 1-2 seconds for the small datasets, it does not scale well to the larger datasets (its running time is orders of magnitude worse than the heuristics' and it is not included in those experiments).

Based on the above, we conclude that PPR offers the best trade-off of quality versus running time for datasets of at least moderate size (more than 10k nodes).

VIII. CONCLUSIONS

In this paper, we have addressed the problem of finding central nodes in a graph with respect to a set of query nodes Q. Our measure is based on absorbing random walks: we seek to compute k nodes that minimize the expected number of steps that a random walk starting from the query nodes will need before it is "absorbed" by them. We have shown that the problem is NP-hard and described an O(kn³) greedy algorithm to solve it approximately. Moreover, we experimented with heuristic algorithms to solve the problem on large graphs. Our results show that in practice personalized PageRank offers a good combination of quality and speed.

5 http://networkx.github.io/

data

cannot run greedy on these

Page 17: Absorbing Random Walk Centrality

input

graphs from the previous datasets

query nodes: planted spheres
k spheres (k = 1)
radius ρ (ρ = 2 or 3)
s special nodes inside spheres (s = 10 or 20)

α = 0.15

Page 18: Absorbing Random Walk Centrality

small  graphs  

dolphins

Page 19: Absorbing Random Walk Centrality

small  graphs  

adjnoun

Page 20: Absorbing Random Walk Centrality

small  graphs  

karate

Page 21: Absorbing Random Walk Centrality

large  graphs  

oregon

Page 22: Absorbing Random Walk Centrality

large  graphs  

livejournal

Page 23: Absorbing Random Walk Centrality

large  graphs  

roadnet

Page 24: Absorbing Random Walk Centrality

conclusions  

complex problem
greedy algorithm is expensive
personalized PageRank is a good alternative

future work
expansion strategies
comparison with more alternatives
parallel implementation

Page 25: Absorbing Random Walk Centrality

The End


Page 26: Absorbing Random Walk Centrality


where the inequality comes from the fact that a path in G_X passing through Z and being absorbed by X corresponds to a shorter path in G_Y being absorbed by Y.

B. Proposition 5

Proposition. Let C_{i−1} be a set of i − 1 absorbing nodes, P_{i−1} the corresponding transition matrix, and F_{i−1} = (I − P_{i−1})^{−1}. Let C_i = C_{i−1} ∪ {u}. Given F_{i−1}, the centrality score ac_Q(C_i) can be computed in time O(n²).

The proof makes use of the following lemma.

Lemma 1 (Sherman-Morrison Formula [7]). Let M be a square n × n invertible matrix and M^{−1} its inverse. Moreover, let a and b be any two column vectors of size n. Then the following equation holds:

(M + a b^T)^{−1} = M^{−1} − M^{−1} a b^T M^{−1} / (1 + b^T M^{−1} a).

Proof (Proposition 5): Without loss of generality, let the set of absorbing nodes be C_{i−1} = {1, 2, …, i−1}. For simplicity, assume no self-loops for non-absorbing nodes, i.e., P_{u,u} = 0 for u ∈ V − C_{i−1}. As in Section VI, the expected number of steps before absorption is given by the formula

ac_Q(C_{i−1}) = s_Q^T F_{i−1} 1,  with F_{i−1} = A_{i−1}^{−1} and A_{i−1} = I − P_{i−1}.

We proceed to show how to increase the set of absorbing nodes by one and calculate the new absorption time by updating F_{i−1} in O(n²). Without loss of generality, suppose we add node i to the absorbing nodes C_{i−1}, so that

C_i = C_{i−1} ∪ {i} = {1, 2, …, i−1, i}.

Let P_i be the transition matrix over G with absorbing nodes C_i. As before, the expected absorption time by the nodes C_i is given by

ac_Q(C_i) = s_Q^T F_i 1,  with F_i = A_i^{−1} and A_i = I − P_i.

Notice that

A_i − A_{i−1} = (I − P_i) − (I − P_{i−1}) = P_{i−1} − P_i
             = [ 0_{(i−1)×n} ; p_{i,1} … p_{i,n} ; 0_{(n−i)×n} ] = a b^T,

where p_{i,j} denotes the transition probability from node i to node j in the transition matrix P_{i−1}, and the column vectors a and b are defined as

a = [ 0 … 0 (i−1 times)  1  0 … 0 (n−i times) ]^T, and
b = [ p_{i,1} … p_{i,n} ]^T.

By a direct application of Lemma 1, it is easy to see that we can compute F_i from F_{i−1} with the following formula, at a cost of O(n²) operations:

F_i = F_{i−1} − (F_{i−1} a)(b^T F_{i−1}) / (1 + b^T (F_{i−1} a)).

We have thus shown that, given F_{i−1}, we can compute F_i, and therefore ac_Q(C_i) as well, in O(n²).
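The claimed O(n²) update is easy to sanity-check numerically. The sketch below (illustrative, on a random row-stochastic matrix, with made-up names) compares the Sherman-Morrison update of Proposition 5 against a direct inversion.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)       # row-stochastic transition matrix

Pprev = P.copy(); Pprev[0] = 0.0        # C_{i-1} = {0}: zero the absorbed row
Fprev = np.linalg.inv(np.eye(n) - Pprev)

u = 2                                   # new absorbing node
a = Fprev[:, u]                         # F_{i-1} e_u
b = Pprev[u]                            # p_u: row u, untouched so far
F_sm = Fprev - np.outer(a, b @ Fprev) / (1.0 + b @ a)   # O(n^2) update

Pnew = Pprev.copy(); Pnew[u] = 0.0
F_direct = np.linalg.inv(np.eye(n) - Pnew)
print(np.allclose(F_sm, F_direct))      # the rank-one update matches: True
```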

C. Proposition 6

Proposition. Let C be a set of absorbing nodes, P the corresponding transition matrix, and F = (I − P)^{−1}. Let C′ = C − {v} ∪ {u}, with v ∈ C and u ∉ C. Given F, the centrality score ac_Q(C′) can be computed in time O(n²).

Proof: The proof is similar to the proof of Proposition 5. Again we assume no self-loops for non-absorbing nodes, i.e., P_{u,u} = 0 for u ∈ V − C. Without loss of generality, let the two sets of absorbing nodes be

C = {1, 2, …, i−1, i}, and
C′ = {1, 2, …, i−1, i+1}.

Let P′ be the transition matrix with absorbing nodes C′. The absorbing centrality for the two sets of absorbing nodes C and C′ is expressed as a function of the following two matrices:

F = A^{−1}, with A = I − P, and
F′ = A′^{−1}, with A′ = I − P′.

Notice that

A′ − A = (I − P′) − (I − P) = P − P′
       = [ 0_{(i−1)×n} ; −p_{i,1} … −p_{i,n} ; p_{i+1,1} … p_{i+1,n} ; 0_{(n−i−1)×n} ]
       = a₂ b₂^T − a₁ b₁^T,

where p_{i,j} denotes the transition probability from node i to node j in a transition matrix in which neither node i nor node i+1 is absorbing, and the column vectors a₁, b₁, a₂, b₂ are defined as

a₁ = [ 0 … 0 (i−1 times)  1  0  0 … 0 (n−i−1 times) ]^T,
b₁ = [ p_{i,1} … p_{i,n} ]^T,
a₂ = [ 0 … 0 (i−1 times)  0  1  0 … 0 (n−i−1 times) ]^T,
b₂ = [ p_{i+1,1} … p_{i+1,n} ]^T.

By an argument similar to the one we made in the proof of Proposition 5, we can compute F′ from F in the following two steps, each costing O(n²) operations for the provided parenthesization:

Z  = F − (F a₂)(b₂^T F) / (1 + b₂^T (F a₂)),
F′ = Z + (Z a₁)(b₁^T Z) / (1 − b₁^T (Z a₁)).

We have thus shown that, given F, we can compute F′, and therefore ac_Q(C′) as well, in time O(n²).
