Absorbing Random Walk Centrality: Theory and Algorithms

Harry Mavroforakis, Boston University
Michael Mathioudakis, Aalto University
Aristides Gionis, Aalto University

Helsinki, September 15th, 2015
submit a query to Twitter, e.g. '#ferguson'
want to see messages from few, central users
many and different groups of people might be posting about it
one approach... represent the activity with a graph
users in the results (query nodes)
other users
connections
select k central query nodes
what is 'central'? many measures have been studied
here, we use random walks for robustness:
absorbing random walk centrality
model
start: from query node q with probability s(q)
re-start: with probability α at each step
transitions: follow edges at random from one node to another
absorption: we can designate a set of absorbing nodes; no escape from absorbing nodes
centrality of node set S: expected time (number of steps) until absorbed by S

problem
input: graph G, query nodes Q, restart probability α, and candidate nodes D (e.g. the query nodes or all nodes)
select: k candidate nodes with minimum absorption time
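The absorption time that defines the centrality has a closed form via the fundamental matrix of the absorbing chain: over the transient nodes, the expected steps to absorption are $(I - Q)^{-1}\mathbf{1}$. A minimal NumPy sketch, assuming the restart step has already been folded into a row-stochastic transition matrix P; the function name and the toy chain are illustrative, not from the paper:

```python
import numpy as np

def absorption_time(P, start, absorbing):
    """Expected number of steps until a walk started from
    distribution `start` is absorbed by the nodes in `absorbing`.

    P         : (n, n) row-stochastic transition matrix
    start     : (n,) starting distribution over nodes
    absorbing : iterable of absorbing node indices
    """
    absorbing = set(absorbing)
    n = P.shape[0]
    T = [i for i in range(n) if i not in absorbing]   # transient nodes
    Q = P[np.ix_(T, T)]                               # transitions among transient nodes
    F = np.linalg.inv(np.eye(len(T)) - Q)             # fundamental matrix
    t = F @ np.ones(len(T))                           # expected steps from each transient node
    full = np.zeros(n)                                # steps from absorbing nodes are 0
    full[T] = t
    return start @ full

# toy 3-node path 0 - 1 - 2, absorbing at node 2
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])
s = np.array([1.0, 0.0, 0.0])        # start at node 0
print(absorption_time(P, s, {2}))    # prints 4.0
```

On this chain the first-step equations are E0 = 1 + E1 and E1 = 1 + E0/2, giving E0 = 4, which the closed form reproduces.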
select nodes that are central w.r.t. the query nodes
(illustrated for k = 1, k = 3, and k ≥ |Q|)
outline
• complexity
• greedy algorithm
  – naive greedy
  – speed-up
• heuristics & baselines
• empirical evaluation
complexity
the problem is NP-hard: reduction from vertex cover
approximation
the centrality measure satisfies:
monotonicity: absorption time decreases with more absorbing nodes
supermodularity: diminishing returns (illustrated on the sets S, S ∪ {u}, S ∪ {u, v})

centrality gain: with mc the minimum centrality for k = 1, gain = mc − centrality for k > 1
the gain is non-negative, non-decreasing, and submodular
greedy algorithm: (1 − 1/e)-approximation guarantee for the gain

greedy:
  S = ∅
  for i = 1..k:
    for u in V − S:
      calculate centrality of S ∪ {u}   (*)
    update S := S ∪ {best u}
bottleneck is in line (*): one matrix inversion (super-quadratic) per evaluation
use Sherman-Morrison to perform (*) in O(n²), with one (1) inversion for the first node
however... still O(kn³) overall
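The Sherman-Morrison identity behind the speed-up can be checked numerically: a rank-1 update of a matrix lets us refresh its inverse in O(n²) instead of re-inverting in super-quadratic time. A small sketch with a random well-conditioned matrix (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = np.eye(n) + 0.1 * rng.random((n, n))   # well-conditioned invertible matrix
Minv = np.linalg.inv(M)

# rank-1 update M + a b^T: refresh the inverse in O(n^2)
a = rng.random((n, 1))
b = rng.random((n, 1))
Minv_new = Minv - (Minv @ a) @ (b.T @ Minv) / (1.0 + b.T @ Minv @ a)

# agrees with a from-scratch O(n^3) inversion
print(np.allclose(Minv_new, np.linalg.inv(M + a @ b.T)))
```

The given parenthesization matters: computing (Minv @ a) and (b.T @ Minv) first keeps every operation a matrix-vector or outer product, hence O(n²).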
heuristics & baselines
• personalized PageRank with the same α
• spectralQ & spectralD
  – spectral embedding of nodes
  – k-means on the embedding of query nodes
  – select candidates close to the centers
  – spectralQ selects more nodes from larger clusters
• spectralC
  – similar to spectralQ, but clustering on the candidates
• degree & distance centrality
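The first baseline can be sketched directly with NetworkX's `pagerank`, whose `personalization` argument implements restarts from the query nodes. Note that NetworkX's `alpha` is the damping factor, i.e. 1 − restart probability; the function name and parameters here are illustrative:

```python
import networkx as nx

def ppr_baseline(G, query_nodes, k, restart=0.15):
    """Rank nodes by personalized PageRank restarted from the
    query nodes, and return the top k."""
    personalization = {q: 1.0 / len(query_nodes) for q in query_nodes}
    scores = nx.pagerank(G, alpha=1.0 - restart,
                         personalization=personalization)
    return sorted(scores, key=scores.get, reverse=True)[:k]

G = nx.karate_club_graph()
print(ppr_baseline(G, query_nodes=[0, 33], k=3))
```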
evaluation
TABLE I: Dataset statistics

small dataset    |V|    |E|
karate            34     78
dolphins          62    159
lesmis            77    254
adjnoun          112    425
football         115    613

large dataset      |V|     |E|
kddCoauthors     2,891   2,891
livejournal      3,645   4,141
ca-GrQc          5,242  14,496
ca-HepTh         9,877  25,998
roadnet         10,199  13,932
oregon-1        11,174  23,409
Degree and distance centrality. Finally, we consider the standard degree and distance centrality measures.

Degree returns the k highest-degree nodes. Note that this baseline is oblivious to the query nodes Q.

Distance returns the k nodes with highest distance centrality with respect to the query nodes Q. The distance centrality of a node u is defined as $dc(u) = \left(\sum_{v \in Q} d(u, v)\right)^{-1}$.
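The Distance baseline can be sketched with NetworkX shortest paths. Names and the toy star graph are illustrative; note that for u ∈ Q the term d(u, u) = 0 simply drops out of the sum, and ties are broken arbitrarily by the sort:

```python
import networkx as nx

def distance_baseline(G, query_nodes, k):
    """Return the k nodes with highest distance centrality
    dc(u) = 1 / sum_{v in Q} d(u, v) w.r.t. the query nodes."""
    dc = {}
    for u in G:
        total = sum(nx.shortest_path_length(G, u, v) for v in query_nodes)
        dc[u] = 1.0 / total if total > 0 else float("inf")
    return sorted(dc, key=dc.get, reverse=True)[:k]

# star graph: center 0, leaves 1..4; Q = {1, 2, 3}
G = nx.star_graph(4)
print(distance_baseline(G, [1, 2, 3], 1))  # the center is closest to all of Q
```

On the star, the center scores 1/3 while each query leaf scores 1/4, so the top-1 answer is [0].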
VII. EXPERIMENTAL EVALUATION
A. Datasets
We evaluate the algorithms described in Section VI on two sets of real graphs: one set of small graphs, which allows us to compare the performance of the fast heuristics against the greedy approach; and one set of larger graphs, to compare the performance of the heuristics against each other on datasets of larger scale. Note that the bottleneck of the computation lies in the evaluation of centrality. Even though the technique we describe in Section IV-A allows it to scale to datasets of tens of thousands of nodes on a single processor, it is still prohibitively expensive for massive graphs. Still, our experimentation allows us to discover the traits of the different algorithms and understand what performance to anticipate when they are employed on graphs of massive size.
The datasets are listed in Table I. Small graphs are obtained from Mark Newman's repository,1 larger graphs from SNAP.2 For kddCoauthors, livejournal, and roadnet we use samples of the original datasets. In the interest of repeatability, our code and datasets are made available anonymously.3
B. Evaluation Methodology
Each experiment in our evaluation framework is defined by a graph G, a set of query nodes Q, a set of candidate nodes D, and an algorithm to solve the problem. We evaluate all algorithms presented in Section VI. For the set of candidate nodes D, we consider two cases: we either set it to be equal to the set of query nodes, i.e., D = Q, or to the set of all nodes, i.e., D = V.
The query nodes Q are selected randomly, using the following process: First, we select a set S of s seed nodes, uniformly at random. Then, we select a ball B(v, r) of predetermined radius r = 2 around each seed v ∈ S.4 Finally, from all balls, we select a set of query nodes Q of predetermined size q, with q = 10 and q = 20, respectively, for the small and larger datasets. Selection is done uniformly at random.

1 http://www-personal.umich.edu/%7Emejn/netdata/
2 http://snap.stanford.edu/data/index.html
3 https://db.tt/gbCEfIhb
4 For the planar roadnet dataset we use r = 3.
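The sampling procedure above can be sketched with NetworkX's `ego_graph`, which returns exactly the ball B(v, r); the seed value and example graph are illustrative:

```python
import random
import networkx as nx

def sample_query_nodes(G, num_seeds, radius, q, seed=0):
    """Select seeds uniformly at random, take the ball B(v, r) around
    each seed v, then draw q query nodes uniformly from the union."""
    rng = random.Random(seed)
    seeds = rng.sample(sorted(G.nodes), num_seeds)
    ball_union = set()
    for v in seeds:
        ball_union |= set(nx.ego_graph(G, v, radius=radius).nodes)
    return rng.sample(sorted(ball_union), min(q, len(ball_union)))

G = nx.karate_club_graph()
Q = sample_query_nodes(G, num_seeds=3, radius=2, q=10)
print(Q)
```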
Finally, the restart probability α is set to α = 0.15, and the starting probabilities s are uniform over Q.

Implementation. All algorithms are implemented in Python with extensive use of the NetworkX package,5 and run on an Intel Xeon 2.83 GHz processor on a machine with 32 GB of RAM.
C. Results
Figure 1 shows the centrality scores achieved by the different algorithms on the small graphs for varying k (note: lower is better). Due to space constraints, we show results on three out of the five small datasets we experimented with; the results on the other two datasets are similar. We present two settings: on the left, the candidates are all nodes (D = V), and on the right, the candidates are only the query nodes (D = Q). We observe that PPR tracks well, and sometimes surpasses, the quality of solutions returned by Greedy, while Degree and Distance often come close to that. The spectral algorithms do not perform as well.
Figure 2 is similar to Figure 1, but shows results on the larger datasets, not including Greedy. For the case where all nodes are candidates, PPR typically has the best performance, followed by Distance, while Degree is unreliable. The spectral algorithms typically perform worse than PPR.
For the case where only the query nodes are candidates, all algorithms demonstrate similar performance, which is most often worse than in the previous setting. Both observations can be explained by the fact that the selection is severely restricted by the requirement D = Q, leaving the best-performing algorithms little flexibility to produce a better solution.
In terms of running time on the larger graphs, Distance executes in a few minutes (with observed times between 15 seconds and 5 minutes), while all other heuristics execute within seconds (all observed times were less than 1 minute). Finally, even though Greedy executes within 1-2 seconds for the small datasets, it does not scale well to the larger datasets (its running time is orders of magnitude worse than that of the heuristics, and it is not included in those experiments).
Based on the above, we conclude that PPR offers the best trade-off of quality versus running time for datasets of at least moderate size (more than 10K nodes).
VIII. CONCLUSIONS
In this paper, we have addressed the problem of finding central nodes in a graph with respect to a set of query nodes Q. Our measure is based on absorbing random walks: we seek to compute k nodes that minimize the expected number of steps that a random walk starting from the query nodes needs before it is "absorbed" by them. We have shown that the problem is NP-hard and described an O(kn³) greedy algorithm to solve it approximately. Moreover, we experimented with heuristic algorithms to solve the problem on large graphs. Our results show that, in practice, personalized PageRank offers a good combination of quality and speed.
5 http://networkx.github.io/
data (cannot run greedy on the larger graphs)

input:
graphs from the previous datasets
query nodes: planted spheres
  k spheres (k = 1)
  radius ρ (ρ = 2 or 3)
  s special nodes inside the spheres (s = 10 or 20)
α = 0.15
small graphs (plots): dolphins, adjnoun, karate
large graphs (plots): oregon, livejournal, roadnet
conclusions
complex problem; the greedy algorithm is expensive; personalized PageRank is a good alternative

future work
expansion strategies
comparison with more alternatives
parallel implementation

The End
where the inequality comes from the fact that a path in $G_X$ passing through $Z$ and being absorbed by $X$ corresponds to a shorter path in $G_Y$ being absorbed by $Y$.
B. Proposition 5

Proposition: Let $C_{i-1}$ be a set of $i-1$ absorbing nodes, $P_{i-1}$ the corresponding transition matrix, and $F_{i-1} = (I - P_{i-1})^{-1}$. Let $C_i = C_{i-1} \cup \{u\}$. Given $F_{i-1}$, the centrality score $ac_Q(C_i)$ can be computed in time $O(n^2)$.
The proof makes use of the following lemma.
Lemma 1 (Sherman-Morrison Formula [7]): Let $M$ be a square $n \times n$ invertible matrix and $M^{-1}$ its inverse. Moreover, let $a$ and $b$ be any two column vectors of size $n$. Then the following equation holds:
$$(M + ab^T)^{-1} = M^{-1} - M^{-1}ab^T M^{-1} / (1 + b^T M^{-1} a).$$
Proof (Proposition 5): Without loss of generality, let the set of absorbing nodes be $C_{i-1} = \{1, 2, \ldots, i-1\}$. For simplicity, assume no self-loops for non-absorbing nodes, i.e., $P_{u,u} = 0$ for $u \in V - C_{i-1}$. As in Section VI, the expected number of steps before absorption is given by the formulas
$$ac_Q(C_{i-1}) = s_Q^T F_{i-1} \mathbf{1}, \quad \text{with } F_{i-1} = A_{i-1}^{-1} \text{ and } A_{i-1} = I - P_{i-1}.$$
We proceed to show how to increase the set of absorbing nodes by one and calculate the new absorption time by updating $F_{i-1}$ in $O(n^2)$. Without loss of generality, suppose we add node $i$ to the absorbing nodes $C_{i-1}$, so that
$$C_i = C_{i-1} \cup \{i\} = \{1, 2, \ldots, i-1, i\}.$$
Let $P_i$ be the transition matrix over $G$ with absorbing nodes $C_i$. As before, the expected absorption time by the nodes $C_i$ is given by the formulas
$$ac_Q(C_i) = s_Q^T F_i \mathbf{1}, \quad \text{with } F_i = A_i^{-1} \text{ and } A_i = I - P_i.$$
Notice that
$$A_i - A_{i-1} = (I - P_i) - (I - P_{i-1}) = P_{i-1} - P_i =
\begin{bmatrix} 0_{(i-1) \times n} \\ p_{i,1} \;\ldots\; p_{i,n} \\ 0_{(n-i) \times n} \end{bmatrix} = ab^T,$$
where $p_{i,j}$ denotes the transition probability from node $i$ to node $j$ in the transition matrix $P_{i-1}$, and the column vectors $a$ and $b$ are defined as
$$a = [\overbrace{0 \ldots 0}^{i-1}\; 1 \;\overbrace{0 \ldots 0}^{n-i}], \quad \text{and} \quad b = [p_{i,1} \;\ldots\; p_{i,n}].$$
By a direct application of Lemma 1, we can compute $F_i$ from $F_{i-1}$ with the following formula, at a cost of $O(n^2)$ operations:
$$F_i = F_{i-1} - (F_{i-1}a)(b^T F_{i-1}) / (1 + b^T(F_{i-1}a)).$$
We have thus shown that, given $F_{i-1}$, we can compute $F_i$, and therefore $ac_Q(C_i)$ as well, in $O(n^2)$.
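Proposition 5 can be checked numerically: making node $i$ absorbing zeroes its outgoing row, so $A_i = A_{i-1} + ab^T$ and one Sherman-Morrison step reproduces the freshly inverted fundamental matrix. A sketch with a random chain (the helper and the chosen indices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)        # row-stochastic walk matrix

def zero_rows(P, absorbing):
    """Transition matrix in which absorbing nodes have no outgoing moves."""
    R = P.copy()
    R[sorted(absorbing), :] = 0.0
    return R

C_prev = {0}                             # current absorbing set C_{i-1}
F_prev = np.linalg.inv(np.eye(n) - zero_rows(P, C_prev))

i = 1                                    # add node i to the absorbing set
a = np.zeros((n, 1)); a[i] = 1.0         # a = e_i
b = zero_rows(P, C_prev)[[i], :].T       # b = row i of P_{i-1}

# one Sherman-Morrison step: O(n^2) instead of a fresh O(n^3) inversion
F_new = F_prev - (F_prev @ a) @ (b.T @ F_prev) / (1.0 + b.T @ F_prev @ a)

F_direct = np.linalg.inv(np.eye(n) - zero_rows(P, C_prev | {i}))
print(np.allclose(F_new, F_direct))
```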
C. Proposition 6

Proposition: Let $C$ be a set of absorbing nodes, $P$ the corresponding transition matrix, and $F = (I - P)^{-1}$. Let $C' = C - \{v\} \cup \{u\}$, for $v \in C$ and $u \notin C$. Given $F$, the centrality score $ac_Q(C')$ can be computed in time $O(n^2)$.

Proof: The proof is similar to the proof of Proposition 5. Again we assume no self-loops for non-absorbing nodes, i.e., $P_{u,u} = 0$ for $u \in V - C$. Without loss of generality, let the two sets of absorbing nodes be
$$C = \{1, 2, \ldots, i-1, i\}, \quad \text{and} \quad C' = \{1, 2, \ldots, i-1, i+1\}.$$
Let $P'$ be the transition matrix with absorbing nodes $C'$. The absorbing centrality for the two sets of absorbing nodes $C$ and $C'$ is expressed as a function of the following two matrices:
$$F = A^{-1}, \text{ with } A = I - P, \quad \text{and} \quad F' = A'^{-1}, \text{ with } A' = I - P'.$$
Notice that
$$A' - A = (I - P') - (I - P) = P - P' =
\begin{bmatrix} 0_{(i-1) \times n} \\ -p_{i,1} \;\ldots\; -p_{i,n} \\ p_{i+1,1} \;\ldots\; p_{i+1,n} \\ 0_{(n-i-1) \times n} \end{bmatrix} = a_2 b_2^T - a_1 b_1^T,$$
where $p_{i,j}$ denotes the transition probability from node $i$ to node $j$ in a transition matrix where neither node $i$ nor node $i+1$ is absorbing, and the column vectors $a_1, b_1, a_2, b_2$ are defined as
$$a_1 = [\overbrace{0 \ldots 0}^{i-1}\; 1 \; 0 \;\overbrace{0 \ldots 0}^{n-i-1}], \quad b_1 = [p_{i,1} \;\ldots\; p_{i,n}],$$
$$a_2 = [\overbrace{0 \ldots 0}^{i-1}\; 0 \; 1 \;\overbrace{0 \ldots 0}^{n-i-1}], \quad b_2 = [p_{i+1,1} \;\ldots\; p_{i+1,n}].$$
By an argument similar to the one in the proof of Proposition 5, we can compute $F'$ from $F$ in the following two steps, each costing $O(n^2)$ operations for the given parenthesization (the first step applies Lemma 1 to the update $+a_2 b_2^T$, the second to $-a_1 b_1^T$):
$$Z = F - (F a_2)(b_2^T F) / (1 + b_2^T (F a_2)),$$
$$F' = Z + (Z a_1)(b_1^T Z) / (1 - b_1^T (Z a_1)).$$
We have thus shown that, given $F$, we can compute $F'$, and therefore $ac_Q(C')$ as well, in time $O(n^2)$.
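The swap update of Proposition 6 can also be verified numerically: adding node $i+1$ to the absorbing set and removing node $i$ amounts to the two rank-1 corrections above. A sketch with a random chain (helper and indices illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
P_free = rng.random((n, n))
P_free /= P_free.sum(axis=1, keepdims=True)   # walk matrix, nothing absorbing yet

def absorb(P, C):
    """Zero out the rows of the absorbing nodes."""
    R = P.copy()
    R[sorted(C), :] = 0.0
    return R

i = 2
C      = {0, 1, i}        # current absorbing set, contains node i
C_swap = {0, 1, i + 1}    # swap node i for node i + 1
F = np.linalg.inv(np.eye(n) - absorb(P_free, C))

e = np.eye(n)
a1, b1 = e[:, [i]],     P_free[[i], :].T      # remove node i
a2, b2 = e[:, [i + 1]], P_free[[i + 1], :].T  # add node i + 1

# two Sherman-Morrison steps, each O(n^2)
Z      = F - (F @ a2) @ (b2.T @ F) / (1.0 + b2.T @ F @ a2)
F_swap = Z + (Z @ a1) @ (b1.T @ Z) / (1.0 - b1.T @ Z @ a1)

F_direct = np.linalg.inv(np.eye(n) - absorb(P_free, C_swap))
print(np.allclose(F_swap, F_direct))
```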