Centrality Measures
Computing Closeness and Betweenness
Andrea Marino
PhD Course on Graph Mining Algorithms, Università di Pisa
Pisa, February 2018
Andrea Marino Centrality Measures
Centrality measures
The problem of identifying the most central nodes in a network is a fundamental question that has been asked many times in a plethora of research areas, such as biology, computer science, sociology, and psychology.
Because of the importance of this question, dozens of centrality measures have been introduced in the literature.
Paolo Boldi, Sebastiano Vigna: Axioms for Centrality. Internet Mathematics 10(3-4): 222-262 (2014)
Some of them
Local indices
  (In)degree
  Number of triangles
Spectral indices, based on some linear-algebra construction
  Recursive definition: a node is important if it is connected to important vertices (PageRank, Katz, Seeley)
Path-based indices, based on the number of paths or shortest paths passing through a vertex
  Shortest paths passing through a vertex (Betweenness)
  Paths ending in a vertex (a different point of view on the Katz index)
Geometric indices, based on distances from a vertex to other vertices
  Average distance of one vertex to all the others (Closeness or Harmonic)
Closeness and Betweenness
Closeness and betweenness centrality are certainly two of the oldest and most widely used centrality measures:
almost all books dealing with network analysis discuss them;
almost all existing network analysis libraries implement algorithms to compute them.
We will examine algorithms for computing these two centrality measures, restricting ourselves to unweighted graphs.
Part I
Closeness Centrality
Closeness
Main Idea
A central node should be very efficient in spreading information to all other nodes.
A node is central if the average number of links needed to reach another node is small.
Definition
In a connected graph, the closeness centrality of a node $v$ is defined as
$$c(v) = \frac{n-1}{f(v)}, \qquad f(v) = \sum_{w \in V} d(v,w),$$
where $f(v)$ is the farness of $v$, $n$ is the number of nodes, and $d(v,w)$ is the distance between the two vertices $v$ and $w$ (that is, the number of edges in a shortest path from $v$ to $w$).
If the graph is not (strongly) connected
Researchers have proposed various ways to extend this definition: here, we focus on Lin's index.
Definition
Let $R(v)$ be the set of vertices reachable from $v$, and let $r(v)$ denote its cardinality (note that $v \in R(v)$ by definition). Then, the closeness centrality of a node $v$ is equal to
$$c(v) = \frac{r(v)-1}{f(v)} \cdot \frac{r(v)-1}{n-1} = \frac{(r(v)-1)^2}{(n-1)\,f(v)},$$
where $f(v) = \sum_{w \in R(v)} d(v,w)$.
N. Lin. Foundations of social research. McGraw-Hill, 1976.
Exact Closeness Computation
Computing the closeness value of a single node $v$ can easily be done by executing a breadth-first search starting from $v$, in time $O(m)$.
If we want to compute the closeness value of all nodes of the graph, then the time complexity is $O(nm)$.
$O(nm)$ time is not affordable whenever we deal with very large graphs.
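The exact computation can be sketched in a few lines of Python. This is only an illustrative sketch: the adjacency-list dictionary `adj` and the function names are assumptions introduced here, not part of the slides. It uses Lin's index, so it also works when some nodes are unreachable, and it assumes the graph has at least two nodes.

```python
from collections import deque

def bfs_distances(adj, source):
    """Map every node reachable from `source` to its BFS distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness(adj, v):
    """Lin's closeness (r(v) - 1)^2 / ((n - 1) f(v)); 0 if v reaches no other node."""
    n = len(adj)
    dist = bfs_distances(adj, v)
    r = len(dist)                 # v itself is counted in r(v)
    farness = sum(dist.values())  # f(v)
    if farness == 0:
        return 0.0
    return (r - 1) ** 2 / ((n - 1) * farness)

def all_closeness(adj):
    """One BFS per node: O(nm) overall."""
    return {v: closeness(adj, v) for v in adj}

# Hypothetical example: the path 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = all_closeness(path)
```

On a connected graph $r(v) = n$, so the value coincides with $(n-1)/f(v)$.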
Approximating Closeness Centrality
Consider a simple algorithm for computing the closeness centrality in undirected unweighted graphs, which is based on random sampling.
This algorithm performs $k$ breadth-first searches from $k$ random nodes $v_1, \ldots, v_k$ and, for any node $u$, returns
$$\hat{c}(u) = \frac{1}{\sum_{i=1}^{k} \frac{n\, d(v_i,u)}{k(n-1)}}.$$
Theorem
If $k = \Theta\left(\frac{\log n}{\varepsilon^2}\right)$, then with high probability
$$\left| \frac{1}{\hat{c}(u)} - \frac{1}{c(u)} \right| < \varepsilon.$$
David Eppstein, Joseph Wang: Fast approximation of centrality. SODA 2001: 228-229
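A minimal sketch of the sampling estimator, assuming a connected undirected graph stored as an adjacency-list dictionary `adj`. The function names, the `seed` parameter, and the choice of sampling pivots with replacement are assumptions made for illustration.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Map every node reachable from `source` to its BFS distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def approximate_closeness(adj, k, seed=0):
    """Estimate c(u) for every u from k BFSs started at random pivots."""
    rng = random.Random(seed)
    nodes = list(adj)
    n = len(nodes)
    total = {u: 0 for u in nodes}      # sum over pivots of d(v_i, u)
    for _ in range(k):
        pivot = rng.choice(nodes)      # assumption: pivots drawn with replacement
        dist = bfs_distances(adj, pivot)
        for u in nodes:
            total[u] += dist[u]        # graph assumed connected, so dist[u] exists
    # c_hat(u) = 1 / sum_i [ n d(v_i, u) / (k (n - 1)) ]
    return {
        u: k * (n - 1) / (n * total[u]) if total[u] > 0 else float("inf")
        for u in nodes
    }
```

Each BFS costs $O(m)$, so the whole estimate takes $O(km)$ time instead of $O(nm)$.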
Top-k central nodes
We focus on the problem of computing only the $k$ most important nodes with respect to closeness centrality.
Computing an approximation of the closeness values does not guarantee that we can determine the top-$k$ nodes.
Theorem
On directed sparse graphs, in the worst case, an algorithm computing the most closeness-central vertex in time $O(m^{2-\varepsilon})$ for some $\varepsilon > 0$ would falsify the Strong Exponential Time Hypothesis (SETH).
Elisabetta Bergamini, Michele Borassi, Pierluigi Crescenzi, Andrea Marino, Henning Meyerhenke: Computing top-k Closeness Centrality Faster in Unweighted Graphs. CoRR abs/1704.01077 (2017). To appear.
Exact Computation
The simplest algorithm for computing the $k$ vertices with largest closeness works as follows:
1. It performs a breadth-first search from each vertex $v$,
2. it computes its closeness $c(v)$, and,
3. finally, it returns the $k$ vertices with the biggest $c(v)$ values.
Speeding up the Algorithm: General Schema
1. For each vertex $v$, it sets $c(v)$ equal to the result of a pruned breadth-first search.
   This pruned BFS receives as input the starting node $v$ and a value $x_k$, which is the $k$-th biggest closeness value found so far ($x_k = 0$ if we have not yet processed at least $k$ vertices).
2. If this pruned BFS returns the value 0, then $v$ is not one of the $k$ most central vertices; otherwise, $c(v)$ is the actual closeness of $v$.
3. At the end, the $k$ vertices with the biggest closeness values are again the $k$ most central vertices.
The order of the nodes
To speed up the pruned BFS, we want $x_k$ to be as big as possible, and consequently we need to process central vertices as soon as possible. To this end, we process vertices in decreasing order of degree.
Time cost analysis
This algorithm needs a pre-processing phase, which requires linear time.
It requires time $O(n \log n)$ to sort vertices, and it needs a priority queue containing at each step the $k$ most central vertices found so far.
Since all other operations need time $O(1)$, the total running time is $O(m + n \log n + n \log k + T) = O(m + n \log n + T)$,
where $O(\log k)$ is the time necessary to execute extraction and update operations in the priority queue, and
$T$ is the time needed to perform the pruned breadth-first search $n$ times.
We can easily parallelise this algorithm by giving each vertex to a different thread: there could be some race condition on $x_k$, but it does not affect correctness or performance.
The pruned breadth-first search
Reminder
A pruned BFS receives as input the starting node $v$ and a value $x_k$, which is the $k$-th biggest closeness value found so far ($x_k = 0$ if we have not yet processed at least $k$ vertices).
The pruning of a BFS started from node $v$ makes use of an upper bound $\tilde{c}_d(v)$ on the closeness of $v$, which has to be updated whenever, for any $d \ge 0$, the exploration of the $d$-th level of the breadth-first search tree is finished.
This upper bound is obtained by proving a lower bound on the farness $f(v)$ of $v$, since
$$c(v) = \frac{(r(v)-1)^2}{(n-1)\,f(v)},$$
where, recall, $f(v) = \sum_{w \in R(v)} d(v,w)$.
A lower bound on the farness
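If $\Gamma_d(v)$ denotes the set of nodes at level $d$ of the BFS tree started from $v$ (so that $\Gamma_0(v) = \{v\}$) and if $\gamma_d(v) = |\Gamma_d(v)|$, then
$$f(v) \ge f_d(v) + (d+1)\,\gamma_{d+1}(v) + (d+2)\,(r(v) - n_{d+1}(v)),$$
where
$$f_d(v) = \sum_{i=1}^{d} i \cdot |\Gamma_i(v)| \quad\text{and}\quad n_d(v) = \sum_{i=0}^{d} |\Gamma_i(v)|.$$
Here $n_d(v)$ counts $v$ itself, consistently with $r(v)$. The bound holds because the nodes at level $d+1$ are at distance $d+1$, and all the remaining reachable nodes are at distance at least $d+2$.
Since $n_{d+1}(v) = \gamma_{d+1}(v) + n_d(v)$, we have that
$$f(v) \ge f_d(v) - \gamma_{d+1}(v) + (d+2)\,(r(v) - n_d(v)).$$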
At the end of the exploration of the $d$-th level of the breadth-first search tree, we do not yet know the value of $\gamma_{d+1}(v)$. However, we can certainly say that this value is not greater than the sum of the degrees of all nodes at level $d$, that is,
$$\gamma_{d+1}(v) \le \sum_{u \in \Gamma_d(v)} \deg(u) := \tilde{\gamma}_{d+1}(v).$$
Hence,
$$f(v) \ge f_d(v) - \tilde{\gamma}_{d+1}(v) + (d+2)\,(r(v) - n_d(v)) := \tilde{f}_d(v, r(v)).$$
This lower bound on the farness implies the following upper bound on the closeness:
$$c(v) \le \frac{(r(v)-1)^2}{n-1} \cdot \frac{1}{\tilde{f}_d(v, r(v))} := \tilde{c}_d(v).$$
Summary
After the exploration of the first $d$ levels of the breadth-first search started from node $v$, an upper bound on the closeness value of $v$ can be computed.
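As a small worked sketch, the upper bound can be evaluated from four per-level quantities. The function name and argument names below are hypothetical; `n_d` is assumed to count the start node itself, and `gamma_tilde` is the sum of the degrees of the nodes at level $d$.

```python
def closeness_upper_bound(d, f_d, n_d, gamma_tilde, r, n):
    """Upper bound on c(v) after the first d BFS levels have been explored.

    d           -- index of the last fully explored level
    f_d         -- sum of the distances to the nodes found so far
    n_d         -- number of nodes found so far, including v
    gamma_tilde -- sum of the degrees of the nodes at level d
    r, n        -- number of nodes reachable from v (v included), number of nodes
    """
    farness_lower = f_d - gamma_tilde + (d + 2) * (r - n_d)
    if farness_lower <= 0:
        return float("inf")  # the farness bound is vacuous, so nothing can be pruned
    return (r - 1) ** 2 / ((n - 1) * farness_lower)

# Hypothetical example: path 0 - 1 - 2 - 3 - 4, BFS from node 0, level d = 1 explored.
# Level 1 is {1}: f_1 = 1, n_1 = 2, and deg(1) = 2.
bound = closeness_upper_bound(d=1, f_d=1, n_d=2, gamma_tilde=2, r=5, n=5)
true_closeness = (5 - 1) ** 2 / ((5 - 1) * (1 + 2 + 3 + 4))
```

In this example the farness lower bound is 8 while the true farness is 10, so the bound 0.5 is above the true closeness 0.4.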
Using the upper bound to prune the BFS
At the end of the exploration of the $d$-th level of the breadth-first search tree, the upper bound $\tilde{c}_d(v)$ is computed and compared to $x_k$.
If $x_k > \tilde{c}_d(v) \ge c(v)$, the BFS is interrupted, since $v$ is certainly not among the top-$k$ vertices.
Otherwise, the BFS continues with the exploration of the next level of the search tree.
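The general schema and the pruning rule can be put together as in the following sketch. It is a simplified illustration, not the authors' implementation: it assumes an undirected graph with at least two nodes stored as an adjacency-list dictionary `adj` with comparable node labels, so that $r(v)$ is the size of the connected component of $v$. All function names are hypothetical.

```python
import heapq
from collections import deque

def component_sizes(adj):
    """r(v) for an undirected graph: size of the connected component of v."""
    size = {}
    for s in adj:
        if s in size:
            continue
        comp, seen, queue = [s], {s}, deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    comp.append(w)
                    queue.append(w)
        for u in comp:
            size[u] = len(comp)
    return size

def pruned_bfs(adj, v, x_k, r, n):
    """Return c(v), or 0 as soon as the level-d upper bound falls below x_k."""
    dist = {v: 0}
    level = [v]
    d, f_d, n_d = 0, 0, 1
    while level:
        gamma_tilde = sum(len(adj[u]) for u in level)
        lower = f_d - gamma_tilde + (d + 2) * (r - n_d)   # lower bound on f(v)
        if lower > 0 and (r - 1) ** 2 / ((n - 1) * lower) < x_k:
            return 0.0                                     # v cannot be in the top k
        nxt = []
        for u in level:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = d + 1
                    nxt.append(w)
        d += 1
        f_d += d * len(nxt)
        n_d += len(nxt)
        level = nxt
    return (r - 1) ** 2 / ((n - 1) * f_d) if f_d > 0 else 0.0

def top_k_closeness(adj, k):
    """Return the k pairs (closeness, node) with largest closeness."""
    n = len(adj)
    r = component_sizes(adj)
    heap = []  # min-heap holding the best k pairs found so far
    for v in sorted(adj, key=lambda u: len(adj[u]), reverse=True):
        x_k = heap[0][0] if len(heap) >= k else 0.0
        c = pruned_bfs(adj, v, x_k, r[v], n)
        if len(heap) < k:
            heapq.heappush(heap, (c, v))
        elif c > heap[0][0]:
            heapq.heapreplace(heap, (c, v))
    return sorted(heap, reverse=True)

# Hypothetical example: the path 0 - 1 - 2 - 3 - 4, whose most central node is 2.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
best = top_k_closeness(path, 1)
```

The min-heap plays the role of the priority queue from the time cost analysis: its smallest element is $x_k$, and each update costs $O(\log k)$.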
Computing the upper bound
We have said:
$$c(v) \le \frac{(r(v)-1)^2}{n-1} \cdot \frac{1}{\tilde{f}_d(v, r(v))} := \tilde{c}_d(v).$$
Everything is known after the exploration of the $d$-th level of the BFS tree, apart from the value $r(v)$.
If the graph is undirected, $r(v)$ can easily be pre-computed (connected components).
If the graph is directed and strongly connected, then $r(v) = n$.
It remains to deal with the case in which the graph is directed but not strongly connected.
The directed but not strongly connected case
Let us assume, for now, that we know a lower (respectively, upper) bound $\alpha(v)$ (respectively, $\omega(v)$) on $r(v)$.
Without loss of generality, we can assume that $\alpha(v) > 1$.
We now show that, instead of examining all possible values of $r(v)$ between $\alpha(v)$ and $\omega(v)$, it is sufficient to examine only the two extremes of this interval.
We prove the following lower bound $\lambda_d(v)$ on $\frac{1}{c(v)}$:
Lemma
$$\frac{1}{c(v)} \ge \lambda_d(v) = (n-1)\,\min\left( \frac{\tilde{f}_d(v, \alpha(v))}{(\alpha(v)-1)^2},\ \frac{\tilde{f}_d(v, \omega(v))}{(\omega(v)-1)^2} \right).$$
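If we denote $a = d+2$ and $b = \tilde{\gamma}_{d+1}(v) + a\,(n_d(v) - 1) - f_d(v)$, we have that
$$\begin{aligned} f(v) &\ge f_d(v) - \tilde{\gamma}_{d+1}(v) + a\,(r(v) - n_d(v)) \\ &= a\,(r(v)-1) + f_d(v) - \tilde{\gamma}_{d+1}(v) - a\,(n_d(v)-1) \\ &= a\,(r(v)-1) - b. \end{aligned}$$
Note that $a > 0$ because $d > 0$, and $b > 0$ because
$$f_d(v) = \sum_{w \in N_d(v)} d(v,w) \le d\,(n_d(v)-1) < a\,(n_d(v)-1),$$
where $N_d(v) = \bigcup_{i=0}^{d} \Gamma_i(v)$ is the set of nodes at distance at most $d$ from $v$ (so that $n_d(v) = |N_d(v)|$), and the first inequality holds because, if $w = v$, then $d(v,w) = 0$, and if $w \in N_d(v)$, then $d(v,w) \le d$. Hence,
$$\frac{1}{c(v)} \ge (n-1)\,\frac{a\,(r(v)-1) - b}{(r(v)-1)^2}.$$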
Let us consider the function $g(x) = \frac{ax - b}{x^2}$.
The derivative $g'(x) = \frac{-ax + 2b}{x^3}$ is positive for $0 < x < \frac{2b}{a}$ and negative for $x > \frac{2b}{a}$:
this means that $\frac{2b}{a}$ is a local maximum, and there are no local minima for $x > 0$.
Consequently, in each closed interval $[x_1, x_2]$ where $x_1$ and $x_2$ are positive, the minimum of $g(x)$ is reached in $x_1$ or $x_2$.
Since $0 < \alpha(v) - 1 \le r(v) - 1 \le \omega(v) - 1$,
$$g(r(v)-1) \ge \min\big(g(\alpha(v)-1),\ g(\omega(v)-1)\big).$$
[Figure: plot of the function $g(x) = \frac{ax - b}{x^2}$ with $a = b = 1$. There is a local maximum but no local minimum: hence, in each closed interval the minimum is reached at one of the extremes of the interval.]
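The lemma can be checked numerically with a short sketch. The function names and the example values below are assumptions chosen for illustration; the check compares the two-endpoint formula for $\lambda_d(v)$ with a brute-force minimum over every admissible value of $r(v)$.

```python
def farness_lower_bound(d, f_d, n_d, gamma_tilde, r):
    """Lower bound on f(v) for a given candidate value r of r(v)."""
    return f_d - gamma_tilde + (d + 2) * (r - n_d)

def inverse_closeness_lower_bound(d, f_d, n_d, gamma_tilde, alpha, omega, n):
    """lambda_d(v): the bound evaluated only at r = alpha and r = omega (alpha > 1)."""
    candidates = [
        farness_lower_bound(d, f_d, n_d, gamma_tilde, r) / (r - 1) ** 2
        for r in (alpha, omega)
    ]
    return (n - 1) * min(candidates)

def brute_force_bound(d, f_d, n_d, gamma_tilde, alpha, omega, n):
    """Same bound, minimised over every integer r in [alpha, omega]."""
    return (n - 1) * min(
        farness_lower_bound(d, f_d, n_d, gamma_tilde, r) / (r - 1) ** 2
        for r in range(alpha, omega + 1)
    )

# Hypothetical values: d = 1, f_d = 1, n_d = 2, gamma_tilde = 2, so a = 3 and b = 4.
args = dict(d=1, f_d=1, n_d=2, gamma_tilde=2, alpha=3, omega=8, n=10)
two_point = inverse_closeness_lower_bound(**args)
exhaustive = brute_force_bound(**args)
```

With these values $g(x) = (3x - 4)/x^2$ on $[2, 7]$, and both computations return the value at the right endpoint, $9 \cdot 17/49$.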
It now remains to compute $\alpha(v)$ and $\omega(v)$ (in the case of a directed graph which is not strongly connected).
This can be done during the pre-processing phase of the algorithm as follows.
Let $G_{scc} = (V_{scc}, E_{scc})$ be the component graph of $G$, whose nodes are the strongly connected components (SCCs) of $G$, and, for any SCC $D$, let $w(D)$ denote the number of nodes in $D$.
If $v$ and $w$ are in the same SCC $C$, then $r(v) = r(w) = \sum_{D \in r(C)} w(D)$, where $r(C)$ denotes the set of SCCs that are reachable from $C$ in $G_{scc}$.
Hence, we simply need to compute a lower (respectively, upper) bound $\alpha(C)$ (respectively, $\omega(C)$) on $\sum_{D \in r(C)} w(D)$, for every SCC $C$.
To compute a lower (respectively, upper) bound $\alpha(C)$ (respectively, $\omega(C)$) on $\sum_{D \in r(C)} w(D)$ for every SCC $C$, we proceed as follows.
We first compute a topological sort $\{C_1, \ldots, C_l\}$ of $V_{scc}$ (that is, if $(C_i, C_j) \in E_{scc}$, then $i < j$).
Then we use a dynamic programming approach: starting from $C_l$, we process the SCCs in reverse topological order, and we set
$$\alpha(C) = w(C) + \max_{(C,D) \in E_{scc}} \alpha(D), \qquad \omega(C) = w(C) + \sum_{(C,D) \in E_{scc}} \omega(D),$$
where the maximum and the sum are taken to be 0 when $C$ has no outgoing edges.
Processing the SCCs in reverse topological order ensures that the values $\alpha(D)$ and $\omega(D)$ on the right-hand side of these equalities are available when we process the SCC $C$.
Clearly, the complexity of computing $\alpha(C)$ and $\omega(C)$, for each SCC $C$, is linear in the size of $G_{scc}$, which is no larger than $G$.
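A minimal sketch of this dynamic program, assuming the component graph is already available as two dictionaries: `weight` maps each SCC to its number of nodes and `succ` maps each SCC to the SCCs it points to. These names and the example DAG are hypothetical. The sketch relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); passing the successor lists where the class expects predecessors makes `static_order()` yield every SCC after all of its successors, which is the reverse topological order needed here.

```python
from graphlib import TopologicalSorter

def reach_bounds(weight, succ):
    """Lower bounds alpha(C) and upper bounds omega(C) on the number of reachable nodes."""
    alpha, omega = {}, {}
    # Successors are passed as "predecessors", so each SCC comes after its successors.
    for c in TopologicalSorter(succ).static_order():
        out = succ.get(c, [])
        alpha[c] = weight[c] + max((alpha[d] for d in out), default=0)
        omega[c] = weight[c] + sum(omega[d] for d in out)
    return alpha, omega

# Hypothetical component graph: A -> B, A -> C, B -> D, C -> D.
weight = {"A": 1, "B": 2, "C": 3, "D": 4}
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
alpha, omega = reach_bounds(weight, succ)
```

In the example, the true number of nodes reachable from A is 10: the lower bound 8 follows only the heaviest branch, while the upper bound 14 counts D twice.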
Observe that the bounds obtained through this simple approach can be improved by using some “tricks”.
When the biggest SCC $C^*$ is processed, we do not use the dynamic programming approach, and we can exactly compute $\sum_{D \in r(C^*)} w(D)$ by simply performing a BFS starting from any node in $C^*$.
We get exact $\alpha(C^*)$ and $\omega(C^*)$.
Also, $\alpha(C)$ and $\omega(C)$ are improved for each SCC $C$ from which it is possible to reach $C^*$.
In order to compute the upper bounds for the SCCs that are able to reach $C^*$, we can run the dynamic programming algorithm on the graph obtained from $G_{scc}$ by removing all components reachable from $C^*$, and we can then add $\sum_{D \in r(C^*)} w(D)$.
IMDB
We analyze the Internet Movie Database (in short, IMDB) graph, where nodes are actors, and two actors are connected if they played together in a movie (TV series are ignored).
The data can be collected from the website http://www.imdb.com (some genres can be excluded, such as awards shows, documentaries, game shows, news, reality shows, and talk shows).
We can then analyze snapshots of the actor graph, taken every 5 years from 1940 to 2010, and in 2014.
[Table: the most central actors in the IMDB graph with respect to the closeness centrality measure.]
The total time needed to perform the computation with 30 threads is less than 40 minutes!
Part II
Betweenness centrality
Another popular centrality measure is betweenness centrality, which ranks the nodes according to their participation in the shortest paths between other node pairs.
Intuitively, betweenness measures a node's influence on the information flow circulating through the social network, under the assumption that the flow follows shortest paths.
Definition
Let $\sigma_{s,t}$ be the number of shortest paths going from node $s$ to node $t$, and let $\sigma_{s,t}(v)$ be the number of shortest paths going from node $s$ to node $t$ and passing through node $v$. Then, the betweenness centrality value of node $v$ is equal to
$$b(v) = \sum_{s \ne v,\ t \ne v} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}}.$$
In order to compute $b(v)$, we can compute the contribution of a node $s \ne v$ to $b(v)$, that is, the value
$$b_s(v) = \sum_{t \ne v} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}}.$$
To this aim, we make use of the so-called Brandes algorithm, which performs two basic steps (in the following, we restrict ourselves to undirected unweighted connected graphs).
1. An augmented breadth-first search starting from $s$, which allows us to compute, for every node $t$, the value $\sigma_{s,t}$.
2. An accumulation phase, which uses the breadth-first search DAG constructed during the previous phase in order to compute, for every node $v$, the value $b_s(v)$.
U. Brandes. A faster algorithm for betweenness centrality. The Journal of Mathematical Sociology, 25:163–177, 2001.
Augmented breadth-first search
During the augmented BFS, each node $v$ maintains a list $A(v)$ of its predecessors in a shortest path from the starting node $s$, and the number $\sigma_{s,v}$ of shortest paths from $s$ to $v$.
Each time a node $v$ is inserted into the queue, the node $u$ which inserted it is also memorized, along with its distance from $s$ and its value $\sigma_{s,u}$.
Once the node $v$ is extracted from the queue, if $d(s,v) = d(s,u) + 1$, then the node $u$ is added to the list $A(v)$, and $\sigma_{s,v}$ is increased by $\sigma_{s,u}$.
At the end of the augmented breadth-first search, each node $v$ has computed its value $\sigma_{s,v}$ and the set of all its predecessors.
An Example
[Figure: example run of the augmented BFS. After the augmented phase, each node $v$ has computed its value $\sigma_{s,v}$ and the set of all its predecessors.]
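The augmented BFS can be sketched as follows. Note that this sketch uses the common formulation in which $\sigma_{s,v}$ and the predecessor list are updated while the neighbours of the extracted node are scanned, which differs slightly in bookkeeping from the queue-based description above but computes the same values. The adjacency-list dictionary `adj` and the function name are assumptions.

```python
from collections import deque

def augmented_bfs(adj, s):
    """Return distances, shortest-path counts, predecessor lists and the visit order from s."""
    dist = {s: 0}
    sigma = {s: 1}       # sigma[v] = number of shortest paths from s to v
    preds = {s: []}      # preds[v] = the list A(v) of predecessors of v
    order = []           # nodes in order of extraction (non-decreasing distance)
    queue = deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in dist:              # v is discovered for the first time
                dist[v] = dist[u] + 1
                sigma[v] = 0
                preds[v] = []
                queue.append(v)
            if dist[v] == dist[u] + 1:     # u lies on a shortest path from s to v
                sigma[v] += sigma[u]
                preds[v].append(u)
    return dist, sigma, preds, order

# Hypothetical example: the 4-cycle 0 - 1 - 2 - 3 - 0, started from node 0.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
dist, sigma, preds, order = augmented_bfs(cycle, 0)
```

In the example, node 2 is reached by two shortest paths, one through node 1 and one through node 3.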
Accumulation phase
During the accumulation phase, each node $v$ distributes its accumulated value $X(v)$ to its predecessors $u_1, \ldots, u_h$, proportionally to their values $\sigma_{s,u_i}$.
Each node $v$ receives from its successors $w_1, \ldots, w_k$ in the breadth-first search DAG the values $x_1, \ldots, x_k$.
It then computes the sum $X(v) = 1 + x_1 + \cdots + x_k$, and it “sends” to each $u_i$ the value $X(v)\,\frac{\sigma_{s,u_i}}{\sigma_{s,v}}$.
By using dynamic programming, this process can be done in linear time, starting from the nodes in the DAG which have no outgoing edges.
All the information necessary to execute the process is available at the right time.
Finally, for each node $v$, we can set $b_s(v) = X(v) - 1$ (remember that we do not want to count the paths arriving at $v$).
After the accumulation phase, each node $v$ has computed its value $b_s(v)$.
The whole algorithm
By executing the augmented BFS starting from each node $s$ and by executing the corresponding accumulation phase, we can compute for each $v$ the value $b_s(v)$.
Hence, the last step is to compute the betweenness centrality value of $v$ by summing up all these values, that is,
$$b(v) = \sum_{s \ne v} b_s(v).$$
The time complexity of the Brandes algorithm is $O(nm)$, since for each starting node $s$ we visit the breadth-first search DAG twice.
Since this time complexity is not affordable whenever the graph is very large, several approximation algorithms have been proposed.
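Putting the two phases together gives the following sketch of the whole algorithm for unweighted graphs. It is an illustrative version under stated assumptions: the graph is an adjacency-list dictionary `adj`, the function and variable names are hypothetical, and the sum ranges over ordered pairs $(s, t)$, so on an undirected graph every unordered pair is counted twice.

```python
from collections import deque

def betweenness(adj):
    """Betweenness of every node via one augmented BFS plus accumulation per source."""
    b = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: augmented BFS from s.
        dist, sigma, preds, order = {s: 0}, {s: 1}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    sigma[w] = 0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Phase 2: accumulation, from the farthest nodes back towards s.
        received = {v: 0.0 for v in order}    # x_1 + ... + x_k received so far
        for v in reversed(order):
            x_v = 1.0 + received[v]           # X(v)
            for u in preds[v]:
                received[u] += x_v * sigma[u] / sigma[v]
            if v != s:
                b[v] += x_v - 1.0             # b_s(v) = X(v) - 1
    return b

# Hypothetical examples: the path 0 - 1 - 2 and the 4-cycle 0 - 1 - 2 - 3 - 0.
path_scores = betweenness({0: [1], 1: [0, 2], 2: [1]})
cycle_scores = betweenness({0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]})
```

Processing the nodes in reverse extraction order guarantees that every node has received the values of all its successors before it sends its own value to its predecessors.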
Thanks
Part of these slides is based on a chapter written by Pierluigi Crescenzi for his course "Algorithms for Graph Mining".