Ranking of Closeness Centrality for Large-Scale Social ... ? Â· Ranking of Closeness Centrality for Large-Scale ... that under certain conditions, ... The problem to solve in this paper is to nd the top kvertices with the

Embed Size (px)

Text of Ranking of Closeness Centrality for Large-Scale Social ... ? Â· Ranking of Closeness Centrality...

  • Ranking of Closeness Centrality for Large-ScaleSocial Networks

    Kazuya Okamoto1, Wei Chen2, and Xiang-Yang Li3

    1 Kyoto University, okia@kuis.kyoto-u.ac.jp2 Microsoft Research Asia, weic@microsoft.com

    3 Illinois Institute of Technology and Microsoft Research Asiaxli@cs.iit.edu

    Abstract. Closeness centrality is an important concept in social net-work analysis. In a graph representing a social network, closeness cen-trality measures how close a vertex is to all other vertices in the graph. Inthis paper, we combine existing methods on calculating exact values andapproximate values of closeness centrality and present new algorithmsto rank the top-k vertices with the highest closeness centrality. We showthat under certain conditions, our algorithm is more efficient than thealgorithm that calculates the closeness-centralities of all vertices.

    1 Introduction

    Social networks have been the subject of study for many decades in social sci-ence research. In recent years, with the rapid growth of Internet and World WideWeb, many large-scale online-based social networks such as Facebook, Friend-ster appear, and many large-scale social network data, such as coauthorship net-works, become easily available online for analysis [Ne04a, Ne04b, EL05, PP02].A social network is typically represented as a graph, with individual personsrepresented as vertices, the relationships between pairs of individuals as edges,and the strengths of the relationships represented as the weights on edges (forthe purpose of finding the shortest weighted distance, we can treat lower-weightedges as stronger relationships). Centrality is an important concept in studyingsocial networks [Fr79, NP03]. Conceptually, centality measures how central anindividual is positioned in a social network. Within graph theory and networkanalysis, various measures (see [KL05] for details) of the centrality of a vertexwithin a graph have been proposed to determine the relative importance of avertex within the graph. Four measures of centrality that are widely used in net-work analysis are degree centrality, betweenness centrality4, closeness centrality,and eigenvector centrality5. In this paper, we focus on shortest-path closeness

    4 For a graph G = (V,E), the betweenness centrality CB(v) for a vertex v is CB(v) =s,t:s 6=t 6=v

    v(s,t)(s,t)

    where (s, t) is the number of shortest paths from s to t, and

    v(s, t) is the number of shortest paths from s to t that pass through v.5 Given a graph G = (V,E) with adjacency matrix A, let xi be the eigenvector central-

    ity of the ith node vi. Then vector x = (x1, x2, , xn)T is the solution of equation

  • centrality (or closeness centrality for short) [Ba50,Be65]. The closeness central-ity of a vertex in a graph is the inverse of the average shortest-path distance fromthe vertex to any other vertex in the graph. It can be viewed as the efficiencyof each vertex (individual) in spreading information to all other vertices. Thelarger the closeness centrality of a vertex, the shorter the average distance fromthe vertex to any other vertex, and thus the better positioned the vertex is inspreading information to other vertices.

    The closeness centrality of all vertices can be calculated by solving all-pairs shortest-paths problem, which can be solved by various algorithms takingO(nm + n2 log n) time [Jo77, FT87], where n is the number of vertices and mis the number of edges of the graph. However, these algorithms are not efficientenough for large-scale social networks with millions or more vertices. In [EW04],Eppstein and Wang developed an approximation algorithm to calculate the close-ness centrality in time O( logn2 (n log n + m)) within an additive error of forthe inverse of the closeness centrality (with probability at least 1 1n ), where > 0 and is the diameter of the graph.

    However, applications may be more interested in ranking vertices with highcloseness centralities than the actual values of closeness centralities of all ver-tices. Suppose we want to use the approximation algorithm of [EW04] to rankthe closeness centralities of all vertices. Since the average shortest-path distancesare bounded above by , the average difference in average distance (the inverseof closeness centrality) between the ith-ranked vertex and the (i + 1)th-rankedvertex (for any i = 1, . . . , n 1) is O(n ). To obtain a reasonable ranking result,we would like to control the additive error of each estimate of closeness central-ity to within O(n ), which means we set to (

    1n ). Then the algorithm takes

    O(n2 log n(n log n+m)) time, which is worse than the exact algorithm.

    Therefore, we cannot use either purely the exact algorithm or purely theapproximation algorithm to rank closeness centralities of vertices. In this paper,we show a method of ranking top k highest closeness centrality vertices, com-bining the approximation algorithm and the exact algorithm. We first providea basic ranking algorithm TOPRANK(k), and show that under certain condi-tions, the algorithm ranks all top k highest closeness centrality vertices (with

    high probability) in O((k+ n23 log

    13 n)(n log n+m)) time, which is better than

    Ax = x, where is the greatest eigenvalue of A to ensure that all values xi arepositive by the Perron-Frobenius theorem. Googles PageRank [BP98] is a variantof the eigenvector centrality measure. The PageRank vector R = (r1, r2, , rn)T ,where ri is the PageRank of webpage i and n is the total number of webpages, isthe solution of the equation

    R =1 dn

    1 + dLR.

    Here d is a damping factor set around 0.85, L is a modified webpage-adjacencymatrix: li,j = 0 if page j does not link to i, and normalised such that, for each j,ni=1 li,j = 1, i.e., li,j =

    ai,jdj

    where ai,j = 1 only if page j has link to page i, and

    dj =ni=1 ai,j is the out-degree of page j.

  • O(n(n log n+m)) (when k = o(n)), the time needed by a brute-force algorithmthat simply computes all average shortest distances and then ranks them. Wethen use a heuristic to further improve the algorithm. Our work can be viewedas the first step toward designing and evaluating efficient algorithms in findingtop ranking vertices with highest closeness centralities. We discuss in the endseveral open problems and future directions of this work.

    2 Preliminary

    We consider a connected weighted undirected graph G = (V,E) with n verticesand m edges (|V | = n, |E| = m). We use d(v, u) to denote the length of ashortest-path between v and u, and to denote the diameter of graph G, i.e., = maxv,uV d(v, u). The closeness centrality cv of vertex v [Be65] is definedas

    cv =n 1

    uV d(v, u). (2.1)

    In other words, the closeness centrality of v is the inverse of the average (shortest-path) distance from v to any other vertex in the graph. The higher the cv, theshorter the average distance from v to other vertices, and v is more importantby this measure. Other definitions of closeness centralities exist. For example,some define the closeness centrality of a vertex v as 1uV d(v,u) [Sa66], and some

    define the closeness centrality as the mean geodesic distance (i.e the shortest

    path) between a vertex v and all other vertices reachable from it, i.e., uV d(v,u)n1 ,where n 2 is the size of the networks connected component V reachable fromv. In this paper, we will focus on closeness centrality defined in equation (2.1).

    The problem to solve in this paper is to find the top k vertices with thehighest closeness centralities and rank them, where k is a parameter of thealgorithm. To solve this problem, we combine the exact algorithm [Jo77, FT87]and the approximation algorithm [EW04] for computing average shortest-pathdistances to rank vertices on the closeness centrality. The exact algorithm iteratesDijkstras single-source shortest-paths (SSSP for short) algorithm n times for alln vertices to compute the average shortest-path distances. The original DijkstrasSSSP algorithm [Di59] computes all shortest-path distances from one vertex, andit can be efficiently implemented in O(n log n+m) time [Jo77,FT87].

    The approximation algorithm RAND given in [EW04] also uses DijkstrasSSSP algorithm. RAND samples ` vertices uniformly at random and computesSSSP from each sample vertex. RAND estimates the closeness centrality of avertex using the average of ` shortest-path distances from the vertex to the `sample vertices instead of to all n vertices. The following bound on the accu-racy of the approximation is given in [EW04], which utilizes the Hoeffdingstheorem [Ho63]:6

    Pr{| 1cv 1cv| } 2

    n2`2

    logn (n1n )

    2, (2.2)

    6 In this paper, log means natural logarithm with base e.

  • for any small positive value , where cv is the estimated closeness centrality ofvertex v. Let av be the average shortest-path distance of vertex v, i.e.,

    av =uV d(v, u)

    n 1=

    1

    cv.

    Using the average distance, inequality (2.2) can be rewritten as

    Pr{|av av| } 2

    n2`2

    logn (n1n )

    2, (2.3)

    where av is the estimated average distance of vertex v to all other vertices. Ifthe algorithm uses ` = logn2 samples ( > 1 is a constant number) which willcause the probability of error at each vertex to be bounded above by 1n2 ,the probability of error anywhere in the graph is then bounded from aboveby 1n ( 1 (1

    1n2 )

    n). It means that the approximation algorithm calculates

    the average lengths of shortest-paths of all vertices in O( logn2 (n log n+m)) timewithin an additive error of with probability at least 1 1n , i.e., with highprobability (w.h.p.).

    3 Ranking algorithms

    Our top-k ranking algorithm is based on the approximation algorithm as wellas the exact algorithm. The idea is to first use the approximation algorithmwith ` samp