ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages


Matteo Riondato and Eli Upfal, 22nd ACM SIGKDD Conference, August 2016


Murata Lab - Paper reading seminar

Presented by: Kaushalya Madhawa (25th November 2016)

OUTLINE

1. INTRODUCTION

2. RANDOM SAMPLING FOR APPROXIMATIONS

3. STATISTICAL LEARNING THEORY

‣ representativeness of a sample

‣ Rademacher averages

4. EXPERIMENTS AND RESULTS


BETWEENNESS CENTRALITY (BC)

▸ unweighted graph G = (V, E)

▸ n = |V|, m = |E|

$$b(w) = \frac{1}{|V|(|V|-1)} \sum_{(u,v) \in V \times V} \frac{\sigma_{uv}(w)}{\sigma_{uv}}$$

▸ σ_uv(w): number of shortest paths from u to v passing through w

▸ σ_uv: total number of shortest paths from u to v


▸ the fastest exact algorithm for computing betweenness runs in O(nm) time [Brandes 2001]

▸ it requires O(n + m) space
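The exact computation can be sketched as follows. This is a minimal illustration in the style of Brandes' accumulation, assuming an unweighted, undirected graph stored as an adjacency dict with at least two vertices; the function name `brandes_bc` and the graph representation are choices made here, not taken from the slides. The result is normalized by n(n − 1), matching the definition of b(w) used in this talk.

```python
from collections import deque


def brandes_bc(adj):
    """Exact betweenness b(w) for an unweighted graph given as {node: [neighbors]}.

    One BFS per source vertex, followed by a backward accumulation of
    pair dependencies. Requires at least two vertices.
    """
    n = len(adj)
    bc = {w: 0.0 for w in adj}
    for s in adj:
        dist = {s: 0}
        sigma = {w: 0 for w in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {w: [] for w in adj}  # predecessors on shortest paths from s
        order = []                    # vertices in non-decreasing distance
        queue = deque([s])
        while queue:
            x = queue.popleft()
            order.append(x)
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
                if dist[y] == dist[x] + 1:
                    sigma[y] += sigma[x]
                    preds[y].append(x)
        # Accumulate dependencies from the farthest vertices back to s.
        delta = {w: 0.0 for w in adj}
        for y in reversed(order):
            for x in preds[y]:
                delta[x] += sigma[x] / sigma[y] * (1.0 + delta[y])
            if y != s:
                bc[y] += delta[y]
    # Normalize over the n(n-1) ordered pairs, as in the definition of b(w).
    return {w: bc[w] / (n * (n - 1)) for w in adj}
```

On the path 0 - 1 - 2, only the ordered pairs (0, 2) and (2, 0) pass through vertex 1, so `brandes_bc({0: [1], 1: [0, 2], 2: [1]})[1]` is 2/6.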


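APPROXIMATE BC FOR LARGE NETWORKS

▸ approximation methods use random sampling to estimate betweenness centrality with an acceptable accuracy

▸ problem definition

▸ given ε, δ ∈ (0, 1), an (ε, δ)-approximation to B = {b(w), w ∈ V} is a collection B̃ = {b̃(w), w ∈ V} such that

$$\Pr\left( \exists\, w \in V : |\tilde{b}(w) - b(w)| > \varepsilon \right) \le \delta$$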

CONTRIBUTIONS OF THIS PAPER

▸ progressive-sampling-based BC approximation within an additive error ε

▸ first BC approximation algorithm that does not depend on any global property of the graph

▸ in contrast, the RK algorithm [Riondato and Kornaropoulos 2016] depends on the vertex diameter of the graph

RANDOM SAMPLING TO APPROXIMATE BETWEENNESS

PROGRESSIVE SAMPLING

▸ What is a good stopping condition?

▸ guarantees that the computed approximation fulfills the desired quality properties

▸ can be evaluated efficiently

▸ is tight (satisfied at small sample sizes)

▸ Determining sampling schedule

▸ minimize the number of iterations that are needed before the stopping condition is satisfied
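The loop structure behind progressive sampling can be sketched as below. This is only an illustration under stated assumptions: it estimates the mean of values in [0, 1], uses a Hoeffding radius as a stand-in stopping condition (not the Rademacher-based condition of the paper), and grows the sample geometrically. The names `progressive_mean`, `hoeffding_radius`, `initial_size`, and `growth` are hypothetical.

```python
import math
import random


def hoeffding_radius(sample_size, delta):
    """Two-sided Hoeffding deviation bound for the mean of [0, 1] values."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * sample_size))


def progressive_mean(draw, epsilon, delta, initial_size=100, growth=1.5):
    """Grow the sample until the stopping condition holds.

    draw: zero-argument callable returning one value in [0, 1].
    Returns (estimate, final sample size, number of condition checks).
    """
    sample = []
    target = initial_size
    iterations = 0
    while True:
        iterations += 1
        # Extend the current sample up to the size set by the schedule.
        while len(sample) < target:
            sample.append(draw())
        # Stopping condition: stand-in bound, cheap to evaluate.
        if hoeffding_radius(len(sample), delta) <= epsilon:
            return sum(sample) / len(sample), len(sample), iterations
        # Sampling schedule: geometric growth (an assumption of this sketch).
        target = math.ceil(growth * target)


rng = random.Random(0)
estimate, size, checks = progressive_mean(
    lambda: float(rng.random() < 0.3), epsilon=0.05, delta=0.1
)
```

Because the Hoeffding radius depends only on the sample size, the final size here is fixed in advance. A data-dependent condition, such as the Rademacher-based one discussed later, is what allows stopping early on favorable samples.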


RECAP OF STATISTICAL LEARNING THEORY

▸ A training set S is called ε-representative (w.r.t. domain Z, hypothesis class H, loss function l, and distribution D) if

$$\sup_{h \in H} |L_D(h) - L_S(h)| \le \varepsilon$$

▸ given f ∈ F, the true error and the empirical error on S = (z_1, ..., z_m) are

$$L_D(f) = \mathbb{E}_{z \sim D}[f(z)], \qquad L_S(f) = \frac{1}{m} \sum_{i=1}^{m} f(z_i)$$

▸ representativeness of sample S with respect to F is defined as the largest gap between the true error of a function f and its empirical error

$$\mathrm{Rep}_D(F, S) = \sup_{f \in F} \left( L_D(f) - L_S(f) \right)$$

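REPRESENTATIVENESS OF A SAMPLE

▸ how to estimate the representativeness of S using a single sample?

▸ split S into two disjoint halves, S = S_1 ∪ S_2, and use the largest gap between their empirical errors as the estimate

$$\sup_{f \in F} \left( L_{S_1}(f) - L_{S_2}(f) \right)$$

▸ with σ = (σ_1, ..., σ_m) ∈ {±1}^m marking the half each z_i belongs to (σ_i = +1 for S_1, σ_i = −1 for S_2), the same estimate is

$$\frac{2}{m} \sup_{f \in F} \sum_{i=1}^{m} \sigma_i f(z_i)$$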
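The step from the two-sample gap to the σ form can be written out as follows, assuming S is split into two halves S_1 and S_2 of m/2 points each (the equal split is an assumption made here to get the factor 2/m), with σ_i = +1 when z_i ∈ S_1 and σ_i = −1 when z_i ∈ S_2:

```latex
\begin{aligned}
L_{S_1}(f) - L_{S_2}(f)
  &= \frac{1}{m/2} \sum_{z_i \in S_1} f(z_i) - \frac{1}{m/2} \sum_{z_i \in S_2} f(z_i) \\
  &= \frac{2}{m} \Big( \sum_{i : \sigma_i = +1} f(z_i) - \sum_{i : \sigma_i = -1} f(z_i) \Big) \\
  &= \frac{2}{m} \sum_{i=1}^{m} \sigma_i f(z_i)
\end{aligned}
```

Taking the supremum over f ∈ F on both sides gives the single-sample estimate of representativeness.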

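RADEMACHER AVERAGE

‣ the Rademacher complexity captures this idea by taking the expectation of the estimate above over a random choice of σ

$$F \circ S = \{ (f(z_1), \ldots, f(z_m)) : f \in F \}$$

$$R(F \circ S) = \frac{1}{m} \mathbb{E}_{\sigma \sim \{\pm 1\}^m} \left[ \sup_{f \in F} \sum_{i=1}^{m} \sigma_i f(z_i) \right]$$

‣ the σ_i are distributed i.i.d. according to P[σ_i = 1] = P[σ_i = −1] = 0.5

‣ with probability at least 1 − δ, for every f ∈ F (where c bounds |f(z)|)

$$L_D(f) - L_S(f) \le 2\, \mathbb{E}_{S' \sim D^m} R(F \circ S') + c \sqrt{\frac{2 \ln(2/\delta)}{m}}$$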
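For a finite family, the Rademacher average of F∘S can be computed directly from its definition. The sketch below assumes F∘S is given as a list of equal-length vectors (f(z_1), ..., f(z_m)); the function names are hypothetical. The exact version enumerates all 2^m sign vectors and is only usable for tiny m; the Monte Carlo version samples σ instead.

```python
import itertools
import random


def sup_correlation(vectors, sigma):
    """sup over f in F of sum_i sigma_i * f(z_i), for a finite family."""
    return max(sum(s * x for s, x in zip(sigma, v)) for v in vectors)


def rademacher_exact(vectors):
    """Exact R(F o S): average the supremum over all 2^m sign vectors."""
    m = len(vectors[0])
    total = sum(
        sup_correlation(vectors, sigma)
        for sigma in itertools.product((-1, 1), repeat=m)
    )
    return total / (2 ** m) / m


def rademacher_monte_carlo(vectors, trials, rng):
    """Estimate R(F o S) by drawing `trials` random sign vectors."""
    m = len(vectors[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += sup_correlation(vectors, sigma)
    return total / trials / m
```

For F∘S = {(1, 0), (0, 1)} the supremum is max(σ_1, σ_2), which equals 1 for three of the four sign vectors and −1 for the last, so the exact value is (1/2)(1/2) = 0.25. A family with a single vector has Rademacher average 0.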

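BACK TO BC

‣ for each node w, f_w(u, v) = σ_uv(w) / σ_uv is the fraction of shortest paths from u to v going through w

‣ with D the set of ordered pairs (u, v) ∈ V × V, u ≠ v, sampled uniformly, the true error of f_w is the betweenness of w

$$L_D(f_w) = \frac{1}{|D|} \sum_{(u,v) \in V \times V,\ u \ne v} \frac{\sigma_{uv}(w)}{\sigma_{uv}} = b(w)$$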
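The identity L_D(f_w) = b(w) suggests estimating b(w) by averaging f_w over sampled pairs. The sketch below assumes an unweighted, undirected graph stored as an adjacency dict and computes f_w(u, v) with one BFS from each endpoint; the function names are hypothetical, and this is not the sampling procedure of the paper, only an illustration of the estimator L_S(f_w).

```python
import random
from collections import deque


def bfs_counts(adj, source):
    """BFS distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                sigma[y] = 0
                queue.append(y)
            if dist[y] == dist[x] + 1:
                sigma[y] += sigma[x]
    return dist, sigma


def pair_fractions(adj, u, v):
    """Return {w: f_w(u, v)} for every internal vertex w of a shortest u-v path."""
    dist_u, sigma_u = bfs_counts(adj, u)
    if v not in dist_u:
        return {}  # v unreachable from u: no shortest paths
    dist_v, sigma_v = bfs_counts(adj, v)  # undirected graph: counts are symmetric
    return {
        w: sigma_u[w] * sigma_v[w] / sigma_u[v]
        for w in dist_u
        if w not in (u, v) and w in dist_v
        and dist_u[w] + dist_v[w] == dist_u[v]
    }


def estimate_bc(adj, pairs):
    """Empirical average L_S(f_w) of f_w over the given (u, v) pairs."""
    est = {w: 0.0 for w in adj}
    for u, v in pairs:
        for w, frac in pair_fractions(adj, u, v).items():
            est[w] += frac
    return {w: total / len(pairs) for w, total in est.items()}


def sample_pairs(adj, size, rng):
    """Draw `size` ordered pairs (u, v), u != v, uniformly at random."""
    nodes = list(adj)
    return [tuple(rng.sample(nodes, 2)) for _ in range(size)]
```

Passing every ordered pair to `estimate_bc` reproduces b(w) exactly; passing the output of `sample_pairs` gives the sample estimate whose deviation from b(w) the Rademacher bound controls.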

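RADEMACHER AVERAGE: HOW TO CALCULATE?

▸ calculation is not straightforward and can be time consuming

$$R(F \circ S) = \frac{1}{m} \mathbb{E}_{\sigma \sim \{\pm 1\}^m} \left[ \sup_{f \in F} \sum_{i=1}^{m} \sigma_i f(z_i) \right]$$

▸ an upper bound to the Rademacher average is used in place of R(F∘S)

$$R(F \circ S) \le \min_{s \in \mathbb{R}^+} \omega(s)$$

$$\omega(s) = \frac{1}{s} \ln \sum_{\mathbf{v} \in \mathcal{V}_S} \exp\left( \frac{s^2 \|\mathbf{v}\|^2}{2 m^2} \right)$$

$$\mathbf{v}_w = (f_w(u_1, v_1), \ldots, f_w(u_m, v_m))$$

$$\mathcal{V}_S = \{ \mathbf{v}_w, w \in V \} \quad (|\mathcal{V}_S| \le |V|)$$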
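A minimal sketch of evaluating this bound is given below. It assumes the vectors v_w are available as a list of equal-length sequences, removes duplicates (the sum runs over a set), and searches a geometric grid of s values rather than using a convex optimizer. Since ω(s) bounds the Rademacher average for every s > 0, a grid minimum is still a valid, if slightly loose, bound. The function names and the grid range are assumptions of this sketch.

```python
import math


def omega(s, sq_norms, m):
    """omega(s) = (1/s) * ln(sum over v of exp(s^2 * ||v||^2 / (2 m^2)))."""
    exponents = [s * s * q / (2.0 * m * m) for q in sq_norms]
    # log-sum-exp with the largest exponent factored out, to avoid overflow
    top = max(exponents)
    log_sum = top + math.log(sum(math.exp(e - top) for e in exponents))
    return log_sum / s


def rademacher_upper_bound(vectors, grid_points=2001):
    """Smallest omega(s) over a geometric grid of s in [1e-4, 1e6]."""
    m = len(vectors[0])
    distinct = set(map(tuple, vectors))
    sq_norms = [sum(x * x for x in v) for v in distinct]
    best = math.inf
    for k in range(grid_points):
        s = 10.0 ** (-4.0 + 10.0 * k / (grid_points - 1))
        best = min(best, omega(s, sq_norms, m))
    return best
```

When all k distinct vectors have the same squared norm q, ω(s) = ln(k)/s + s·q/(2m²), whose minimum is sqrt(2·q·ln k)/m. That closed form is a convenient sanity check for the grid search.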

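STOPPING CONDITION OF BC CALCULATION

▸ a tighter upper bound Δ_S on the maximum deviation is calculated from ω* = min_s ω(s) [Oneto et al. 2013]

▸ ℓ and m both denote the sample size |S|

$$\Delta_S = \frac{\omega^*}{1 - \alpha} + \frac{\ln(2/\delta)}{2 \ell \alpha (1 - \alpha)} + \sqrt{\frac{\ln(2/\delta)}{2m}}$$

$$\alpha = \frac{\ln(2/\delta)}{\ln(2/\delta) + \sqrt{\left( 2 \ell R(F \circ S) + \ln(2/\delta) \right) \ln(2/\delta)}}$$

▸ stopping condition

$$\Delta_S \le \varepsilon$$

‣ when this holds, the current collection of estimated betweenness values is returned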
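The stopping check can be sketched as below. It rests on two assumptions that should be checked against the paper: the garbled slide formula is read with square roots over the last term of Δ_S and over the product in the denominator of α, and the computable bound ω* is substituted for R(F∘S) inside α. The function names are hypothetical.

```python
import math


def deviation_bound(omega_star, sample_size, delta):
    """Delta_S under the reading of the slide formula described above.

    omega_star: upper bound on the Rademacher average, used for both
    the leading term and (as a substitute for R(F o S)) inside alpha.
    """
    log_term = math.log(2.0 / delta)
    alpha = log_term / (
        log_term
        + math.sqrt((2.0 * sample_size * omega_star + log_term) * log_term)
    )
    return (
        omega_star / (1.0 - alpha)
        + log_term / (2.0 * sample_size * alpha * (1.0 - alpha))
        + math.sqrt(log_term / (2.0 * sample_size))
    )


def should_stop(omega_star, sample_size, delta, epsilon):
    """Stopping condition: Delta_S <= epsilon."""
    return deviation_bound(omega_star, sample_size, delta) <= epsilon
```

With ω* = 0.01, 10,000 samples, and δ = 0.1, this evaluates to roughly 0.025, so the check passes for ε = 0.03 and fails for ε = 0.02.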

SAMPLING SCHEDULE

▸ initial sample size determined by [equation not preserved]

▸ next sample size (|S_{i+1}|) is calculated assuming that ω*, which is computed on S_i and is an upper bound to R(F∘S_i), is also an upper bound to R(F∘S_{i+1})

DYNAMIC GRAPH BC APPROXIMATION (ABRA-D)

▸ vertex and edge insertions and deletions are allowed

▸ two data structures introduced by Hayashi et al. [2015] are used

▸ Hypergraph sketch: weighted hyperedge representation of shortest paths

▸ Two-ball index: efficiently detects the parts of the hypergraph sketch that need to be modified

EXPERIMENTAL EVALUATION

▸ performance measured using

▸ runtime

▸ sample size

▸ accuracy

▸ algorithms compared

▸ BA [Brandes 2001] - exact algorithm

▸ RK [Riondato and Kornaropoulos 2016]


EXPERIMENTAL RESULTS

▸ δ is fixed to 0.1

▸ given the logarithmic dependence of the sample size on δ, its impact on the results is limited

REFERENCES

[1] U. Brandes. A faster algorithm for betweenness centrality. J. Math. Sociol., 25(2):163–177, 2001. doi: 10.1080/0022250X.2001.9990249.

[2] M. Riondato and E. M. Kornaropoulos. Fast approximation of betweenness centrality through sampling. Data Mining and Knowledge Discovery, 30(2):438–475, 2015. ISSN 1573-756X. doi: 10.1007/s10618-015-0423-0.

[3] T. Hayashi, T. Akiba, and Y. Yoshida. Fully dynamic betweenness centrality maintenance on massive networks. Proceedings of the VLDB Endowment, 9(2), 2015

[4] L. Oneto, A. Ghio, D. Anguita, and S. Ridella. An improved analysis of the Rademacher data-dependent bound using its self bounding property. Neural Networks, 44:107–111, 2013.
