
Parallel Graph Algorithms

Kamesh Madduri

[email protected]

Page 2: Talk Outline

• Applications
• Parallel algorithm building blocks
  – Kernels
  – Data structures
• Parallel algorithm case studies
  – Connected components
  – BFS/Shortest paths
  – Betweenness centrality
• Performance on current systems
  – Software
  – Architectures
  – Performance trends

Page 3: Routing in transportation networks

Point-to-point shortest path queries on road networks: from roughly 15 seconds with a naïve algorithm down to about 10 microseconds with preprocessing.

H. Bast et al., “Fast Routing in Road Networks with Transit Nodes”, Science, 27 April 2007.

Page 4: Internet and the WWW

• The world-wide web can be represented as a directed graph
  – Web search and crawl: traversal
  – Link analysis, ranking: PageRank and HITS
  – Document classification and clustering
• Internet topologies (router networks) are naturally modeled as graphs

Page 5: Scientific Computing

• Reorderings for sparse solvers
  – Fill-reducing orderings: partitioning, traversals, eigenvectors
  – Heavy diagonal to reduce pivoting: matching
• Data structures for efficient exploitation of sparsity
• Derivative computations for optimization
  – Matroids, graph colorings, spanning trees
• Preconditioning
  – Incomplete factorizations
  – Partitioning for domain decomposition
  – Graph techniques in algebraic multigrid: independent sets, matchings, etc.
  – Support theory: spanning trees and graph embedding techniques

B. Hendrickson, “Graphs and HPC: Lessons for Future Architectures”, http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf

Page 6: Large-scale data analysis

• Graph abstractions are very useful for analyzing complex data sets.
• Sources of data: petascale simulations, experimental devices, the Internet, sensor networks
• Challenges: data size, heterogeneity, uncertainty, data quality

Examples: astrophysics (massive datasets, temporal variations), bioinformatics (data quality, heterogeneity), social informatics (new analytics challenges, data uncertainty).

Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com

Page 7: Data Analysis and Graph Algorithms in Systems Biology

• Study of the interactions between various components in a biological system
• Graph-theoretic formulations are pervasive:
  – Predicting new interactions: modeling
  – Functional annotation of novel proteins: matching, clustering
  – Identifying metabolic pathways: paths, clustering
  – Identifying new protein complexes: clustering, centrality

Image source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003.

Page 8: Graph-theoretic problems in social networks

• Community identification: clustering
• Targeted advertising: centrality
• Information spreading: modeling

Image source: Nexus (Facebook application)

Page 9: Network Analysis for Intelligence and Surveillance

• [Krebs ’04] Post-9/11 terrorist network analysis from public-domain information
• Plot masterminds correctly identified from interaction patterns: centrality
• A global view of entities is often more insightful
• Detect anomalous activities by exact/approximate graph matching

Image sources: http://www.orgnet.com/hijackers.html; T. Coffman, S. Greenblatt, S. Marcus, “Graph-based technologies for intelligence analysis”, CACM 47(3), March 2004, pp. 45-47.

Page 10: Characterizing Graph-theoretic computations

Input: a graph abstraction.  Problem: find *** (paths, clusters, partitions, matchings, patterns, orderings, …).

Factors that influence the choice of algorithm:
• graph sparsity (m/n ratio)
• static/dynamic nature
• weighted/unweighted, weight distribution
• vertex degree distribution
• directed/undirected
• simple/multi/hyper graph
• problem size
• granularity of computation at nodes/edges
• domain-specific characteristics

Graph kernels: traversal, shortest path algorithms, flow algorithms, spanning tree algorithms, topological sort, …

Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations.

Page 11: Talk Outline

• Applications
• Parallel algorithm building blocks
  – Kernels
  – Data structures
• Parallel algorithm case studies
  – Connected components
  – BFS/Shortest paths
  – Betweenness centrality
• Performance on current systems
  – Software
  – Architectures
  – Performance trends

Page 12: Parallel Computing Models: A Quick PRAM Review

• Objectives
  – Bridge between software and hardware: general-purpose HW, scalable HW, transportable SW
  – Abstract architecture for algorithm development
• Why is it so important?
  – Uniprocessor: von Neumann model of computation
  – Parallel processors, multicore
• Requirements: an inherent tension
  – Simple enough to make analysis of interesting problems tractable
  – Detailed enough to reveal the important bottlenecks
• Models, e.g.:
  – PRAM: rich collection of parallel graph algorithms
  – BSP: some CGM algorithms (cgmGraph)
  – LogP

Page 13: PRAM

• An idealized model of a parallel computer for analyzing the efficiency of parallel algorithms.
• A PRAM is composed of:
  – P unmodifiable programs, each composed of optionally labeled instructions
  – a single shared memory composed of a sequence of words, each capable of containing an arbitrary integer
  – P accumulators, one associated with each program
  – a read-only input tape
  – a write-only output tape
• No local memory in each RAM.
• Synchronization, communication, and parallel overhead are zero.

Page 14: PRAM Data Access Forms

• EREW (Exclusive Read, Exclusive Write)
  – A memory cell can be read or written by at most one processor per cycle.
  – Ensures no read or write conflicts.
• CREW (Concurrent Read, Exclusive Write)
  – Ensures there are no write conflicts.
• CRCW (Concurrent Read, Concurrent Write)
  – Requires use of some conflict resolution scheme.

Page 15: PRAM Pros and Cons

• Pros
  – Simple and clean semantics.
  – The majority of theoretical parallel algorithms are specified with the PRAM model.
  – Independent of the communication network topology.
• Cons
  – Not realistic: the communication model is far too powerful.
  – The algorithm designer is misled into using inter-processor communication without hesitation.
  – Synchronized processors.
  – No local memory.
  – Big-O notation is often misleading.

Page 16: Analyzing Parallel Graph Algorithms

• Problem parameters: n (vertices), m (edges), D (graph diameter)
• Worst-case running time: T
• Total number of operations (work): W
• Nick's Class (NC): complexity class of problems that can be solved in poly-logarithmic time using a polynomial number of processors
• P-complete: inherently sequential

Page 17: The Helman-JaJa Model

• An extension of the PRAM model for shared-memory algorithm design and analysis.
• T(n, p) is measured by the triplet TM(n, p), TC(n, p), B(n, p):
  – TM(n, p): maximum number of non-contiguous main memory accesses required by any processor
  – TC(n, p): upper bound on the maximum local computational complexity of any of the processors
  – B(n, p): number of barrier synchronizations

Page 18: Building blocks of classical PRAM graph algorithms

• Prefix sums
• List ranking
  – Euler tours, pointer jumping, symmetry breaking
• Sorting
• Tree contraction

Page 19: Prefix Sums

• Input: A, an array of n elements; an associative binary operation ⊕
• Output: the prefix sums A(1) ⊕ A(2) ⊕ … ⊕ A(i), for 1 ≤ i ≤ n

[Figure: balanced binary tree of partial sums B(h, i) (upward sweep) and prefix values C(h, i) (downward sweep) over the eight input elements.]

O(n) work, O(log n) time, n processors
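A work-efficient shared-memory formulation splits the array into one block per thread, scans each block locally, scans the p block totals, and then adds each block's offset back in. Below is a minimal C/OpenMP sketch of this idea; the function and variable names are illustrative, and + stands in for the generic associative operator ⊕.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Inclusive prefix sums of a[0..n-1] into out[0..n-1], using + as the
 * associative operator.  Each thread scans one contiguous block, the
 * per-block totals are scanned serially (only p values), and each block
 * is then offset by the sum of all preceding blocks. */
void prefix_sums(const long *a, long *out, long n) {
    int p = omp_get_max_threads();
    long *block_sum = calloc(p + 1, sizeof(long));
    #pragma omp parallel num_threads(p)
    {
        int t = omp_get_thread_num();
        long lo = n * t / p, hi = n * (t + 1) / p, s = 0;
        for (long i = lo; i < hi; i++) { s += a[i]; out[i] = s; }
        block_sum[t + 1] = s;
        #pragma omp barrier
        #pragma omp single
        for (int i = 1; i <= p; i++) block_sum[i] += block_sum[i - 1];
        /* implicit barrier after the single region */
        for (long i = lo; i < hi; i++) out[i] += block_sum[t];
    }
    free(block_sum);
}

int main(void) {
    long a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];
    prefix_sums(a, out, 8);
    for (int i = 0; i < 8; i++) printf("%ld ", out[i]);   /* 1 3 6 ... 36 */
    printf("\n");
    return 0;
}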

Page 20: Parallel Prefix

• X: array of n elements stored in arbitrary order.
• For each element i, let X(i).value be its value and X(i).next be the index of its successor.
• For a binary associative operator ⊕, compute X(i).prefix such that
  – X(head).prefix = X(head).value, and
  – X(i).prefix = X(i).value ⊕ X(predecessor).prefix
  where head is the first element, i is not equal to head, and predecessor is the node preceding i.
• List ranking: special case of parallel prefix with all values initially set to 1 and addition as the associative operator.
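List ranking can also be expressed with the pointer-jumping kernel listed among the building blocks earlier. A hedged C/OpenMP sketch is below; it is double-buffered so each parallel round behaves like one synchronous PRAM step, and the example list in main is made up for illustration (not the one used on the next slide).

#include <stdio.h>
#include <string.h>
#include <omp.h>

#define N 8

/* List ranking by pointer jumping.  next_in[i] is the successor index,
 * or -1 for the tail.  rank[i] ends up as the distance from i to the tail. */
void list_rank(const int next_in[N], int rank[N]) {
    int nxt[N], nxt2[N], rnk2[N];
    memcpy(nxt, next_in, sizeof nxt);
    for (int i = 0; i < N; i++) rank[i] = (nxt[i] == -1) ? 0 : 1;

    for (int round = 0; round < 32; round++) {   /* ~log2(N) rounds needed; 32 is a safe cap */
        int done = 1;
        /* Read old arrays, write new ones: one synchronous jumping step. */
        #pragma omp parallel for reduction(&&: done)
        for (int i = 0; i < N; i++) {
            if (nxt[i] != -1) {
                rnk2[i] = rank[i] + rank[nxt[i]];
                nxt2[i] = nxt[nxt[i]];
                done = done && (nxt2[i] == -1);
            } else {
                rnk2[i] = rank[i];
                nxt2[i] = -1;
            }
        }
        memcpy(rank, rnk2, sizeof rnk2);
        memcpy(nxt, nxt2, sizeof nxt2);
        if (done) break;
    }
}

int main(void) {
    /* Hypothetical 8-element list: 3 -> 1 -> 6 -> 0 -> 5 -> 2 -> 7 -> 4 */
    int next[N] = {5, 6, 7, 1, -1, 2, 0, 4};
    int rank[N];
    list_rank(next, rank);
    for (int i = 0; i < N; i++) printf("rank[%d] = %d\n", i, rank[i]);
    return 0;
}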

Page 21: List Ranking Illustration

• Ordered list (X.next values): 2 3 4 5 6 7 8 9
• Random list (X.next values): 4 6 5 7 8 3 2 9

Page 22: List Ranking Key Idea

1. Chop X randomly into s pieces.
2. Traverse each piece using a serial algorithm.
3. Compute the global rank of each element using the result computed in the second step.

• Locality (list ordering) determines performance.
• In the Helman-JaJa model, TM(n, p) = O(n/p).

Page 23: An Example Higher-level Algorithm

Tarjan-Vishkin biconnected components algorithm: O(log n) time, O(m+n) processors.

1. Compute a spanning tree T for the input graph G.
2. Compute an Eulerian circuit for T.
3. Root the tree at an arbitrary vertex.
4. Preorder numbering of all the vertices.
5. Label edges using the vertex numbering.
6. Connected components using the Shiloach-Vishkin algorithm.

Page 24: Data structures: graph representation

• Dense graphs (m = O(n²)): adjacency matrix commonly used.
• Sparse graphs: adjacency lists, similar to the CSR sparse matrix format.
• Dynamic sparse graphs: we need to support edge and vertex membership queries, insertions, and deletions.
  – should be space-efficient, with low synchronization overhead
• Several different representations possible:
  – Resizable adjacency arrays
  – Adjacency arrays, sorted by vertex identifiers
  – Adjacency arrays for low-degree vertices, heap-based structures for high-degree vertices (for sparse graphs with skewed degree distributions)
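As a concrete reference point for the adjacency-array (CSR-like) representation, here is a small C sketch that builds it from an edge list. The struct layout and function name are illustrative choices, not taken from SNAP or the slides. Note the degree-counting pass followed by a prefix sum over the degrees, the same kernel introduced earlier.

#include <stdio.h>
#include <stdlib.h>

/* CSR-style adjacency arrays: the neighbors of vertex v are
 * adj[xadj[v] .. xadj[v+1]-1].  Each undirected edge (u, v) is stored
 * in both directions. */
typedef struct {
    int n;        /* number of vertices             */
    long m;       /* number of directed arcs stored */
    long *xadj;   /* n+1 offsets into adj[]         */
    int  *adj;    /* concatenated adjacency lists   */
} Graph;

Graph build_csr(int n, long nedges, int (*edges)[2]) {
    Graph g = { n, 2 * nedges,
                calloc(n + 1, sizeof(long)),
                malloc(2 * nedges * sizeof(int)) };
    for (long e = 0; e < nedges; e++) {          /* count degrees */
        g.xadj[edges[e][0] + 1]++;
        g.xadj[edges[e][1] + 1]++;
    }
    for (int v = 0; v < n; v++)                  /* prefix sums -> offsets */
        g.xadj[v + 1] += g.xadj[v];
    long *pos = malloc((n + 1) * sizeof(long));
    for (int v = 0; v <= n; v++) pos[v] = g.xadj[v];
    for (long e = 0; e < nedges; e++) {          /* fill adjacencies */
        int u = edges[e][0], v = edges[e][1];
        g.adj[pos[u]++] = v;
        g.adj[pos[v]++] = u;
    }
    free(pos);
    return g;
}

int main(void) {
    int edges[4][2] = { {0, 1}, {0, 2}, {1, 2}, {2, 3} };  /* toy graph */
    Graph g = build_csr(4, 4, edges);
    for (int v = 0; v < g.n; v++) {
        printf("%d:", v);
        for (long i = g.xadj[v]; i < g.xadj[v + 1]; i++) printf(" %d", g.adj[i]);
        printf("\n");
    }
    free(g.xadj); free(g.adj);
    return 0;
}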

Page 25: Data structures in (Parallel) Graph Algorithms

• A wide range of ADTs is used in graph algorithms: array, list, queue, stack, set, multiset, tree
• ADT implementations are typically array-based for performance reasons.
• Key data structure considerations in parallel graph algorithm design:
  – Practical parallel priority queues
  – Space efficiency
  – Parallel set/multiset operations, e.g., union, intersection, etc.

Page 26: Talk Outline

• Applications
• Parallel algorithm building blocks
  – Kernels
  – Data structures
• Parallel algorithm case studies
  – Connected components
  – BFS/Shortest paths
  – Betweenness centrality
• Performance on current systems
  – Software
  – Architectures
  – Performance trends

Page 27: Connected Components

• Building block for many graph algorithms
  – Minimum spanning tree, spanning tree, planarity testing, etc.
• Representative of the “graft-and-shortcut” approach
• CRCW PRAM algorithms
  – [Shiloach & Vishkin ’82]: O(log n) time, O((m+n) log n) work
  – [Gazit ’91]: randomized, optimal, O(log n) time
• CREW algorithms
  – [Han & Wagner ’90]: O(log² n) time, O((m + n log n) log n) work

Page 28: Shiloach-Vishkin algorithm

• Input: n isolated vertices and m PRAM processors.
• Each processor Pi grafts a tree rooted at vertex vi onto the tree containing one of its neighbors u, under the constraint u < vi.
• Grafting creates k ≥ 1 connected subgraphs; each subgraph is then shortcut so that the depth of its tree is at least halved.
• Repeat grafting and shortcutting until no more grafting is possible.
• Runs on an arbitrary CRCW PRAM in O(log n) time with O(m) processors.
• Helman-JaJa model: TM = (3m/p + 2) log n, B = 4 log n.

Page 29: SV pseudo-code

• Input: (1) a set of m edges (i, j) given in arbitrary order; (2) array D[1..n] with D[i] = i.
• Output: array D[1..n] with D[i] being the component to which vertex i belongs.

begin
  while true do
    1. for (i, j) ∈ E in parallel do
         if D[i] = D[D[i]] and D[j] < D[i] then D[D[i]] = D[j];
    2. for (i, j) ∈ E in parallel do
         if i belongs to a star and D[j] ≠ D[i] then D[D[i]] = D[j];
    3. if all vertices are in rooted stars then exit;
    for all i in parallel do
      D[i] = D[D[i]]
end
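A serial C sketch of the graft-and-shortcut iteration may make the pseudo-code concrete. This simplified variant drops the star test and simply repeats grafting and shortcutting until nothing changes; on a CRCW PRAM both inner loops would run fully in parallel.

#include <stdio.h>

/* Simplified serial sketch of graft-and-shortcut connected components
 * (a variant of Shiloach-Vishkin without the star test).  D[i] converges
 * to the smallest vertex id in i's component. */
void connected_components(int n, int m, int (*edge)[2], int *D) {
    for (int i = 0; i < n; i++) D[i] = i;
    int changed = 1;
    while (changed) {
        changed = 0;
        /* Graft: hook the root of i's tree onto a smaller-labelled neighbor. */
        for (int e = 0; e < m; e++) {
            int i = edge[e][0], j = edge[e][1];
            if (D[i] == D[D[i]] && D[j] < D[i]) { D[D[i]] = D[j]; changed = 1; }
            if (D[j] == D[D[j]] && D[i] < D[j]) { D[D[j]] = D[i]; changed = 1; }
        }
        /* Shortcut: pointer jumping halves tree depth. */
        for (int i = 0; i < n; i++)
            if (D[i] != D[D[i]]) { D[i] = D[D[i]]; changed = 1; }
    }
}

int main(void) {
    /* Two components: {0,1,2,3} and {4,5}. */
    int edge[4][2] = { {3, 1}, {1, 0}, {2, 3}, {5, 4} };
    int D[6];
    connected_components(6, 4, edge, D);
    for (int i = 0; i < 6; i++) printf("D[%d] = %d\n", i, D[i]);
    return 0;
}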

Page 30: SV Illustration

[Figure: a 4-vertex input graph. In the first iteration, grafting merges vertex pairs (1,4) and (2,3) and shortcutting flattens the resulting trees; a second iteration joins the remaining trees into a single component rooted at vertex 1.]

Page 31: Talk Outline

• Applications
• Parallel algorithm building blocks
  – Kernels
  – Data structures
• Parallel algorithm case studies
  – Connected components
  – BFS/Shortest paths
  – Betweenness centrality
• Performance on current systems
  – Software
  – Architectures
  – Performance trends

Page 32: Parallel Single-source Shortest Paths (SSSP) algorithms

• No known PRAM algorithm runs in sub-linear time and O(m + n log n) work
• Parallel priority queues: relaxed heaps [DGST88], [BTZ98]
• Ullman-Yannakakis randomized approach [UY90]
• Meyer and Sanders' ∆-stepping algorithm [MS03]
• Distributed-memory implementations based on graph partitioning
• Heuristics for load balancing and termination detection

K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of a Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances”, Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.

Page 33: ∆-stepping algorithm [MS03]

• Label-correcting algorithm: can relax edges from unsettled vertices as well
• ∆-stepping: an “approximate bucket implementation of Dijkstra's algorithm”
• ∆: bucket width
• Vertices are ordered using buckets, each representing a priority range of size ∆
• Each bucket may be processed in parallel
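To make the bucket mechanics concrete, here is a sequential C sketch of ∆-stepping. Buckets are identified implicitly by rescanning tentative distances, which keeps the sketch short but is not how a tuned (or parallel) implementation would store them; the function names and the toy graph in main are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define INF 1e30

typedef struct { int to; double w; } Arc;
typedef struct { int n; int *xadj; Arc *arc; } Graph;   /* CSR-style */

/* Sequential sketch of ∆-stepping.  Light edges (w <= delta) are relaxed
 * repeatedly while the current bucket refills; heavy edges are relaxed once
 * per bucket.  A vertex's bucket index is floor(d[v]/delta). */
void delta_stepping(const Graph *g, int s, double delta, double *d) {
    int n = g->n;
    int *settled = calloc(n, sizeof(int));
    int *in_S    = calloc(n, sizeof(int));
    double *proc = malloc(n * sizeof(double));   /* d value last expanded with */
    for (int v = 0; v < n; v++) { d[v] = INF; proc[v] = INF; }
    d[s] = 0.0;
    for (;;) {
        long bmin = -1;                          /* smallest non-empty bucket */
        for (int v = 0; v < n; v++)
            if (!settled[v] && d[v] < INF) {
                long b = (long)floor(d[v] / delta);
                if (bmin < 0 || b < bmin) bmin = b;
            }
        if (bmin < 0) break;
        int again = 1;
        while (again) {                          /* light-edge phases */
            again = 0;
            for (int v = 0; v < n; v++) {
                if (settled[v] || d[v] >= INF) continue;
                if ((long)floor(d[v] / delta) != bmin || d[v] >= proc[v]) continue;
                proc[v] = d[v]; in_S[v] = 1; again = 1;
                for (int i = g->xadj[v]; i < g->xadj[v + 1]; i++)
                    if (g->arc[i].w <= delta && d[v] + g->arc[i].w < d[g->arc[i].to])
                        d[g->arc[i].to] = d[v] + g->arc[i].w;
            }
        }
        for (int v = 0; v < n; v++) {            /* heavy-edge phase, then settle S */
            if (!in_S[v]) continue;
            for (int i = g->xadj[v]; i < g->xadj[v + 1]; i++)
                if (g->arc[i].w > delta && d[v] + g->arc[i].w < d[g->arc[i].to])
                    d[g->arc[i].to] = d[v] + g->arc[i].w;
            settled[v] = 1; in_S[v] = 0;
        }
    }
    free(settled); free(in_S); free(proc);
}

int main(void) {
    /* Toy directed graph: 0->1 (0.05), 0->2 (0.13), 1->3 (0.20), 2->3 (0.02). */
    int xadj[5] = {0, 2, 3, 4, 4};
    Arc arc[4] = { {1, 0.05}, {2, 0.13}, {3, 0.20}, {3, 0.02} };
    Graph g = {4, xadj, arc};
    double d[4];
    delta_stepping(&g, 0, 0.1, d);
    for (int v = 0; v < 4; v++) printf("d[%d] = %g\n", v, d[v]);
    return 0;
}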


Pages 35-37: ∆-stepping main steps

• Classify edges as “heavy” (weight > ∆) and “light” (weight ≤ ∆).
• Relax light edges (one phase); repeat until bucket B[i] is empty.
• Relax heavy edges; no reinsertions into B[i] in this step.

Page 38: ∆-stepping algorithm: illustration

[Figure: example directed graph on seven vertices (0-6) with edge weights between 0.01 and 0.56; the d array (indices 0-6) and the bucket array are shown alongside. ∆ = 0.1.]

One parallel phase, while the current bucket is non-empty:
  i) Inspect light edges
  ii) Construct a set of “requests” (R)
  iii) Clear the current bucket
  iv) Remember deleted vertices (S)
  v) Relax request pairs in R
Then relax heavy request pairs (from S) and go on to the next bucket.

Pages 39-46: ∆-stepping algorithm: illustration (continued)

[Animation frames on the same example with ∆ = 0.1: initialization inserts the source s into bucket 0 with d(s) = 0; successive light-edge phases generate request sets R, move expanded vertices into S, and update the d array (e.g., d[2] = 0.01, then d[1] = 0.03 and d[3] = 0.06); after the heavy-edge relaxations the final tentative distances are d = (0, 0.03, 0.01, 0.06, 0.16, 0.29, 0.62).]

Page 47: Number of phases (machine-independent performance count)

[Plot: number of phases (log scale, 10 to 1,000,000) for the graph families Rnd-rnd, Rnd-logU, Scale-free, LGrid-rnd, LGrid-logU, SqGrid, USAd NE, USAt NE; low-diameter families need far fewer phases than high-diameter families.]

Page 48: Average shortest path weight for various graph families

Graphs: ~2^20 vertices, 2^22 edges, directed, edge weights normalized to [0, 1].

[Plot: average shortest path weight (log scale, 0.01 to 100,000) for the graph families Rnd-rnd, Rnd-logU, Scale-free, LGrid-rnd, LGrid-logU, SqGrid, USAd NE, USAt NE.]

Page 49: Last non-empty bucket (machine-independent performance count)

[Plot: index of the last non-empty bucket (log scale, 1 to 1,000,000) for the same graph families.]

Fewer buckets, more parallelism.

Page 50: Number of bucket insertions (machine-independent performance count)

[Plot: number of bucket insertions (0 to 12,000,000) for the same graph families.]

Page 51: Talk Outline

• Applications
• Parallel algorithm building blocks
  – Kernels
  – Data structures
• Parallel algorithm case studies
  – Connected components
  – BFS/Shortest paths
  – Betweenness centrality
• Performance on current systems
  – Software
  – Architectures
  – Performance trends

Page 52: Betweenness Centrality

• Centrality: a quantitative measure to capture the importance of a vertex/edge in a graph
  – degree, closeness, eigenvalue, betweenness
• Betweenness centrality:
    BC(v) = Σ_{s ≠ v ≠ t} σ_st(v) / σ_st
  (σ_st: number of shortest paths between s and t; σ_st(v): number of those passing through v)
• Applied to several real-world networks
  – Social interactions
  – WWW
  – Epidemiology
  – Systems biology

Page 53: Algorithms for Computing Betweenness

• All-pairs shortest path approach: compute the length and number of shortest paths between all s-t pairs (O(n³) time), then sum up the fractional dependency values (O(n²) space).
• Brandes' algorithm (2001): augment single-source shortest path computations to count paths; uses the Bellman criterion; O(mn) work and O(m+n) space.
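For reference, here is a plain sequential C sketch of Brandes-style betweenness for unweighted graphs: one BFS per source to count shortest paths, followed by a dependency-accumulation sweep in reverse visit order. This is the serial baseline, not the parallel algorithm described on the following slides; the CSR layout matches the earlier representation sketch.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int n; long *xadj; int *adj; } Graph;  /* CSR, undirected */

/* Serial Brandes-style betweenness centrality for an unweighted graph. */
void betweenness(const Graph *g, double *BC) {
    int n = g->n;
    int *S = malloc(n * sizeof(int));        /* vertices in BFS visit order */
    int *D = malloc(n * sizeof(int));        /* BFS distance                */
    double *sigma = malloc(n * sizeof(double));
    double *delta = malloc(n * sizeof(double));
    for (int v = 0; v < n; v++) BC[v] = 0.0;

    for (int s = 0; s < n; s++) {
        for (int v = 0; v < n; v++) { D[v] = -1; sigma[v] = 0.0; delta[v] = 0.0; }
        D[s] = 0; sigma[s] = 1.0;
        int head = 0, tail = 0;
        S[tail++] = s;
        while (head < tail) {                /* BFS: traversal and path counting */
            int u = S[head++];
            for (long i = g->xadj[u]; i < g->xadj[u + 1]; i++) {
                int v = g->adj[i];
                if (D[v] < 0) { D[v] = D[u] + 1; S[tail++] = v; }
                if (D[v] == D[u] + 1) sigma[v] += sigma[u];
            }
        }
        for (int idx = tail - 1; idx > 0; idx--) {   /* dependency accumulation */
            int w = S[idx];
            for (long i = g->xadj[w]; i < g->xadj[w + 1]; i++) {
                int v = g->adj[i];
                if (D[v] == D[w] - 1)               /* v is a predecessor of w */
                    delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w]);
            }
            if (w != s) BC[w] += delta[w];
        }
    }
    free(S); free(D); free(sigma); free(delta);
}

int main(void) {
    /* Path graph 0-1-2-3; expect BC = 0, 4, 4, 0 over ordered (s, t) pairs. */
    long xadj[5] = {0, 1, 3, 5, 6};
    int  adj[6]  = {1, 0, 2, 1, 3, 2};
    Graph g = {4, xadj, adj};
    double BC[4];
    betweenness(&g, BC);
    for (int v = 0; v < 4; v++) printf("BC[%d] = %g\n", v, BC[v]);
    return 0;
}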

Page 54: Our New Parallel Algorithms

• Madduri, Bader (2006): parallel algorithms for computing exact and approximate betweenness centrality
  – low-diameter sparse graphs (diameter D = O(log n), m = O(n log n))
  – Exact algorithm: O(mn) work, O(m+n) space, O(nD) time (PRAM model) or O(nD + nm/p) time
• Madduri et al. (2009): new parallel algorithm with lower synchronization overhead and fewer non-contiguous memory references
  – In practice, 2-3X faster than the previous algorithm
  – Lock-free => better scalability on large parallel systems

Page 55: Parallel BC Algorithm

• Consider an undirected, unweighted graph
• High-level idea: level-synchronous parallel breadth-first search augmented to compute centrality scores
• Two steps:
  – traversal and path counting
  – dependency accumulation

    δ(v) = Σ_{w : v ∈ P(w)} σ(v)/σ(w) · (1 + δ(w))

Page 56: Parallel BC Algorithm Illustration

Data structures:
• G (size m+n): read-only, adjacency array representation
• BC (size n): centrality score of each vertex
• S (n): stack of visited vertices
• Visited (n): mark to check whether a vertex has been visited
• D (n): distance of each vertex from the source s
• Sigma (n): number of shortest paths through a vertex
• Delta (n): partial dependence score for each vertex
• P (m+n): multiset of predecessors of a vertex along shortest paths

Space requirement: 8(m + 6n) bytes

[Figure: example graph on ten vertices (0-9) used in the following illustration steps.]

Page 57: Parallel BC Algorithm Illustration

1. Traversal step: visit adjacent vertices, update distance and path counts.

[Figure: the example graph with vertex 0 marked as the source vertex.]

Page 58: Parallel BC Algorithm Illustration

1. Traversal step: visit adjacent vertices, update distance and path counts.

[Figure: state of D, S, Sigma, and P after visiting the first-level adjacencies of the source.]

Page 59: Parallel BC Algorithm Illustration

1. Traversal step: visit adjacent vertices, update distance and path counts.

Level-synchronous approach: the adjacencies of all vertices in the current frontier can be visited in parallel.

[Figure: state of D, S, Sigma, and the predecessor multisets P after expanding the next frontier.]

Page 60: Parallel BC Algorithm Illustration

1. Traversal step: at the end, we have all reachable vertices, their corresponding predecessor multisets, and D values.

Level-synchronous approach: the adjacencies of all vertices in the current frontier can be visited in parallel.

[Figure: final D, S, Sigma, and predecessor multisets for the example graph.]

Page 61: Step 1 (traversal) pseudo-code

for all vertices u at level d in parallel do
  for all adjacencies v of u in parallel do
    dv = D[v];
    if (dv < 0)                           // v is visited for the first time
      vis = fetch_and_add(&Visited[v], 1);
      if (vis == 0)                       // v is added to a stack only once
        D[v] = d + 1;
        pS[count++] = v;                  // add v to local thread stack
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Pcount[v], 1);       // add u to predecessor list of v
    if (dv == d + 1)
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Pcount[v], 1);       // add u to predecessor list of v

[Figure: small diagram with vertices u, u1, u2, v, v1, v2 and edges e1, e2 illustrating the update.]
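On a cache-based multicore, the fetch_and_add operations above map onto OpenMP atomic captures. Below is a hedged, self-contained C/OpenMP sketch of the traversal-and-path-counting step; predecessor/successor list construction is left out to keep it short, and the data layout is an assumption for illustration rather than the exact SNAP implementation.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

typedef struct { int n; long *xadj; int *adj; } Graph;  /* CSR */

/* Level-synchronous traversal with path counting (step 1 of the parallel
 * BC algorithm).  S[] receives vertices in level order; the return value
 * is the number of reachable vertices. */
long traversal_step(const Graph *g, int s, int *D, long *sigma,
                    int *Visited, int *S) {
    int n = g->n;
    for (int v = 0; v < n; v++) { D[v] = -1; sigma[v] = 0; Visited[v] = 0; }
    D[s] = 0; sigma[s] = 1; Visited[s] = 1; S[0] = s;
    long level_start = 0, tail = 1;
    for (int d = 0; level_start < tail; d++) {
        long level_end = tail;
        #pragma omp parallel for schedule(dynamic, 64)
        for (long i = level_start; i < level_end; i++) {
            int u = S[i];
            for (long e = g->xadj[u]; e < g->xadj[u + 1]; e++) {
                int v = g->adj[e], dv, vis;
                #pragma omp atomic read
                dv = D[v];
                if (dv < 0) {                       /* first time v is seen    */
                    #pragma omp atomic capture
                    vis = Visited[v]++;             /* fetch_and_add           */
                    if (vis == 0) {                 /* we claimed v            */
                        #pragma omp atomic write
                        D[v] = d + 1;
                        long slot;
                        #pragma omp atomic capture
                        slot = tail++;              /* append v to S           */
                        S[slot] = v;
                    }
                    #pragma omp atomic
                    sigma[v] += sigma[u];
                } else if (dv == d + 1) {           /* another shortest path   */
                    #pragma omp atomic
                    sigma[v] += sigma[u];
                }
            }
        }
        level_start = level_end;
    }
    return tail;
}

int main(void) {
    long xadj[5] = {0, 2, 4, 6, 8};
    int  adj[8]  = {1, 2, 0, 3, 0, 3, 1, 2};     /* 4-cycle: 0-1-3-2-0 */
    Graph g = {4, xadj, adj};
    int D[4], Visited[4], S[4];
    long sigma[4];
    long cnt = traversal_step(&g, 0, D, sigma, Visited, S);
    printf("reached %ld vertices; sigma[3] = %ld (two shortest paths)\n",
           cnt, sigma[3]);
    return 0;
}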

Page 62: Graph traversal step analysis

• Exploit concurrency in visiting adjacencies, as we assume that the graph diameter is small: O(log n)
• Upper bound on the size of each predecessor multiset: in-degree
• Potential performance bottlenecks: atomic updates to predecessor multisets, atomic increments of path counts
• New algorithm: based on the observation that we do not need to store “predecessor” vertices; instead, we store successor edges along shortest paths
  – simplifies the accumulation step
  – eliminates one atomic operation in the traversal step
  – cache-friendly!

Page 63: Modified Step 1 pseudo-code

for all vertices u at level d in parallel do
  for all adjacencies v of u in parallel do
    dv = D[v];
    if (dv < 0)                           // v is visited for the first time
      vis = fetch_and_add(&Visited[v], 1);
      if (vis == 0)                       // v is added to a stack only once
        D[v] = d + 1;
        pS[count++] = v;                  // add v to local thread stack
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Scount[u], 1);       // add v to successor list of u
    if (dv == d + 1)
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Scount[u], 1);       // add v to successor list of u

Page 64: Graph Traversal Step locality analysis

for all vertices u at level d in parallel do     // all vertices are in a contiguous block (stack)
  for all adjacencies v of u in parallel do      // all adjacencies of a vertex are stored compactly (graph rep.)
    dv = D[v];                                   // non-contiguous memory access
    if (dv < 0)
      vis = fetch_and_add(&Visited[v], 1);       // non-contiguous memory access
      if (vis == 0)
        D[v] = d + 1;
        pS[count++] = v;
      fetch_and_add(&sigma[v], sigma[u]);        // non-contiguous memory access
      fetch_and_add(&Scount[u], 1);              // store to S[u]
    if (dv == d + 1)
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Scount[u], 1);

Better cache utilization is likely if D[v], Visited[v], and sigma[v] are stored contiguously.

Page 65: Parallel BC Algorithm Illustration

2. Accumulation step: pop vertices from the stack, update dependence scores.

    δ(v) = Σ_{w : v ∈ P(w)} σ(v)/σ(w) · (1 + δ(w))

[Figure: the example graph with the stack S, the predecessor multisets P, and the Delta array.]

Page 66: Parallel BC Algorithm Illustration

2. Accumulation step: can also be done in a level-synchronous manner.

    δ(v) = Σ_{w : v ∈ P(w)} σ(v)/σ(w) · (1 + δ(w))

[Figure: the same example, with one level of the stack processed per step.]

Page 67: Step 2 (accumulation) pseudo-code

for level d = GraphDiameter to 2 do
  for all vertices w at level d in parallel do
    for all v in P[w] do
      acquire_lock(v);
      delta[v] = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];
      release_lock(v);
BC[v] = delta[v]

    δ(v) = Σ_{w : v ∈ P(w)} σ(v)/σ(w) · (1 + δ(w))

Page 68: Modified Step 2 pseudo-code (with successor lists)

for level d = GraphDiameter-2 to 1 do
  for all vertices v at level d in parallel do
    for all w in S[v] in parallel do   // reduction(delta)
      delta[v] = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];
BC[v] = delta[v]

    δ(v) = Σ_{w ∈ S(v)} σ(v)/σ(w) · (1 + δ(w))

Page 69: Accumulation step locality analysis

for level d = GraphDiameter-2 to 1 do
  for all vertices v at level d in parallel do   // all vertices are in a contiguous block (stack)
    for all w in S[v] in parallel do             // reduction(delta); each S[v] is a contiguous block
      delta_sum_v = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];   // only floating point operation in the code
    BC[v] = delta[v] = delta_sum_v;

Page 70: Digression: Centrality Analysis applied to Protein Interaction Networks

[Plot: degree vs. betweenness centrality (log-log, BC from 1e-7 to 1) for human genome core protein interactions. Highlighted outlier: Kelch-like protein 8 (Ensembl ID ENSG00000145332.2), 43 interactions.]

Page 71: Talk Outline

• Applications
• Parallel algorithm building blocks
  – Kernels
  – Data structures
• Parallel algorithm case studies
  – Connected components
  – BFS/Shortest paths
  – Betweenness centrality
• Performance on current systems
  – Software
  – Architectures
  – Performance trends

Page 72: Graph topology matters

• Information networks are very different from the graph topologies and computations that arise in scientific computing.
• Classical graph algorithms typically assume a uniform random graph topology.
• Scientific computing: static networks, Euclidean topologies. Informatics: dynamic, high-dimensional data.

Image sources: visualcomplexity.com (1, 2), MapQuest (3)

Page 73: “Small-world” complex networks

• Information networks are typically dynamic graph abstractions, built from diverse data sources
• High-dimensional data
• Skewed (“power law”) distribution of the number of neighbors (vertex degree)
• Low graph diameter (e.g., Kevin Bacon and six degrees of separation)
• Massive networks (billions of entities)

[Plot: degree distribution of the human protein interaction data set (18,669 proteins, 43,568 interactions), frequency vs. degree on log-log axes.]

Image source: Seokhee Hong

Page 74: Implementation Challenges

• Execution time is dominated by latency to main memory
  – Large memory footprint
  – Large number of irregular memory accesses
• Essentially no computation to hide memory costs
• Poor performance on current cache-based architectures (< 5-10% of peak achieved)
• Memory access pattern depends on the graph topology
• Variable degrees of concurrency in the parallel algorithm

Page 75: Desirable HPC Architectural Features

• A global shared-memory abstraction
  – no need to partition the graph
  – supports dynamic updates
• A high-bandwidth, low-latency network
• Ability to exploit fine-grained parallelism
• Support for light-weight synchronization

HPC systems with these characteristics: massively multithreaded architectures, symmetric multiprocessors.

Page 76: Performance Results: Test Platforms

Cray XMT
• Latency tolerance by massive multithreading
  – hardware support for 128 threads on each processor
  – globally hashed address space
  – no data cache
  – single-cycle context switch
  – multiple outstanding memory requests
• Support for fine-grained, word-level synchronization
• 16 x 500 MHz processors, 128 GB RAM

Sun Fire T5120
• Sun Niagara2: cache-based multicore server with chip multithreading
• 1 socket x 8 cores x 8 threads per core
• 8 KB private L1 cache per core, 4 MB shared L2 cache
• 1167 MHz processor, 32 GB RAM

Page 77: Betweenness Centrality Performance

[Plot: betweenness TEPS rate (millions of edges per second, 0-180) vs. number of processors/cores (1-16) on the Cray XMT, Sun UltraSparc T2, and a 2.0 GHz quad-core Xeon.]

• Approximate betweenness computation on a synthetic small-world network of 256 million vertices and 2 billion edges
• TEPS: traversed edges per second, the performance rate for the centrality computation
• The new algorithm (BC-new) is 2.2-2.4x faster than the previous approach (BC-old)

Page 78: SNAP: Small-world Network Analysis and Partitioning

• New parallel framework for complex graph analysis
• 10-100x faster than existing approaches
• Can process graphs with billions of vertices and edges
• Open-source: snap-graph.sourceforge.net

Image source: visualcomplexity.com

Page 79: SNAP: Compact graph representations for dynamic network analysis

• New graph representations for dynamically evolving small-world networks in SNAP.
• We support fast, parallel structural updates to low-diameter scale-free and small-world graphs.

[Plot: execution time per update (nanoseconds, 0-1400) and relative speedup (up to ~14) vs. number of threads (1-32). Graph: 25M vertices and 200M edges; system: Sun Fire T2000.]

Page 80: SNAP: Induced Subgraphs Performance

• Key kernel for dynamic graph computations
• We reduce the execution time of linear-work kernels from minutes to seconds for massive small-world networks (billions of vertices and edges)

[Plot: execution time (seconds, 0-90) and relative speedup (up to ~12) vs. number of threads (1-32). Graph: 500M vertices and 2B edges; system: IBM p5 570 SMP.]

Page 81: Large-scale Graph Traversal

Problem                      | Graph                                    | Result                               | Comments
Multithreaded BFS [BM06]     | Random graph, 256M vertices, 1B edges    | 2.3 sec (40p), 73.9 sec (1p), MTA-2  | Processes all low-diameter graph families
External Memory BFS [ADMO06] | Random graph, 256M vertices, 1B edges    | 8.9 hrs (3.2 GHz Xeon)               | State-of-the-art external memory BFS
Multithreaded SSSP [MBBC06]  | Random graph, 256M vertices, 1B edges    | 11.96 sec (40p), MTA-2               | Works well for all low-diameter graph families
Parallel Dijkstra [EBGL06]   | Random graph, 240M vertices, 1.2B edges  | 180 sec, 96p, 2.0 GHz cluster        | Best known distributed-memory SSSP implementation for large-scale graphs

Page 82: Optimizations for real-world graphs

• Preprocessing kernels (connected components, biconnected components, sparsification) significantly reduce computation time.
  – e.g., a high number of isolated and degree-1 vertices
• Store BFS/shortest-path trees from high-degree vertices and reuse them
• Typically 3-5X performance improvement
• Exploit small-world network properties (low graph diameter)
  – Load balancing in the level-synchronous parallel BFS algorithm
  – SNAP data structures are optimized for unbalanced degree distributions

Page 83: Faster Community Identification Algorithms in SNAP

Performance improvement over the Girvan-Newman approach.

• Speedups from algorithm engineering (approximate BC) and parallelization (Sun Fire T2000) are multiplicative!
• 100-300X overall performance improvement over the Girvan-Newman approach

[Plot: per-network performance improvement (0-30x, split into parallelization and algorithm engineering contributions) for the real-world networks PPI, Citations, DBLP, NDwww, and Actor (on the order of millions of vertices/edges); system: Sun Fire T2000.]

Page 84: Large-scale graph analysis: current research

• A synergistic combination of novel graph algorithms, high-performance computing, and algorithm engineering.

[Diagram: graph problems on dynamic network abstractions, drawing on novel approaches (dynamic graph algorithms, spectral techniques, data stream algorithms, classical graph algorithms), realistic modeling (complex network analysis and empirical studies), and enabling technologies (many-core, stream computing, affordable exascale data storage).]

Page 85: Review of lecture

• Applications: Internet and WWW, scientific computing, data analysis, surveillance
• Parallel algorithm building blocks
  – Kernels: PRAM algorithms, prefix sums, list ranking
  – Data structures: graph representation, priority queues
• Parallel algorithm case studies
  – Connected components: graft and shortcut
  – BFS: level-synchronous approach
  – Shortest paths: parallel priority queue
  – Betweenness centrality: parallel algorithm with pseudo-code
• Performance on current systems
  – Software: SNAP, Boost GL, igraph, NetworkX, Network Workbench
  – Architectures: Cray XMT, cache-based multicore, SMPs
  – Performance trends

Page 86: Thank you!

• Questions?