Parallel Graph Algorithms
Kamesh Madduri ([email protected])
Talk Outline
• Applications
• Parallel algorithm building blocks
  – Kernels
  – Data structures
• Parallel algorithm case studies
  – Connected components
  – BFS/Shortest paths
  – Betweenness centrality
• Performance on current systems
  – Software
  – Architectures
  – Performance trends
Routing in transportation networks
• Road networks, point-to-point shortest paths: from 15 seconds (naïve) down to 10 microseconds
H. Bast et al., "Fast Routing in Road Networks with Transit Nodes", Science, 2007.
Internet and the WWW
• The world-wide web can be represented as a directed graph
  – Web search and crawl: traversal
  – Link analysis, ranking: PageRank and HITS
  – Document classification and clustering
• Internet topologies (router networks) are naturally modeled as graphs
Scientific Computing
• Reorderings for sparse solvers
  – Fill-reducing orderings: partitioning, traversals, eigenvectors
  – Heavy diagonal to reduce pivoting (matching)
• Data structures for efficient exploitation of sparsity
• Derivative computations for optimization
  – Matroids, graph colorings, spanning trees
• Preconditioning
  – Incomplete factorizations
  – Partitioning for domain decomposition
  – Graph techniques in algebraic multigrid: independent sets, matchings, etc.
  – Support theory: spanning trees & graph embedding techniques
B. Hendrickson, "Graphs and HPC: Lessons for Future Architectures", http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf
Large-scale data analysis
• Graph abstractions are very useful to analyze complex data sets.
• Sources of data: petascale simulations, experimental devices, the Internet, sensor networks
• Challenges: data size, heterogeneity, uncertainty, data quality
• Astrophysics: massive datasets, temporal variations
• Bioinformatics: data quality, heterogeneity
• Social informatics: new analytics challenges, data uncertainty
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com
Data Analysis and Graph Algorithms in Systems Biology
• Study of the interactions between various components in a biological system
• Graph-theoretic formulations are pervasive:
  – Predicting new interactions: modeling
  – Functional annotation of novel proteins: matching, clustering
  – Identifying metabolic pathways: paths, clustering
  – Identifying new protein complexes: clustering, centrality
Image source: Giot et al., "A Protein Interaction Map of Drosophila melanogaster", Science 302, 1722-1736, 2003.
Graph-theoretic problems in social networks
• Community identification: clustering
• Targeted advertising: centrality
• Information spreading: modeling
Image source: Nexus (Facebook application)
Network Analysis for Intelligence and Surveillance
• [Krebs '04] Post-9/11 terrorist network analysis from public-domain information
• Plot masterminds correctly identified from interaction patterns: centrality
• A global view of entities is often more insightful
• Detect anomalous activities by exact/approximate graph matching
Image source: http://www.orgnet.com/hijackers.html
T. Coffman, S. Greenblatt, S. Marcus, "Graph-based technologies for intelligence analysis", CACM, 47(3), March 2004, pp. 45-47.
Characterizing Graph-theoretic computations
Input: a graph abstraction. Problem: find *** (paths, clusters, partitions, matchings, patterns, orderings).
Graph kernels: traversal, shortest path algorithms, flow algorithms, spanning tree algorithms, topological sort, ...
Factors that influence choice of algorithm:
• graph sparsity (m/n ratio)
• static/dynamic nature
• weighted/unweighted, weight distribution
• vertex degree distribution
• directed/undirected
• simple/multi/hyper graph
• problem size
• granularity of computation at nodes/edges
• domain-specific characteristics
Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations.
Talk Outline (recap): next up, parallel algorithm building blocks (kernels and data structures).
Parallel Computing Models: A Quick PRAM Review
• Objectives
  – Bridge between software and hardware: general-purpose HW, scalable HW, transportable SW
  – Abstract architecture for algorithm development
• Why is it so important?
  – Uniprocessor: von Neumann model of computation
  – Parallel processors: multicore
• Requirements (inherent tension):
  – simple enough to make analysis of interesting problems tractable
  – detailed enough to reveal the important bottlenecks
• Models, e.g.:
  – PRAM: rich collection of parallel graph algorithms
  – BSP: some CGM algorithms (cgmGraph)
  – LogP
PRAM
• An idealized model of a parallel computer for analyzing the efficiency of parallel algorithms.
• A PRAM is composed of:
  – P unmodifiable programs, each composed of optionally labeled instructions
  – a single shared memory composed of a sequence of words, each capable of containing an arbitrary integer
  – P accumulators, one associated with each program
  – a read-only input tape
  – a write-only output tape
• No local memory in each RAM.
• Synchronization, communication, and parallel overhead are zero.
PRAM Data Access Forms
• EREW (Exclusive Read, Exclusive Write)
  – A memory cell can be read or written by at most one processor per cycle.
  – Ensures no read or write conflicts.
• CREW (Concurrent Read, Exclusive Write)
  – Ensures there are no write conflicts.
• CRCW (Concurrent Read, Concurrent Write)
  – Requires use of some conflict resolution scheme.
PRAM Pros and Cons
• Pros
  – Simple and clean semantics.
  – The majority of theoretical parallel algorithms are specified with the PRAM model.
  – Independent of the communication network topology.
• Cons
  – Not realistic; the communication model is too powerful.
  – The algorithm designer is misled into using IPC without hesitation.
  – Synchronized processors.
  – No local memory.
  – Big-O notation is often misleading.
Analyzing Parallel Graph Algorithms
• Problem parameters: n (vertices), m (edges), D (graph diameter)
• Worst-case running time: T
• Total number of operations (work): W
• Nick's Class (NC): complexity class for problems that can be solved in poly-logarithmic time using a polynomial number of processors
• P-complete: inherently sequential
The Helman-JaJa model
• An extension to the PRAM model for shared-memory algorithm design and analysis.
• T(n, p) is measured by the triplet (TM(n, p), TC(n, p), B(n, p)):
  – TM(n, p): maximum number of non-contiguous main memory accesses required by any processor
  – TC(n, p): upper bound on the maximum local computational complexity of any of the processors
  – B(n, p): number of barrier synchronizations
Building blocks of classical PRAM graph algorithms
• Prefix sums
• List ranking
  – Euler tours, pointer jumping, symmetry breaking
• Sorting
• Tree contraction
Prefix Sums
• Input: A, an array of n elements; an associative binary operation ⊕
• Output: B(i) = A(1) ⊕ A(2) ⊕ ... ⊕ A(i), for 1 ≤ i ≤ n
[Illustration: balanced binary tree over the 8 input elements; partial sums B(h, i) are computed in an upward sweep and prefix values C(h, i) in a downward sweep.]
O(n) work, O(log n) time, n processors
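For concreteness, here is a minimal sequential Python sketch (not from the slides) of prefix sums by repeated doubling; each doubling round corresponds to one fully parallel PRAM step. Note this simple variant performs O(n log n) work, whereas the balanced-tree algorithm above achieves O(n) work.

def prefix_sums(A, op=lambda x, y: x + y):
    """Inclusive prefix sums by repeated doubling (log2(n) rounds).
    In a PRAM setting, every position i >= d updates simultaneously in a
    round; this sequential reference simulates the rounds one by one."""
    B = list(A)
    d = 1
    while d < len(B):
        B = [B[i] if i < d else op(B[i - d], B[i]) for i in range(len(B))]
        d *= 2
    return B

# Example: prefix_sums([1] * 8) -> [1, 2, 3, 4, 5, 6, 7, 8]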
20
• X: array of n elements stored in arbitrary order.• For each element i, let X(i).value be its value and X(i).next be the
index of its successor.• For binary associative operator Θ, compute X(i).prefix such that
– X(head).prefix = X (head).value, and– X(i).prefix = X(i).value Θ X(predecessor).prefixwhere– head is the first element– i is not equal to head, and– predecessor is the node preceding i.
• List ranking: special case of parallel prefix, values initially set to 1, and addition is the associative operator.
Parallel Prefix
5/4/2009 Parallel Graph Algorithms
List ranking illustration
• Ordered list (X.next values): 2 3 4 5 6 7 8 9
• Random list (X.next values): 4 6 5 7 8 3 2 9
List ranking key idea
1. Chop X randomly into s pieces.
2. Traverse each piece using a serial algorithm.
3. Compute the global rank of each element using the results from the second step.
• Locality (list ordering) determines performance.
• In the Helman-JaJa model, TM(n, p) = O(n/p).
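A sequential Python sketch of this chop-traverse-combine idea, assuming the list is given as an array of successor indices with -1 marking the tail; the per-sublist traversals in step 2 are the part that runs in parallel.

import random

def list_rank(next_, head, s=4):
    """Sketch of the Helman-JaJa list-ranking idea (sequential simulation).
    next_[i] is the successor index of element i (-1 at the tail);
    returns rank[i] = number of elements preceding i in the list."""
    n = len(next_)
    # 1. Choose s random splitters; the head must be one of them.
    splitters = {head} | set(random.sample(range(n), min(s, n)))
    # 2. Traverse each sublist serially (in parallel across sublists),
    #    recording local ranks and the splitter that owns each element.
    local_rank, owner, succ_splitter, sub_len = {}, {}, {}, {}
    for sp in splitters:
        i, r = sp, 0
        while True:
            local_rank[i], owner[i] = r, sp
            j = next_[i]
            if j == -1 or j in splitters:
                succ_splitter[sp], sub_len[sp] = j, r + 1
                break
            i, r = j, r + 1
    # 3. Rank the short list of splitters serially.
    base, sp, acc = {}, head, 0
    while sp != -1:
        base[sp] = acc
        acc += sub_len[sp]
        sp = succ_splitter[sp]
    # 4. Global rank = splitter's base rank + local rank (parallel fix-up).
    return [base[owner[i]] + local_rank[i] for i in range(n)]

# Example: list 0 -> 2 -> 1 (next_ = [2, -1, 1], head = 0) gives ranks [0, 2, 1].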
An example higher-level algorithm
Tarjan-Vishkin's biconnected components algorithm: O(log n) time, O(m+n) work.
1. Compute a spanning tree T for the input graph G.
2. Compute an Eulerian circuit for T.
3. Root the tree at an arbitrary vertex.
4. Preorder-number all the vertices.
5. Label edges using the vertex numbering.
6. Find connected components using the Shiloach-Vishkin algorithm.
Data structures: graph representation
• Dense graphs (m = O(n²)): adjacency matrix commonly used.
• Sparse graphs: adjacency lists, similar to the CSR sparse matrix format.
• Dynamic sparse graphs: we need to support edge and vertex membership queries, insertions, and deletions.
  – should be space-efficient, with low synchronization overhead
• Several different representations are possible:
  – Resizable adjacency arrays
  – Adjacency arrays, sorted by vertex identifiers
  – Adjacency arrays for low-degree vertices, heap-based structures for high-degree vertices (for sparse graphs with skewed degree distributions)
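As an illustration of the adjacency-array (CSR-like) representation mentioned above, here is a small Python sketch; the names build_csr, offset, and adj are illustrative, not from any particular library.

def build_csr(n, edges, directed=False):
    """Build a CSR-style adjacency array from an edge list.
    Vertex u's neighbors are adj[offset[u]:offset[u+1]]."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        if not directed:
            deg[v] += 1
    offset = [0] * (n + 1)
    for u in range(n):
        offset[u + 1] = offset[u] + deg[u]   # prefix sums of the degrees
    adj = [0] * offset[n]
    cur = list(offset)                       # next free slot per vertex
    for u, v in edges:
        adj[cur[u]] = v; cur[u] += 1
        if not directed:
            adj[cur[v]] = u; cur[v] += 1
    return offset, adj

# Example: offset, adj = build_csr(3, [(0, 1), (0, 2), (1, 2)])
# neighbors of vertex 0: adj[offset[0]:offset[1]] -> [1, 2]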
Data structures in (parallel) graph algorithms
• A wide range of ADTs are used in graph algorithms: array, list, queue, stack, set, multiset, tree.
• ADT implementations are typically array-based for performance reasons.
• Key data structure considerations in parallel graph algorithm design:
  – Practical parallel priority queues
  – Space efficiency
  – Parallel set/multiset operations, e.g., union, intersection, etc.
Talk Outline (recap): next up, parallel algorithm case studies, starting with connected components.
Connected Components
• Building block for many graph algorithms
  – Minimum spanning tree, spanning tree, planarity testing, etc.
• Representative of the "graft-and-shortcut" approach
• CRCW PRAM algorithms
  – [Shiloach & Vishkin '82]: O(log n) time, O((m+n) log n) work
  – [Gazit '91]: randomized, optimal, O(log n) time
• CREW algorithms
  – [Han & Wagner '90]: O(log² n) time, O((m + n log n) log n) work
Shiloach-Vishkin algorithm
• Input: n isolated vertices and m PRAM processors.
• Each processor Pi grafts a tree rooted at vertex vi onto the tree that contains one of its neighbors u, under the constraint u < vi.
• Grafting creates k ≥ 1 connected subgraphs; each subgraph is then shortcut so that the depth of the trees is reduced by at least half.
• Repeat graft and shortcut until no more grafting is possible.
• Runs on an arbitrary CRCW PRAM in O(log n) time with O(m) processors.
• Helman-JaJa model: TM = (3m/p + 2) log n, B = 4 log n.
SV pseudo-code
• Input: (1) a set of m edges (i, j) given in arbitrary order; (2) array D[1..n] with D[i] = i.
• Output: array D[1..n] with D[i] being the component to which vertex i belongs.

begin
  while true do
    1. for (i, j) ∈ E in parallel do
         if D[i] = D[D[i]] and D[j] < D[i] then D[D[i]] = D[j];
    2. for (i, j) ∈ E in parallel do
         if i belongs to a star and D[j] ≠ D[i] then D[D[i]] = D[j];
    3. if all vertices are in rooted stars then exit;
       for all i in parallel do D[i] = D[D[i]]
end
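A sequential Python sketch of the graft-and-shortcut idea: a simplified variant that hooks in both edge directions rather than using the explicit star test of the pseudo-code above; the two inner loops are what run over all edges and vertices in parallel.

def connected_components_sv(n, edges):
    """Shiloach-Vishkin-style connected components on a parent array D.
    Repeatedly graft tree roots onto smaller-labeled neighboring trees,
    then shortcut (pointer-jump) to halve tree depths."""
    D = list(range(n))
    changed = True
    while changed:
        changed = False
        # Graft step (parallel over edges in the PRAM algorithm).
        for i, j in edges:
            for a, b in ((i, j), (j, i)):       # treat the edge as undirected
                if D[a] == D[D[a]] and D[b] < D[a]:
                    D[D[a]] = D[b]
                    changed = True
        # Shortcut step (parallel over vertices): pointer jumping.
        for v in range(n):
            if D[v] != D[D[v]]:
                D[v] = D[D[v]]
                changed = True
    return D

# Example: connected_components_sv(5, [(0, 1), (1, 2), (3, 4)]) -> [0, 0, 0, 3, 3]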
SV Illustration
[Illustration: a small 4-vertex input graph; the first iteration grafts and shortcuts (pairs 1,4 and 2,3), and a second graft-and-shortcut iteration produces the final component labels.]
Talk Outline (recap): next up, BFS/shortest paths.
Parallel Single-source Shortest Paths (SSSP) algorithms
• No known PRAM algorithm that runs in sub-linear time and O(m + n log n) work
• Parallel priority queues: relaxed heaps [DGST88], [BTZ98]
• Ullman-Yannakakis randomized approach [UY90]
• Meyer et al.'s ∆-stepping algorithm [MS03]
• Distributed-memory implementations based on graph partitioning
• Heuristics for load balancing and termination detection
K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, "An Experimental Study of a Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances," Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.
∆-stepping algorithm [MS03]
• Label-correcting algorithm: can relax edges from unsettled vertices as well
• ∆-stepping: an "approximate bucket implementation of Dijkstra's algorithm"
• ∆: bucket width
• Vertices are ordered using buckets representing priority ranges of size ∆
• Each bucket may be processed in parallel
∆-stepping algorithm: illustration
• Classify edges as "heavy" and "light".
• Relax light edges (phase); repeat until the current bucket B[i] is empty.
• Relax heavy edges; no reinsertions in this step.

One parallel phase, while the current bucket is non-empty:
  i) Inspect light edges
  ii) Construct a set of "requests" (R)
  iii) Clear the current bucket
  iv) Remember deleted vertices (S)
  v) Relax request pairs in R
Then relax heavy request pairs (from S) and go on to the next bucket.

[Illustration: a 7-vertex weighted example graph with ∆ = 0.1; starting from d(s) = 0 and all other tentative distances ∞, the d array and the buckets are updated over successive light-edge phases and a final heavy-edge relaxation.]
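A compact sequential Python sketch of the bucket structure just described: light edges generate requests and may be re-relaxed within a bucket's phases, while heavy edges are relaxed once per bucket. In the parallel algorithm of [MS03] each request set is relaxed concurrently; the adjacency-list format below is illustrative.

import math

def delta_stepping(n, adj, s, delta):
    """∆-stepping SSSP sketch. adj[u] is a list of (v, w) pairs;
    edges with w <= delta are "light", the rest are "heavy"."""
    INF = math.inf
    d = [INF] * n
    buckets = {}                              # bucket index -> set of vertices

    def relax(v, x):
        if x < d[v]:
            if d[v] < INF:
                buckets.get(int(d[v] // delta), set()).discard(v)
            d[v] = x
            buckets.setdefault(int(x // delta), set()).add(v)

    relax(s, 0.0)
    i = 0
    while any(buckets.values()):
        if not buckets.get(i):                # skip empty bucket indices
            i += 1
            continue
        S = set()
        while buckets.get(i):                 # light-edge phases for bucket i
            frontier = buckets.pop(i)
            S |= frontier
            requests = [(v, d[u] + w) for u in frontier
                        for v, w in adj[u] if w <= delta]
            for v, x in requests:             # relaxed concurrently in parallel
                relax(v, x)
        for u in S:                           # heavy edges: relaxed once per bucket
            for v, w in adj[u]:
                if w > delta:
                    relax(v, d[u] + w)
        i += 1
    return d

# Example: adj = [[(1, 0.05), (2, 0.56)], [(2, 0.02)], []]
# delta_stepping(3, adj, 0, 0.1) -> [0.0, 0.05, 0.07]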
[Plot: number of phases (machine-independent performance count) across graph families (Rnd-rnd, Rnd-logU, Scale-free, LGrid-rnd, LGrid-logU, SqGrid, USAd NE, USAt NE); the low-diameter families need far fewer phases than the high-diameter families.]
[Plot: average shortest path weight for the same graph families; ~2^20 vertices, 2^22 edges, directed graph, edge weights normalized to [0,1].]
[Plot: last non-empty bucket (machine-independent performance count) across the same graph families. Fewer buckets imply more parallelism.]
[Plot: number of bucket insertions (machine-independent performance count) across the same graph families.]
Talk Outline (recap): next up, betweenness centrality.
Betweenness Centrality
• Centrality: quantitative measure to capture the importance of a vertex/edge in a graph
  – degree, closeness, eigenvalue, betweenness
• Betweenness centrality:
  BC(v) = Σ_{s ≠ v ≠ t} σst(v) / σst
  (σst: no. of shortest paths between s and t; σst(v): those that pass through v)
• Applied to several real-world networks
  – Social interactions
  – WWW
  – Epidemiology
  – Systems biology
Algorithms for Computing Betweenness
• All-pairs shortest path approach: compute the length and number of shortest paths between all s-t pairs (O(n³) time), then sum up the fractional dependency values (O(n²) space).
• Brandes' algorithm (2003): augment a single-source shortest path computation to count paths; uses the Bellman criterion; O(mn) work and O(m+n) space.
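For reference, a sequential Python sketch of Brandes' approach for unweighted graphs: one BFS per source to count shortest paths, followed by dependency accumulation in reverse BFS order (for an undirected graph the resulting scores count each vertex pair twice).

from collections import deque

def betweenness_brandes(n, adj):
    """Brandes-style betweenness centrality for an unweighted graph
    (adj[u] = list of neighbors). O(mn) work, O(m + n) space."""
    BC = [0.0] * n
    for s in range(n):
        sigma = [0] * n; sigma[s] = 1          # no. of shortest s-v paths
        dist = [-1] * n; dist[s] = 0
        pred = [[] for _ in range(n)]          # predecessors on shortest paths
        order = []                             # vertices in BFS (distance) order
        q = deque([s])
        while q:                               # traversal and path counting
            u = q.popleft()
            order.append(u)
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    q.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    pred[v].append(u)
        delta = [0.0] * n
        for w in reversed(order):              # dependency accumulation
            for v in pred[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w])
            if w != s:
                BC[w] += delta[w]
    return BC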
Our New Parallel Algorithms
• Madduri, Bader (2006): parallel algorithms for computing exact and approximate betweenness centrality
  – low-diameter sparse graphs (diameter D = O(log n), m = O(n log n))
  – Exact algorithm: O(mn) work, O(m+n) space, O(nD) time (PRAM model) or O(nD + nm/p) time
• Madduri et al. (2009): new parallel algorithm with lower synchronization overhead and fewer non-contiguous memory references
  – In practice, 2-3x faster than the previous algorithm
  – Lock-free => better scalability on large parallel systems
Parallel BC Algorithm
• Consider an undirected, unweighted graph
• High-level idea: level-synchronous parallel breadth-first search, augmented to compute centrality scores
• Two steps:
  – traversal and path counting
  – dependency accumulation:
    δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) · (1 + δ(w))
Parallel BC Algorithm Illustration
Data structures:
• G (size m+n): read-only, adjacency array representation
• BC (size n): centrality score of each vertex
• S (n): stack of visited vertices
• Visited (n): mark to check if a vertex has been visited
• D (n): distance of a vertex from the source s
• Sigma (n): no. of shortest paths through a vertex
• Delta (n): partial dependency score for each vertex
• P (m+n): multiset of predecessors of a vertex along shortest paths
Space requirement: 8(m + 6n) bytes
[Illustration: a 10-vertex example graph (vertices 0-9) used in the following traversal and accumulation steps.]
Parallel BC Algorithm Illustration
1. Traversal step: visit adjacent vertices, update distance and path counts; at the end, we have all reachable vertices, their corresponding predecessor multisets, and D values.
Level-synchronous approach: the adjacencies of all vertices in the current frontier can be visited in parallel.
[Illustration: successive BFS frontiers expanding from the source vertex on the 10-vertex example; S, D, Sigma, and P are filled in level by level.]
Step 1 (traversal) pseudo-code

for all vertices u at level d in parallel do
    for all adjacencies v of u in parallel do
        dv = D[v];
        if (dv < 0)                          // v is visited for the first time
            vis = fetch_and_add(&Visited[v], 1);
            if (vis == 0)                    // v is added to a stack only once
                D[v] = d+1;
                pS[count++] = v;             // Add v to local thread stack
            fetch_and_add(&sigma[v], sigma[u]);
            fetch_and_add(&Pcount[v], 1);    // Add u to predecessor list of v
        if (dv == d + 1)
            fetch_and_add(&sigma[v], sigma[u]);
            fetch_and_add(&Pcount[v], 1);    // Add u to predecessor list of v
Graph traversal step analysis
• Exploit concurrency in visiting adjacencies, as we assume that the graph diameter is small: O(log n)
• Upper bound on the size of each predecessor multiset: in-degree
• Potential performance bottlenecks: atomic updates to predecessor multisets, atomic increments of path counts
• New algorithm: based on the observation that we don't need to store "predecessor" vertices; instead, we store successor edges along shortest paths.
  – simplifies the accumulation step
  – reduces the number of atomic operations in the traversal step
  – cache-friendly!
Modified Step 1 pseudo-code

for all vertices u at level d in parallel do
    for all adjacencies v of u in parallel do
        dv = D[v];
        if (dv < 0)                          // v is visited for the first time
            vis = fetch_and_add(&Visited[v], 1);
            if (vis == 0)                    // v is added to a stack only once
                D[v] = d+1;
                pS[count++] = v;             // Add v to local thread stack
            fetch_and_add(&sigma[v], sigma[u]);
            fetch_and_add(&Scount[u], 1);    // Add v to successor list of u
        if (dv == d + 1)
            fetch_and_add(&sigma[v], sigma[u]);
            fetch_and_add(&Scount[u], 1);    // Add v to successor list of u
Graph Traversal Step locality analysis

for all vertices u at level d in parallel do         // all vertices are in a contiguous block (stack)
    for all adjacencies v of u in parallel do        // adjacencies of a vertex are stored compactly (graph rep.)
        dv = D[v];                                   // non-contiguous memory access
        if (dv < 0)
            vis = fetch_and_add(&Visited[v], 1);     // non-contiguous memory access
            if (vis == 0)
                D[v] = d+1;
                pS[count++] = v;                     // store to S[u]
            fetch_and_add(&sigma[v], sigma[u]);      // non-contiguous memory access
            fetch_and_add(&Scount[u], 1);
        if (dv == d + 1)
            fetch_and_add(&sigma[v], sigma[u]);
            fetch_and_add(&Scount[u], 1);

Better cache utilization is likely if D[v], Visited[v], and sigma[v] are stored contiguously.
Parallel BC Algorithm Illustration
2. Accumulation step: pop vertices from the stack and update dependency scores:
   δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) · (1 + δ(w))
   This can also be done in a level-synchronous manner.
[Illustration: dependencies propagated backwards, level by level, from the farthest vertices towards the source on the 10-vertex example.]
Step 2 (accumulation) pseudo-code

for level d = GraphDiameter to 2 do
    for all vertices w at level d in parallel do
        for all v in P[w] do
            acquire_lock(v);
            delta[v] = delta[v] + (1 + delta[w]) * sigma(v)/sigma(w);
            release_lock(v);
        BC[v] = delta[v]

δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) · (1 + δ(w))
Modified Step 2 pseudo-code (w/ successor lists)

for level d = GraphDiameter-2 to 1 do
    for all vertices v at level d in parallel do
        for all w in S[v] in parallel do reduction(delta)
            delta[v] = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];
        BC[v] = delta[v]

δ(v) = Σ_{w ∈ S(v)} (σ(v) / σ(w)) · (1 + δ(w))
Accumulation step locality analysis

for level d = GraphDiameter-2 to 1 do
    for all vertices v at level d in parallel do              // all vertices are in a contiguous block (stack)
        for all w in S[v] in parallel do reduction(delta)     // each S[v] is a contiguous block
            delta_sum_v = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];   // only floating-point operation in the code
        BC[v] = delta[v] = delta_sum_v;
Digression: Centrality Analysis applied to Protein Interaction Networks
[Plot: human genome core protein interactions, degree vs. betweenness centrality (log-log scale). A highlighted protein, Kelch-like protein 8 (Ensembl ID ENSG00000145332.2), has only 43 interactions yet stands out in betweenness centrality.]
Talk Outline (recap): next up, performance on current systems (software, architectures, performance trends).
Graph topology matters
• Information networks are very different from the graph topologies and computations that arise in scientific computing.
• Classical graph algorithms typically assume a uniformly random graph topology.
• Informatics: dynamic, high-dimensional data. Scientific computing: static networks, Euclidean topologies.
Image sources: visualcomplexity.com (1,2), MapQuest (3)
"Small-world" complex networks
• Information networks are typically dynamic graph abstractions, built from diverse data sources
• High-dimensional data
• Skewed ("power law") distribution of the number of neighbors (vertex degree)
• Low graph diameter
• Massive networks (billions of entities)
• Example: Kevin Bacon and six degrees of separation
[Plot: degree distribution of the human protein interaction data set (18,669 proteins, 43,568 interactions), heavy-tailed over degrees 1-1000.]
Image source: Seokhee Hong
Implementation Challenges
• Execution time is dominated by latency to main memory
  – Large memory footprint
  – Large number of irregular memory accesses
• Essentially no computation to hide memory costs
• Poor performance on current cache-based architectures (< 5-10% of peak achieved)
• Memory access pattern is dependent on the graph topology
• Variable degrees of concurrency in the parallel algorithm
Desirable HPC Architectural Features
• A global shared memory abstraction
  – no need to partition the graph
  – support for dynamic updates
• A high-bandwidth, low-latency network
• Ability to exploit fine-grained parallelism
• Support for light-weight synchronization
• HPC systems with these characteristics:
  – Massively multithreaded architectures
  – Symmetric multiprocessors
Performance Results: Test Platforms
Cray XMT:
• Latency tolerance by massive multithreading
  – hardware support for 128 threads on each processor
  – globally hashed address space
  – no data cache
  – single-cycle context switch
  – multiple outstanding memory requests
• Support for fine-grained, word-level synchronization
• 16 x 500 MHz processors, 128 GB RAM
Sun Fire T5120:
• Sun Niagara2: cache-based multicore server with chip multithreading
• 1 socket x 8 cores x 8 threads per core
• 8 KB private L1 cache per core, 4 MB shared L2 cache
• 1167 MHz processor, 32 GB RAM
Betweenness Centrality Performance
[Plot: betweenness TEPS rate (millions of traversed edges per second) vs. number of processors/cores (1-16) for the Cray XMT, Sun UltraSparc T2, and a 2.0 GHz quad-core Xeon.]
• Approximate betweenness computation on a synthetic small-world network of 256 million vertices and 2 billion edges
• TEPS: traversed edges per second, the performance rate for the centrality computation
• BC-new is roughly 2.2-2.4x faster than the previous approach (BC-old)
SNAP: Small-world Network Analysis and Partitioning
• New parallel framework for complex graph analysis
• 10-100x faster than existing approaches
• Can process graphs with billions of vertices and edges
• Open-source: snap-graph.sourceforge.net
Image source: visualcomplexity.com
SNAP: Compact graph representations for dynamic network analysis
• New graph representations for dynamically evolving small-world networks in SNAP.
• We support fast, parallel structural updates to low-diameter scale-free and small-world graphs.
[Plot: execution time per update (nanoseconds) and relative speedup vs. number of threads (1-32). Graph: 25M vertices and 200M edges; system: Sun Fire T2000.]
SNAP: Induced Subgraphs Performance
• Key kernel for dynamic graph computations
• We reduce the execution time of linear-work kernels from minutes to seconds for massive small-world networks (billions of vertices and edges)
[Plot: execution time (seconds) and relative speedup vs. number of threads (1-32). Graph: 500M vertices and 2B edges; system: IBM p5 570 SMP.]
Large-scale Graph Traversal

Problem: Multithreaded BFS [BM06]
  Graph: random graph, 256M vertices, 1B edges
  Result: 2.3 sec (40p), 73.9 sec (1p), MTA-2
  Comments: processes all low-diameter graph families

Problem: External Memory BFS [ADMO06]
  Graph: random graph, 256M vertices, 1B edges
  Result: 8.9 hrs (3.2 GHz Xeon)
  Comments: state-of-the-art external memory BFS

Problem: Multithreaded SSSP [MBBC06]
  Graph: random graph, 256M vertices, 1B edges
  Result: 11.96 sec (40p), MTA-2
  Comments: works well for all low-diameter graph families

Problem: Parallel Dijkstra [EBGL06]
  Graph: random graph, 240M vertices, 1.2B edges
  Result: 180 sec (96p, 2.0 GHz cluster)
  Comments: best known distributed-memory SSSP implementation for large-scale graphs
Optimizations for real-world graphs
• Preprocessing kernels (connected components, biconnected components, sparsification) significantly reduce computation time.
  – e.g., real-world graphs have a high number of isolated and degree-1 vertices
• Store BFS/shortest-path trees from high-degree vertices and reuse them
  – typically a 3-5x performance improvement
• Exploit small-world network properties (low graph diameter)
  – Load balancing in the level-synchronous parallel BFS algorithm
  – SNAP data structures are optimized for unbalanced degree distributions
Faster Community Identification Algorithms in SNAP: Performance Improvement over the Girvan-Newman approach
• Speedups from algorithm engineering (approximate BC) and from parallelization (Sun Fire T2000) are multiplicative!
• 100-300x overall performance improvement over the Girvan-Newman approach
[Plot: performance improvement factors from parallelization and from algorithm engineering on real-world small-world networks (PPI, Citations, DBLP, NDwww, Actor), each on the order of millions of vertices; system: Sun Fire T2000.]
Large-scale graph analysis: current research
• Synergistic combination of novel graph algorithms, high-performance computing, and algorithm engineering.
[Diagram: novel approaches (dynamic graph algorithms, spectral techniques, data stream algorithms, classical graph algorithms), realistic modeling (complex network analysis & empirical studies), and enabling technologies (many-core, stream computing, affordable exascale data storage) all feeding into graph problems on dynamic network abstractions.]
Review of lecture
• Applications: Internet and WWW, scientific computing, data analysis, surveillance
• Parallel algorithm building blocks
  – Kernels: PRAM algorithms, prefix sums, list ranking
  – Data structures: graph representation, priority queues
• Parallel algorithm case studies
  – Connected components: graft and shortcut
  – BFS: level-synchronous approach
  – Shortest paths: parallel priority queue
  – Betweenness centrality: parallel algorithm with pseudo-code
• Performance on current systems
  – Software: SNAP, Boost GL, igraph, NetworkX, Network Workbench
  – Architectures: Cray XMT, cache-based multicore, SMPs
  – Performance trends
Thank you!
• Questions?