© Synopsys 2012 1
Tree Accumulations on GPU
With Applications to Sparse Linear Algebra
Scott Rostrup, Shweta Srivastava, Kishore Singhal
Synopsys Inc.
Applications
• Tree Structure Computations
– e.g. subtree size, subtree weight, depth, …
• Sparse Iterative Solvers
– Direct Solve of Tree Matrix (Preconditioning)
• Fast Multipole Method
• DNA/Protein Sequence Alignment
• Tree Drawing
• XML Data Aggregation
Overview
• Up to 10X Improvement over CPU
• On a 4GB GPU card:
– Can accumulate trees up to 45M vertices
– Can solve linear systems with up to 25M unknowns
• Runtime scales logarithmically with tree depth
Tree Accumulation
• Data stored at Tree Nodes
• Same operator applied to each edge
– e.g. +, ×, max, min, …
– Operator must be associative and commutative for parallel computation
• Compute and store all intermediate results
[Figure: example tree with node values by level (3 / 2 1 / 3 5 / 2 1 4) and a max operator on every edge]
Upwards Tree Accumulation
• Leaf -> Root
• Finds Maximum Descendant Value
[Figure: tree before upwards max accumulation; node values by level: 3 / 2 1 / 3 5 / 2 1 4]
[Figure: tree after upwards max accumulation; node values by level: 5 / 5 1 / 3 5 / 2 1 4]
Downwards Tree Accumulation
• Root -> Leaf
• Finds Maximum Ancestor Value
[Figure: tree before downwards max accumulation; node values by level: 3 / 2 1 / 3 5 / 2 1 4]
[Figure: tree after downwards max accumulation; node values by level: 3 / 3 3 / 3 5 / 3 5 5]
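A minimal Python sketch of the two max accumulations, run sequentially. The figure's edges did not survive extraction, so the `parent` array below is an assumption: it is one tree that reproduces the before and after values shown on the slides. The function names are illustrative only.

```python
# Tree as a parent array. Assumption: parent[v] < v for every non-root v, so
# increasing index order is top-down and decreasing index order is bottom-up.
parent = [-1, 0, 0, 1, 1, 3, 4, 4]   # hypothetical edges matching the slide values
values = [3, 2, 1, 3, 5, 2, 1, 4]

def accumulate_up(parent, values, op=max):
    """Leaf -> root: each node ends with op over itself and all its descendants."""
    acc = list(values)
    for v in range(len(parent) - 1, 0, -1):   # children before parents
        p = parent[v]
        acc[p] = op(acc[p], acc[v])
    return acc

def accumulate_down(parent, values, op=max):
    """Root -> leaf: each node ends with op over itself and all its ancestors."""
    acc = list(values)
    for v in range(1, len(parent)):           # parents before children
        acc[v] = op(acc[parent[v]], acc[v])
    return acc

up = accumulate_up(parent, values)
down = accumulate_down(parent, values)
print(up)    # [5, 5, 1, 3, 5, 2, 1, 4]
print(down)  # [3, 3, 3, 3, 5, 3, 5, 5]
```

Both results store every intermediate value, not just the value at the root or the leaves.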
Application:
Symmetric Tree + Star Graph Matrix
• Symmetric Matrix A
– Consider n×n matrix A with elements $a_{ij}$
– Symmetric: $a_{ij} = a_{ji}$
– Positive Definite: $x^T A x > 0$ for all $x \neq 0$
– Graph of A is a tree + star graph
• Admits Zero Fill-In Cholesky Factorization with suitable vertex ordering
• $A = LDL^T$
– D is diagonal, L is lower triangular
• Used in Tree-based Preconditioning
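The zero fill-in claim can be checked on a small example. The sketch below is an illustration under assumptions, not the authors' code: it uses a plain tree (no star vertex), made-up entries (4 on the diagonal, -1 per edge), and a dense textbook LDL^T. With children numbered before parents, L has nonzeros only where A does; with the reverse ordering it does not.

```python
def ldlt(A):
    """Dense LDL^T without pivoting: returns unit lower triangular L and diagonal D."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        D[j] = A[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] * D[k] for k in range(j))
            L[i][j] = (A[i][j] - s) / D[j]
    return L, D

def tree_matrix(edges, n, diag=4.0, off=-1.0):
    """Symmetric, diagonally dominant matrix whose graph is the given tree."""
    A = [[diag if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = off
    return A

def fill_in(A, L, tol=1e-12):
    """Count strictly lower entries that are nonzero in L but zero in A."""
    n = len(A)
    return sum(1 for i in range(n) for j in range(i)
               if abs(L[i][j]) > tol and A[i][j] == 0.0)

# Hypothetical 6-vertex tree; every child has a smaller index than its parent.
edges = [(0, 3), (1, 3), (2, 4), (3, 5), (4, 5)]
n = 6
A_good = tree_matrix(edges, n)
perm = list(reversed(range(n)))                       # root first: a poor ordering
A_bad = [[A_good[perm[i]][perm[j]] for j in range(n)] for i in range(n)]

fill_good = fill_in(A_good, ldlt(A_good)[0])
fill_bad = fill_in(A_bad, ldlt(A_bad)[0])
print(fill_good, fill_bad)
```

With the children-first ordering each vertex has a single higher-numbered neighbour (its parent) when it is eliminated, which is why no new entries appear.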
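Tree Matrix: Lower Triangular Solve
$$
\begin{bmatrix}
1 & & & & & \\
 & 1 & & & & \\
 & & 1 & & & \\
l_{14} & l_{24} & & 1 & & \\
 & & l_{35} & & 1 & \\
 & & & l_{46} & l_{56} & 1
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \end{bmatrix}
$$
$Lx = b$
[Figure: tree with root b6, children b4 and b5, leaves b1 and b2 under b4 and b3 under b5, and edge weights l14, l24, l35, l46, l56]
Tree Representation:
• Each row/column becomes a node.
• Each off-diagonal in L becomes an edge weight.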
Upwards Weighted Edge Accumulation
[Figure: a weighted tree shown before and after upwards accumulation, with an inset illustrating the edge operator]
Lower Triangular Solve
1. Initialize z = b
2. Accumulate Upwards
$Lz = b$, $Dy = z$, $L^T x = y$
[Figure: tree with root z6, children z4 and z5, leaves z1, z2, z3, and edge weights l14, l24, l35, l46, l56]
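A sequential sketch of the two steps above for $Lz = b$, assuming the six-node tree from the slides (0-based indices), a unit diagonal in L, and made-up numeric values for the edge weights. Row p of $Lz = b$ reads $z_p + \sum_c l_{cp} z_c = b_p$ over the children c of p, so the edge operator used here subtracts the weighted child from the parent.

```python
# parent[v] is the parent of node v (-1 for the root); l[v] is the off-diagonal
# of L on the edge v -> parent[v]. The numeric values are hypothetical.
parent = [3, 3, 4, 5, 5, -1]
l = [0.5, -0.25, 0.75, 0.2, -0.4, 0.0]
b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

def lower_solve(parent, l, b):
    """Solve L z = b by upwards accumulation (assumes children precede parents)."""
    z = list(b)                      # 1. initialize z = b
    for v in range(len(b)):          # 2. accumulate upwards
        p = parent[v]
        if p >= 0:
            z[p] -= l[v] * z[v]      # z[v] is already final when it is used
    return z

z = lower_solve(parent, l, b)

# Check every row of L z = b directly.
residual = max(
    abs(z[p] + sum(l[c] * z[c] for c in range(len(b)) if parent[c] == p) - b[p])
    for p in range(len(b))
)
print(residual)
```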
Upper Triangular Solve
1. Initialize $x = D^{-1} z$
2. Accumulate Downwards
$Lz = b$, $Dy = z$, $L^T x = y$
[Figure: tree with root x6, children x4 and x5, leaves x1, x2, x3, and edge weights l14, l24, l35, l46, l56]
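Putting the three stages together, here is a hedged sequential sketch of the full $A = LDL^T$ solve on a tree: accumulate upwards for $Lz = b$, scale by $D^{-1}$, then accumulate downwards for $L^T x = y$. The tree, weights, and diagonal are made-up values, and `tree_ldlt_solve` is an illustrative name. The result is checked against a dense product.

```python
def tree_ldlt_solve(parent, l, d, b):
    """Solve (L D L^T) x = b; assumes every child has a smaller index than its parent."""
    n = len(b)
    z = list(b)
    for v in range(n):                        # L z = b: accumulate upwards
        if parent[v] >= 0:
            z[parent[v]] -= l[v] * z[v]
    x = [z[v] / d[v] for v in range(n)]       # 1. initialize x = D^{-1} z
    for v in range(n - 1, -1, -1):            # 2. accumulate downwards
        if parent[v] >= 0:
            x[v] -= l[v] * x[parent[v]]       # row v of L^T: x_v + l_v * x_parent = y_v
    return x

parent = [3, 3, 4, 5, 5, -1]                  # hypothetical values throughout
l = [0.5, -0.25, 0.75, 0.2, -0.4, 0.0]
d = [2.0, 3.0, 1.5, 4.0, 2.5, 5.0]
b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = tree_ldlt_solve(parent, l, d, b)

# Verify against a dense A = L D L^T.
n = len(b)
L = [[float(i == j) for j in range(n)] for i in range(n)]
for c, p in enumerate(parent):
    if p >= 0:
        L[p][c] = l[c]
A = [[sum(L[i][k] * d[k] * L[j][k] for k in range(n)) for j in range(n)]
     for i in range(n)]
residual = max(abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
print(residual)
```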
Parallel Tree Accumulation
• Sequence of Parallel Steps
– Parallel Step: SpMV
– Control Flow: Parallel Scan
• Operator Properties
– Associative, Commutative
– Distributive for linear algebra, e.g. × over +
• Efficient when the tree structure is used many times
Parallel Step
Child to Parent Reduction
Segmented Reduction (i.e. SpMV)
Parallel Reduce
[Figure: forest with parents b3 (children b1, b2), b7 (children b4, b5, b6) and b9 (child b8); edge weights 3 and 2 below b3, 3, 2 and 4 below b7, and 2 below b9; written as a 9×9 sparse matrix with unit diagonal and one off-diagonal entry per child-to-parent edge]
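One way to picture the child-to-parent step is as a CSR matrix-vector product in which every row is a segment that is reduced independently. The sketch below assumes the forest from the figure with 0-based indices; which child carries which weight is a guess, since the figure layout was lost, and the helper names are illustrative.

```python
def reduction_matrix(n, edges):
    """CSR matrix with unit diagonal plus weight w at (parent, child) for each edge."""
    rows = [[(r, 1)] for r in range(n)]
    for child, parent, w in edges:
        rows[parent].append((child, w))
    indptr, indices, data = [0], [], []
    for row in rows:
        for col, val in sorted(row):
            indices.append(col)
            data.append(val)
        indptr.append(len(indices))
    return indptr, indices, data

def spmv_csr(indptr, indices, data, x):
    """y = M x; each row (segment) could be reduced by its own thread group."""
    return [sum(data[k] * x[indices[k]] for k in range(indptr[r], indptr[r + 1]))
            for r in range(len(indptr) - 1)]

# (child, parent, weight) triples; the weight assignment is hypothetical.
edges = [(0, 2, 3), (1, 2, 2), (3, 6, 3), (4, 6, 2), (5, 6, 4), (7, 8, 2)]
M = reduction_matrix(9, edges)
b = [1, 2, 3, 4, 5, 6, 7, 8, 9]
x = spmv_csr(*M, b)
print(x)  # [1, 2, 10, 4, 5, 6, 53, 8, 25]
```

Only the three parent rows change; every other row contains just its diagonal entry and passes its value through.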
Parallel Step
Parent to Child Scatter
No Reduction (Scatter: × and +)
Parallel Scatter
[Figure: forest with parents b3 (children b1, b2), b7 (children b4, b5, b6) and b9 (child b8), written as a 9×9 sparse matrix with unit diagonal and one off-diagonal entry per parent-to-child edge]
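In the parent-to-child direction each child reads exactly one parent, so every output element has a single multiply and add and no reduction is needed. A minimal sketch, assuming the same forest with 0-based indices and a hypothetical weight assignment:

```python
def scatter_step(parent, weight, x):
    """out[v] = x[v] + weight[v] * x[parent[v]]; roots of this step are copied."""
    return [x[v] + weight[v] * x[parent[v]] if parent[v] >= 0 else x[v]
            for v in range(len(x))]

# parent[v] = -1 marks nodes that receive nothing in this step.
parent = [2, 2, -1, 6, 6, 6, -1, 8, -1]
weight = [3, 2, 0, 3, 2, 4, 0, 2, 0]      # hypothetical edge weights
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
out = scatter_step(parent, weight, x)
print(out)  # [10, 8, 3, 25, 19, 34, 7, 26, 9]
```

All reads come from the input vector, so the elements can be computed in any order or all at once.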
Control Flow
• Scan between levels
• Blelloch parallel scan algorithm¹
• Scan algorithm defines the matrices
[Figure: Parallel Scan Algorithm; tree with root b6, children b4 and b5, leaves b1, b2, b3, edge weights l14, l24, l35, l46, l56, and Level 1 to Level 3 marked]
¹ Guy E. Blelloch. “Prefix Sums and Their Applications”. In John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1990.
Path Compression
• Allows jumps between non-adjacent levels
• Precompute combined edge weights
• Works because multiplication distributes over addition
[Figure: tree with root b6, children b4 and b5, leaves b1, b2, b3, Level 1 to Level 3 marked, and a compressed edge from node 1 to node 6 carrying the combined weight l14 l46]
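The combined weight for a jump is the product of the edge weights along the skipped path, e.g. l14·l46 for the jump from node 1 to node 6. The sketch below precomputes such products all the way to the root by pointer jumping and checks that they give the same root value as a level-by-level accumulation. The tree and the integer weights are made up, and `compress_to_root` is an illustrative name.

```python
def compress_to_root(parent, w):
    """Pointer jumping: returns cw with cw[v] = product of edge weights from v to the root."""
    n = len(parent)
    ptr = [p if p >= 0 else v for v, p in enumerate(parent)]   # root points to itself
    cw = [w[v] if parent[v] >= 0 else 1 for v in range(n)]     # root weight is 1
    while True:
        new_ptr = [ptr[ptr[v]] for v in range(n)]
        new_cw = [cw[v] * cw[ptr[v]] for v in range(n)]
        if new_ptr == ptr:            # every node already points at the root
            return cw
        ptr, cw = new_ptr, new_cw

parent = [3, 3, 4, 5, 5, -1]
w = [2, 3, 5, 7, 11, 0]               # hypothetical weight on edge v -> parent[v]
b = [1, 1, 1, 1, 1, 1]

cw = compress_to_root(parent, w)
root_direct = sum(cw[v] * b[v] for v in range(len(b)))

# Reference: accumulate one edge at a time, children before parents.
acc = list(b)
for v in range(len(b)):
    if parent[v] >= 0:
        acc[parent[v]] += w[v] * acc[v]
root_seq = acc[5]

print(cw, root_direct, root_seq)  # [14, 21, 55, 7, 11, 1] 109 109
```

The two root values agree because w46·(b4 + w14·b1) = w46·b4 + (w14·w46)·b1, which is the distributive property the slide relies on.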
Reduction Phase
• k = log2(length)
• There are k reduction phases
• At all power of 2 positions, the final value is computed
Blelloch Addition Example
Input: 1 2 3 4 5 6 7 8
After phase 1: 3 7 11 15
After phase 2: 10 26
After phase 3: 36
Distribute Phase
• There are k distribution phases
• Values at indices that are multiples of $2^m$ are distributed to the other positions
– m = 1, 2, …, k
Blelloch Addition Example
[Figure: distribute phase applied to the reduced array 1 3 3 10 5 11 7 36, yielding the prefix sums 1 3 6 10 15 21 28 36]
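The reduction and distribution phases can be sketched in a few lines. This is a common inclusive up-sweep/down-sweep variant, assumed here for illustration; its distribution schedule may differ in detail from the one on the slides, and it assumes the length is a power of two. Within each phase the inner loop iterations are independent, which is what makes the phase parallel.

```python
def inclusive_scan(values, op=lambda a, b: a + b):
    """In-place style scan; len(values) is assumed to be a power of two."""
    a = list(values)
    n = len(a)
    stride = 1
    while stride < n:                              # reduction phases
        for i in range(2 * stride - 1, n, 2 * stride):
            a[i] = op(a[i - stride], a[i])
        stride *= 2
    stride = n // 2
    while stride > 1:                              # distribution phases
        half = stride // 2
        for i in range(stride - 1, n - half, stride):
            a[i + half] = op(a[i], a[i + half])
        stride //= 2
    return a

result = inclusive_scan([1, 2, 3, 4, 5, 6, 7, 8])
print(result)  # [1, 3, 6, 10, 15, 21, 28, 36]
```

After the reduction phases alone the array is 1 3 3 10 5 11 7 36, which matches the partial sums 3, 7, 11, 15, then 10, 26, then 36 from the addition example.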
Explicit Blelloch Matrices
Each Circled Edge is an entry in the Matrix
Communication Between Levels
[Figure: reduction diagram for the inputs 1 to 8 with partial sums 3, 7, 11, 15; the circled edges mark the matrix entries]
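Sequence of Matrix Vector Products
• Reduction Matrices: $\{M_1, M_2, \ldots, M_k\}$
• Distribution Matrices: $\{N_1, N_2, \ldots, N_k\}$
Each step is a sparse matrix
[Figure: scan diagram with the reduction steps labelled M1, M2, M3 and the distribution steps labelled N1, N2, N3]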
Upwards Accumulation
z = b
for i in 1, 2, …, k:
    z = SpMV(M_i, z)
for i in k, k-1, …, 1:
    z = SpMV(N_i, z)
$Lz = b$, $Dy = z$, $L^T x = y$
[Figure: scan diagram with the reduction steps M1, M2, M3 and the distribution steps N1, N2, N3]
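To make the sequence of products concrete, the sketch below builds explicit matrices for a plain prefix sum of length 8 and applies them in the order of the pseudocode. It is an illustration under assumptions: each matrix is stored as a list of (row, col) off-diagonal ones with an implicit unit diagonal, and the schedule is a common inclusive-scan variant with one fewer distribution matrix than reduction matrices, which may differ from the slides.

```python
def scan_edges(n):
    """Off-diagonal (row, col) entries of M_1..M_k and N_1..; n is a power of two."""
    Ms, Ns = [], []
    stride = 1
    while stride < n:                                   # reduction matrices
        Ms.append([(i, i - stride) for i in range(2 * stride - 1, n, 2 * stride)])
        stride *= 2
    stride = 2
    while stride < n:                                   # distribution matrices
        half = stride // 2
        Ns.append([(i + half, i) for i in range(stride - 1, n - half, stride)])
        stride *= 2
    return Ms, Ns

def spmv(edges, x):
    """y = (I + E) x, where E has a 1 at every (row, col) in edges."""
    y = list(x)
    for row, col in edges:
        y[row] += x[col]
    return y

def upward(Ms, Ns, b):
    z = list(b)
    for M in Ms:                # i = 1, 2, ..., k
        z = spmv(M, z)
    for N in reversed(Ns):      # i = k, k-1, ..., 1
        z = spmv(N, z)
    return z

Ms, Ns = scan_edges(8)
z = upward(Ms, Ns, [1, 2, 3, 4, 5, 6, 7, 8])
print(z)  # [1, 3, 6, 10, 15, 21, 28, 36]
```

For a tree solve the off-diagonal entries would carry (compressed) edge weights instead of ones; the control flow stays the same.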
Downwards Accumulation
x = y
for i in 1, 2, …, k:
    x = SpMV(N_i^T, x)
for i in k, k-1, …, 1:
    x = SpMV(M_i^T, x)
Downwards Optimization:
spmv -> scatter, +, ×
No reductions down the tree
$Lz = b$, $Dy = z$, $L^T x = y$
[Figure: scan diagram with the steps labelled N1^T, N2^T, N3^T and M3^T, M2^T, M1^T]
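Applying the transposed matrices in the reverse order gives the transpose of the upwards operator. For a plain prefix sum that transpose is a suffix sum, which makes the ordering easy to check. The sketch assumes the same (row, col) edge-list representation with an implicit unit diagonal and the same hypothetical inclusive-scan schedule; each transposed step is a pure scatter because every column holds at most one off-diagonal entry.

```python
def scan_edges(n):
    """Off-diagonal (row, col) entries of the reduction (Ms) and distribution (Ns) steps."""
    Ms, Ns = [], []
    stride = 1
    while stride < n:
        Ms.append([(i, i - stride) for i in range(2 * stride - 1, n, 2 * stride)])
        stride *= 2
    stride = 2
    while stride < n:
        half = stride // 2
        Ns.append([(i + half, i) for i in range(stride - 1, n - half, stride)])
        stride *= 2
    return Ms, Ns

def spmv_t(edges, x):
    """y = (I + E)^T x: each entry sends x[row] to position col (a scatter)."""
    y = list(x)
    for row, col in edges:
        y[col] += x[row]
    return y

def downward(Ms, Ns, y):
    x = list(y)
    for N in Ns:                # N_1^T, N_2^T, ...
        x = spmv_t(N, x)
    for M in reversed(Ms):      # M_k^T, ..., M_1^T
        x = spmv_t(M, x)
    return x

Ms, Ns = scan_edges(8)
x = downward(Ms, Ns, [1, 2, 3, 4, 5, 6, 7, 8])
print(x)  # [36, 35, 33, 30, 26, 21, 15, 8]
```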
Algorithmic Preprocessing Details
• Factorization is on CPU
• Euler Tour Technique + List Ranking
– Parent Relationships
– Vertex Depth
• Pointer Jumping (Downwards Accumulation)
– Compressed Edge Weights
• All implemented using Thrust library primitives
– Scan, radix sort
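As a rough illustration of the pointer jumping idea, the sketch below computes vertex depth directly from a parent array in a logarithmic number of rounds. This is a simplified stand-in, not the Euler tour plus list ranking pipeline named above, and it is plain Python rather than Thrust; the tree is made up.

```python
def depths_by_pointer_jumping(parent):
    """Each round doubles the distance every pointer covers; the root points to itself."""
    n = len(parent)
    ptr = [p if p >= 0 else v for v, p in enumerate(parent)]
    dist = [0 if parent[v] < 0 else 1 for v in range(n)]
    while True:
        new_dist = [dist[v] + dist[ptr[v]] for v in range(n)]   # independent per vertex
        new_ptr = [ptr[ptr[v]] for v in range(n)]
        if new_ptr == ptr:        # all pointers have reached the root
            return dist
        dist, ptr = new_dist, new_ptr

parent = [3, 3, 4, 5, 5, -1]      # hypothetical tree, root is vertex 5
depth = depths_by_pointer_jumping(parent)
print(depth)  # [2, 2, 2, 1, 1, 0]
```

Every round reads only the previous round's arrays, so each round maps onto a data-parallel gather.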
Test Set-Up
• Machines
– Tesla T10 GPU with 4GB Memory (1/4 S1070)
– Intel Nehalem @ 2.8 GHz
• Random Tree Generator
– Each Vertex has between 0 and 2*c-1 children
– Depth Factor scales between 0 (Uniform Tree) and 1 (Linear Tree)
CPU vs. GPU
Tree properties:
• Average children per vertex is 3
• Depth Factor = 0.5 -> depth is approximately (number of vertices)/6
[Figure: two plots, one for Downwards Accumulation and one for Upwards Accumulation]
Tree vs. Generic Solve (cuSPARSE)
Tree properties:
• Average of 3 children per vertex
• Depth Factor (DF) = 0.05 -> depth is approximately n/60
• Depth Factor = 0 -> depth < 20; Best Case for cuSPARSE
[Figure: two plots, Initialization and Solve, with the series cuSPARSE (DF 0.05), cuSPARSE (DF 0.0), Tree Solve (DF 0.05) and Tree Solve (DF 0.0); annotation: Tree Solve performance insensitive to depth]
Minimum Spanning Tree Solve
Solving a reduced system
Graphs from Florida Sparse Matrix Collection
5X-100X Speedup
[Figure: a graph G and its spanning tree T; two plots, Initialization and Solve]
Conclusions
• Up to 10X faster than CPU
• Significantly more efficient than using a generic solver
– Greater than 5X faster for all instances tested
– Runtime unaffected by tree depth
• On a 4GB GPU card:
– Can accumulate trees up to 45M vertices
– Can solve linear systems with up to 25M unknowns
• Data-Parallel Primitives are effective for building irregular algorithms
Thank You
List of Graphs From Florida Sparse Matrix Collection
• roadNet-TX
• roadNet-PA
• roadNet-CA
• RM07R
• wikipedia-20070206
• wb-edu
• thermomech_dK
• soc-LiveJournal1
• af_shell10
• atmosmodl
• audikw_1
• bone010
• bone010_M
• circuit5M
• Freescale1
• kkt_power
• mouse_gene
• nlpkkt120