
© Synopsys 2012 1

Tree Accumulations on GPU

With Applications to Sparse Linear Algebra

Scott Rostrup, Shweta Srivastava, Kishore Singhal

Synopsys Inc.


Applications

• Tree Structure Computations

– e.g. subtree size, subtree weight, depth, …

• Sparse Iterative Solvers

– Direct Solve of Tree Matrix (Preconditioning)

• Fast Multipole Method

• DNA/Protein Sequence Alignment

• Tree Drawing

• XML Data Aggregation


Overview

• Up to 10X Improvement over CPU

• On a 4GB GPU card:

– Can accumulate trees up to 45M vertices

– Can solve Linear systems with up to 25M unknowns

• Performance is logarithmic with respect to tree depth


Tree Accumulation

• Data stored at Tree Nodes

• Same operator applied to each edge

– e.g. +, ×, max, min, …

– Associative and Commutative required for parallel computation

• Compute and Store all intermediate results

[Figure: example tree with node values, by level from the root: 3 | 2, 1 | 3, 5 | 2, 1, 4; the max operator is applied along every edge]

Upwards Tree Accumulation

• Leaf -> Root

• Finds Maximum Descendant Value

[Figure: node values by level from the root. Before: 3 | 2, 1 | 3, 5 | 2, 1, 4. After upwards max accumulation: 5 | 5, 1 | 3, 5 | 2, 1, 4]
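A minimal sequential sketch of the upwards accumulation, in Python. The tree shape is an assumption inferred from the before/after values on the slide (root 3 with children 2 and 1; node 2 with children 3 and 5; node 3 with child 2; node 5 with children 1 and 4). The node ids and the `upward_accumulate` helper are hypothetical and are not the GPU implementation.

```python
from operator import add

# Hypothetical node ids 0..7; parent[v] is the parent of v, -1 marks the root.
# Shape inferred from the slide's before/after values (an assumption).
parent = [-1, 0, 0, 1, 1, 3, 4, 4]
values = [3, 2, 1, 3, 5, 2, 1, 4]


def upward_accumulate(parent, values, op):
    """Combine each node's value with all of its descendants' values, leaf to root."""
    out = list(values)
    # Children have larger ids than their parents here, so a reverse sweep
    # visits every child before its parent.
    for v in range(len(parent) - 1, 0, -1):
        p = parent[v]
        out[p] = op(out[p], out[v])
    return out


up_max = upward_accumulate(parent, values, max)
subtree_size = upward_accumulate(parent, [1] * len(parent), add)
print(up_max)        # [5, 5, 1, 3, 5, 2, 1, 4]
print(subtree_size)  # [8, 6, 1, 2, 3, 1, 1, 1]
```

Swapping `max` for `+` on a vector of ones gives the subtree sizes, which is one of the tree structure computations listed under Applications.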

Downwards Tree Accumulation

• Root -> Leaf

• Finds Maximum Ancestor Value

[Figure: node values by level from the root. Before: 3 | 2, 1 | 3, 5 | 2, 1, 4. After downwards max accumulation: 3 | 3, 3 | 3, 5 | 3, 5, 5]
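The downwards direction can be sketched the same way. As before, the tree shape is an assumption inferred from the slide values, and `downward_accumulate` is a hypothetical sequential helper rather than the GPU algorithm.

```python
# Hypothetical tree inferred from the slide values (an assumption):
# parent[v] is the parent of v, -1 marks the root, and parent[v] < v.
parent = [-1, 0, 0, 1, 1, 3, 4, 4]
values = [3, 2, 1, 3, 5, 2, 1, 4]


def downward_accumulate(parent, values, op):
    """Combine each node's value with all of its ancestors' values, root to leaf."""
    out = list(values)
    # A forward sweep visits every parent before its children.
    for v in range(1, len(parent)):
        out[v] = op(out[parent[v]], out[v])
    return out


down_max = downward_accumulate(parent, values, max)
# Depth: the root contributes 0 and every other node contributes 1.
depth = downward_accumulate(parent, [0] + [1] * (len(parent) - 1), lambda a, b: a + b)
print(down_max)  # [3, 3, 3, 3, 5, 3, 5, 5]
print(depth)     # [0, 1, 1, 2, 2, 3, 3, 3]
```

Each node reads only from its single parent, so no values are combined across siblings on the way down.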


Application: Symmetric Tree + Star Graph Matrix

• Symmetric Matrix A

– Consider n×n matrix A with elements $a_{ij}$

– Symmetric: $a_{ij} = a_{ji}$

– Positive Definite: $x^T A x > 0$ for all $x \neq 0$

– Graph of A is a tree + star graph

• Admits Zero Fill-In Cholesky Factorization with suitable vertex ordering

• $A = LDL^T$

– D is diagonal, L is lower triangular

• Used in Tree-based Preconditioning


Tree Matrix: Lower Triangular Solve

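$$
\begin{bmatrix}
1 & & & & & \\
 & 1 & & & & \\
 & & 1 & & & \\
l_{14} & l_{24} & & 1 & & \\
 & & l_{35} & & 1 & \\
 & & & l_{46} & l_{56} & 1
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \end{bmatrix}
$$

[Figure: the same system as a tree with root b6; children b4 and b5 (edges l46, l56); leaves b1 and b2 under b4 (edges l14, l24) and b3 under b5 (edge l35)]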

Tree Representation:

• Each row/column becomes a node.

• Each off-diagonal in L becomes an edge weight.

$Lx = b$


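Upwards Weighted Edge Accumulation

[Figure: a weighted tree Before and After upwards accumulation; the root value goes from 1 to 64 and its two children from 2 and 3 to 45 and 9]

[Figure: Edge Operator, each child value is multiplied by its edge weight and added to the parent, e.g. 2 + 2 × 2 = 6]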


Lower Triangular Solve

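[Figure: tree with root z6; children z4 and z5 (edges l46, l56); leaves z1 and z2 under z4 (edges l14, l24) and z3 under z5 (edge l35)]

1. Initialize z=b

2. Accumulate Upwards

$Lz = b$, $Dy = z$, $L^T x = y$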
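The two steps above can be sketched sequentially in Python for the six-node example tree. The numeric edge weights, the right-hand side and the `lower_solve` helper are made up for illustration; only the tree shape comes from the slide. Because L has a unit diagonal, each edge subtracts `l * z[child]` from the parent.

```python
# Hypothetical numeric edge weights for the six-node tree on the slide.
# edges[child] = (parent, l) means L[parent][child] = l (1-based node ids).
edges = {1: (4, 0.5), 2: (4, -1.0), 3: (5, 2.0), 4: (6, 0.25), 5: (6, 3.0)}
n = 6
b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]


def lower_solve(edges, b):
    """Solve L z = b for unit lower triangular L whose graph is a tree."""
    z = list(b)                      # 1. Initialize z = b
    for child in sorted(edges):      # 2. Accumulate upwards, children first
        parent, l = edges[child]
        z[parent - 1] -= l * z[child - 1]
    return z


z = lower_solve(edges, b)

# Check against the dense matrix: row i of L times z must reproduce b[i].
L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
for child, (parent, l) in edges.items():
    L[parent - 1][child - 1] = l
residual = [sum(L[i][j] * z[j] for j in range(n)) - b[i] for i in range(n)]
print(z, max(abs(r) for r in residual))
```

The sweep relies on every child having a smaller id than its parent, which holds for the vertex ordering shown in the figure.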


Upper Triangular Solve

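1. Initialize $x = D^{-1}z$

2. Accumulate Downwards

$Lz = b$, $Dy = z$, $L^T x = y$

[Figure: tree with root x6; children x4 and x5 (edges l46, l56); leaves x1 and x2 under x4 (edges l14, l24) and x3 under x5 (edge l35)]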
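A matching sequential sketch of the upper triangular solve, again in Python. The edge weights, the diagonal `d`, the input `z` and the `upper_solve` helper are hypothetical; only the tree shape is taken from the slide. Each node reads from its single parent, so the step is a scatter with no reduction.

```python
# Hypothetical six-node tree: edges[child] = (parent, l), 1-based ids,
# meaning L[parent][child] = l. All numbers are made up for illustration.
edges = {1: (4, 0.5), 2: (4, -1.0), 3: (5, 2.0), 4: (6, 0.25), 5: (6, 3.0)}
n = 6
d = [2.0, 4.0, 1.0, 0.5, 2.0, 8.0]          # diagonal of D
z = [1.0, 2.0, 3.0, 5.5, -1.0, 7.625]       # assumed output of a lower solve


def upper_solve(edges, d, z):
    """Solve L^T x = D^{-1} z for unit lower triangular L whose graph is a tree."""
    x = [zi / di for zi, di in zip(z, d)]        # 1. Initialize x = D^{-1} z
    for child in sorted(edges, reverse=True):    # 2. Accumulate downwards, parents first
        parent, l = edges[child]
        x[child - 1] -= l * x[parent - 1]        # scatter only, no reduction
    return x


x = upper_solve(edges, d, z)

# Check: row i of L^T times x must reproduce y[i] = z[i] / d[i].
LT = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
for child, (parent, l) in edges.items():
    LT[child - 1][parent - 1] = l
residual = [sum(LT[i][j] * x[j] for j in range(n)) - z[i] / d[i] for i in range(n)]
print(x, max(abs(r) for r in residual))
```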


Parallel Tree Accumulation

• Sequence of Parallel Steps

– Parallel Step: SpMV

– Control Flow: Parallel Scan

• Operator Properties

– Associative, Commutative

– Distributive for linear algebra: e.g. + and ×

• Efficient when tree structure used many times


Parallel Step

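[Figure: one parallel step on nine nodes written as a matrix-vector product with a unit diagonal. The corresponding forest has b3 with children b1 and b2 (edge weights 3, 2), b7 with children b4, b5 and b6 (edge weights 3, 2, 4), and b9 with child b8 (edge weight 2)]

Child to Parent Reduction

Segmented Reduction (i.e. SpMV)

Parallel Reduce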
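One child-to-parent step can be sketched as a map over edges followed by a segmented sum, shown here sequentially in Python. The forest is the nine-node example from the slide. The pairing of weights to children follows the order in which they appear and is an assumption, as are the node values `b` and the helper name.

```python
# Forest from the slide: node 3 <- {1, 2}, node 7 <- {4, 5, 6}, node 9 <- {8}.
# Which weight belongs to which child is an assumption.
parent_of = [3, 3, 7, 7, 7, 9]      # segment key of each edge, sorted by parent
child_of = [1, 2, 4, 5, 6, 8]
weight = [3.0, 2.0, 3.0, 2.0, 4.0, 2.0]
b = [1.0] * 9                       # hypothetical node values b1..b9


def child_to_parent_step(parent_of, child_of, weight, b):
    """One parallel step: x[p] = b[p] + sum over children c of weight * b[c]."""
    # Map phase: every edge computes its product independently.
    products = [w * b[c - 1] for c, w in zip(child_of, weight)]
    # Segmented reduction: sum the products that share a parent.
    x = list(b)
    for p, prod in zip(parent_of, products):
        x[p - 1] += prod
    return x


x = child_to_parent_step(parent_of, child_of, weight, b)
print(x)  # [1.0, 1.0, 6.0, 1.0, 1.0, 1.0, 10.0, 1.0, 3.0]
```

This is the same result as multiplying `b` by the identity plus the edge-weight matrix, which is why the step can be expressed as an SpMV.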

[Figure: the transposed matrix for the same nine-node forest, with each parent value passed down to its children]

Parent to Child Scatter

No Reduction (Scatter: × and +)

Parallel Scatter


Control Flow

• Scan between levels

• Blelloch parallel scan algorithm¹

• Scan algorithm defines the matrices

[Figure: Parallel Scan Algorithm, shown on the example tree (b6; b4, b5; b1, b2, b3 with edges l46, l56, l14, l24, l35) with its rows labelled Level 1, Level 2 and Level 3]

¹ Guy E. Blelloch. "Prefix Sums and Their Applications". In John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1990.


Path Compression

• Allows jumps between non-adjacent levels

• Precompute combined edge weights

• Works because multiplication distributes over addition

[Figure: the example tree (b6; b4, b5; b1, b2, b3) across Level 1 to Level 3, with an added combined edge weight l14 l46]
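The distributive step can be written out for the path 1 → 4 → 6 of the example tree. This is a sketch that assumes the multiply-add edge operator (the parent gains weight × child) and writes $z_i$ for the value stored at node $i$; any signs from the triangular solve are taken to be absorbed into the weights.

```latex
z_6 \leftarrow z_6 + l_{46}\,\bigl(z_4 + l_{14}\, z_1\bigr)
    = z_6 + l_{46}\, z_4 + (l_{14}\, l_{46})\, z_1
```

With the product $l_{14} l_{46}$ precomputed, node 1 can contribute to node 6 directly, without first waiting for node 4 to be updated.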


Reduction Phase

• $k = \log_2(\text{length})$

• There are k reduction phases

• At all power of 2 positions, the final value is computed

[Figure: Blelloch Addition Example on the input 1, 2, …, 8: pairwise sums 3, 7, 11, 15, then 10, 26, then 36]


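Distribute Phase

• There are k distribution phases

• Values at multiples of $2^m$ indices are distributed to the other positions

– m=1,2,…,k

[Figure: Blelloch Addition Example, distribute phase: starting from the reduced array 1, 3, 3, 10, 5, 11, 7, 36, the remaining positions are filled in to give 1, 3, 6, 10, 15, 21, 28, 36]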
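The two phases of the addition example can be sketched in Python as follows. This is the inclusive variant that reproduces the numbers on the slides; the classic Blelloch formulation produces an exclusive scan instead. The function name is hypothetical and the loops are sequential stand-ins for what would be parallel steps.

```python
def inclusive_scan(values):
    """Inclusive prefix sum via a reduction phase and a distribute phase.

    Assumes len(values) is a power of two. Within each phase the inner loop
    reads and writes disjoint positions, so it could run as one parallel step.
    """
    a = list(values)
    n = len(a)
    k = n.bit_length() - 1          # k = log2(length)
    assert n == 1 << k, "length must be a power of two"

    # Reduction phase d = 1..k: positions that are multiples of 2^d
    # (1-based) absorb the partial sum 2^(d-1) positions to their left.
    for d in range(1, k + 1):
        half = 1 << (d - 1)
        for pos in range(1 << d, n + 1, 1 << d):
            a[pos - 1] += a[pos - half - 1]

    # Distribute phase m = k..1: the value at each multiple of 2^m is added
    # to the position 2^(m-1) to its right.
    for m in range(k, 0, -1):
        half = 1 << (m - 1)
        for src in range(1 << m, n - half + 1, 1 << m):
            a[src + half - 1] += a[src - 1]
    return a


result = inclusive_scan([1, 2, 3, 4, 5, 6, 7, 8])
print(result)  # [1, 3, 6, 10, 15, 21, 28, 36]
```

For a length of exactly $2^k$ the distribute phase with m = k has no targets, so only k − 1 of the phases do any work in this example.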


Explicit Blelloch Matrices

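[Figure: first reduction level of the addition example, from the input 1, 2, …, 8 to the pairwise sums 3, 7, 11, 15, showing the communication between levels]

Each Circled Edge is an entry in the Matrix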


Sequence of Matrix Vector Products

• Reduction Matrices: $\{M_1, M_2, \ldots, M_k\}$

• Distribution Matrices: $\{N_1, N_2, \ldots, N_k\}$

Each step is a sparse matrix

[Figure: the scan levels labelled with their matrices M1, M2, M3 and N1, N2, N3]


Upwards Accumulation

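    z = b
    for i in 1,2, … ,k: z = SpMV(Mi,z)
    for i in k,k-1, … ,1: z = SpMV(Ni,z)

$Lz = b$, $Dy = z$, $L^T x = y$

[Figure: the reduction matrices M1, M2, M3 are applied in order, followed by the distribution matrices N3, N2, N1]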
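The loop above can be tried on the prefix-sum example rather than on a tree. The sketch below builds the reduction and distribution matrices for an inclusive scan of length 8 and applies them as a sequence of sparse matrix-vector products. The row-list matrix format and the `spmv` and `scan_matrices` helpers are hypothetical; for a tree solve the off-diagonal entries would carry edge weights instead of ones.

```python
def spmv(rows, x):
    """Sparse matrix-vector product; rows[i] is a list of (column, value) pairs."""
    return [sum(v * x[j] for j, v in row) for row in rows]


def scan_matrices(n):
    """Build M_1..M_k and N_1..N_k for an inclusive prefix sum of length n = 2^k."""
    k = n.bit_length() - 1
    M, N = [], []
    for i in range(1, k + 1):
        half, step = 1 << (i - 1), 1 << i
        m = [[(r, 1)] for r in range(n)]          # start from the identity
        for pos in range(step, n + 1, step):      # 1-based multiples of 2^i
            m[pos - 1].append((pos - half - 1, 1))
        nm = [[(r, 1)] for r in range(n)]         # identity again
        for src in range(step, n - half + 1, step):
            nm[src + half - 1].append((src - 1, 1))
        M.append(m)
        N.append(nm)
    return M, N


b = [1, 2, 3, 4, 5, 6, 7, 8]
M, N = scan_matrices(len(b))

z = b
for Mi in M:                 # for i in 1, 2, ..., k
    z = spmv(Mi, z)
for Ni in reversed(N):       # for i in k, k-1, ..., 1
    z = spmv(Ni, z)
print(z)  # [1, 3, 6, 10, 15, 21, 28, 36]
```

Each off-diagonal entry corresponds to one circled edge of the scan diagram, and the matrices depend only on the structure, so they can be built once and reused for every right-hand side.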


Downwards Accumulation

Downwards Optimization: SpMV -> scatter, +, ×

No reductions down the tree

    x = y
    for i in 1,2, … ,k: x = SpMV(Ni^T,x)
    for i in k,k-1, … ,1: x = SpMV(Mi^T,x)

$Lz = b$, $Dy = z$, $L^T x = y$

[Figure: the transposed matrices N1^T, N2^T, N3^T are applied first, followed by M3^T, M2^T, M1^T]


Algorithmic Preprocessing Details

• Factorization is on CPU

• Euler Tour Technique + List Ranking

– Parent Relationships

– Vertex Depth

• Pointer Jumping (Downwards Accumulation)

– Compressed Edge Weights

• All implemented using Thrust library primitives

– Scan, radix sort
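Pointer jumping can be illustrated with a small Python sketch. The slides do not show how the compressed edge weights are computed, so everything here is an assumption: the `pointer_jump` helper, the made-up numeric weights, and the choice to return only the product along each vertex's path to the root.

```python
def pointer_jump(parent, weight):
    """Compress paths to the root by repeated pointer jumping.

    parent[v] is the parent of v (the root points to itself) and weight[v]
    is the weight of the edge from v to parent[v] (1.0 at the root).
    Returns the product of edge weights on each vertex's path to the root
    and the number of doubling rounds used.
    """
    anc = list(parent)
    w = list(weight)
    rounds = 0
    while any(anc[a] != a for a in anc):
        # Each round reads the old arrays and builds new ones, so all
        # vertices could be updated in parallel.
        w = [w[v] * w[anc[v]] if anc[v] != v else w[v] for v in range(len(anc))]
        anc = [anc[a] for a in anc]
        rounds += 1
    return w, rounds


# Six-node example tree, 0-based: nodes 1..6 become 0..5, node 6 is the root.
parent = [3, 3, 4, 5, 5, 5]
weight = [0.5, -1.0, 2.0, 0.25, 3.0, 1.0]   # l14, l24, l35, l46, l56, root
path_weight, rounds = pointer_jump(parent, weight)
print(path_weight, rounds)  # [0.125, -0.25, 6.0, 0.25, 3.0, 1.0] 1
```

The first entry is the product of the weights for l14 and l46, the combined edge weight shown on the Path Compression slide. The number of rounds grows with the logarithm of the tree depth, since every round doubles the distance each pointer covers.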


Test Set-Up

• Machines

– Tesla T10 GPU with 4GB Memory (1/4 S1070)

– Intel Nehalem @ 2.8 GHz


• Random Tree Generator

– Each Vertex has between 0 and 2*c-1 children

– Depth Factor scales between 0 and 1

[Figure: Depth Factor 0 gives a Uniform Tree; Depth Factor 1 gives a Linear Tree]


CPU vs. GPU

Tree properties:

• Average children per vertex is 3

• Depth Factor = 0.5 -> depth is approximately (number of vertices)/6

[Figure: two plots comparing CPU and GPU, one for Downwards Accumulation and one for Upwards Accumulation]


Tree vs. Generic Solve (cuSPARSE)

Tree properties:

• Average of 3 children per vertex

• Depth Factor (DF) = 0.05 -> depth is approximately n/60

• Depth Factor = 0 -> depth < 20; Best Case for cuSPARSE

[Figure: two plots, Initialization and Solve, with series cuSPARSE: DF 0.05, cuSPARSE: DF 0.0, Tree Solve: DF 0.05, Tree Solve: DF 0.0]

Tree Solve performance insensitive to depth


Minimum Spanning Tree Solve

Solving a reduced system

Graphs from Florida Sparse Matrix Collection

[Figure: two plots, Initialization and Solve]

5X-100X Speedup


Conclusions

• Up to 10X faster than CPU

• Significantly more efficient than using a generic solver

– Greater than 5X faster for all instances tested

– Runtime unaffected by tree depth

• On a 4GB GPU card:

– Can accumulate trees up to 45M vertices

– Can solve Linear systems with up to 25M unknowns

• Data-Parallel Primitives are effective for building irregular algorithms


Thank You


List of Graphs From Florida Sparse Matrix Collection

• roadNet-TX

• roadNet-PA

• roadNet-CA

• RM07R

• wikipedia-20070206

• wb-edu

• thermomech_dK

• soc-LiveJournal1

• af_shell10

• atmosmodl

• audikw_1

• bone010

• bone010_M

• circuit5M

• Freescale1

• kkt_power

• mouse_gene

• nlpkkt120
