
Page 1:

Dense Linear Algebra (Data Distributions)

Sathish Vadhiyar

Page 2:

Gaussian Elimination - Review
Version 1

For each column i, zero it out below the diagonal by adding multiples of row i to later rows:

for i = 1 to n-1
    (for each row j below row i)
    for j = i+1 to n
        (add a multiple of row i to row j)
        for k = i to n
            A(j, k) = A(j, k) - A(j, i)/A(i, i) * A(i, k)

[Figure: matrix after step i - zeros below the diagonal in the first i columns, with pivot A(i,i), row i, row j, and column index k marked]

Page 3:

Gaussian Elimination - Review
Version 2: remove A(j, i)/A(i, i) from the inner loop

For each column i, zero it out below the diagonal by adding multiples of row i to later rows:

for i = 1 to n-1
    (for each row j below row i)
    for j = i+1 to n
        m = A(j, i) / A(i, i)
        for k = i to n
            A(j, k) = A(j, k) - m * A(i, k)

[Figure: matrix diagram as in Version 1]

Page 4:

Gaussian Elimination - Review
Version 3: don't compute what we already know (column i below the diagonal is known to become zero, so k starts at i+1)

For each column i, zero it out below the diagonal by adding multiples of row i to later rows:

for i = 1 to n-1
    (for each row j below row i)
    for j = i+1 to n
        m = A(j, i) / A(i, i)
        for k = i+1 to n
            A(j, k) = A(j, k) - m * A(i, k)

[Figure: matrix diagram as in Version 1]

Page 5:

Gaussian Elimination - Review
Version 4: store the multipliers m below the diagonal

For each column i, zero it out below the diagonal by adding multiples of row i to later rows:

for i = 1 to n-1
    (for each row j below row i)
    for j = i+1 to n
        A(j, i) = A(j, i) / A(i, i)
        for k = i+1 to n
            A(j, k) = A(j, k) - A(j, i) * A(i, k)

[Figure: matrix diagram as in Version 1]

Page 6:

GE - Runtime

Each entry updated at step i costs one multiplication and one subtraction.

Divisions: 1 + 2 + 3 + ... + (n-1) ≈ n²/2
Multiplications (and as many subtractions): 1² + 2² + 3² + ... + (n-1)² ≈ n³/3 - n²/2
Total: ≈ 2n³/3 flops
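As a check on the leading term: each of the (n-i)² entries updated at step i costs one multiplication and one subtraction, so

    \sum_{i=1}^{n-1} 2(n-i)^2 = 2\sum_{k=1}^{n-1} k^2 = \frac{(n-1)\,n\,(2n-1)}{3} \approx \frac{2n^3}{3}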

Page 7:

Parallel GE

First step: 1-D block partitioning of the n columns across p processors (each processor gets a contiguous block of n/p columns)

[Figure: matrix diagram as in the GE review slides]

Page 8:

1D block partitioning - Steps

1. Divisions: n²/2
2. Broadcast: x·log(p) + y·log(p-1) + z·log(p-2) + ... + log 1 < n² log p
3. Multiplications and subtractions: (n-1)·n/p + (n-2)·n/p + ... + 1·1 ≈ n³/p

Runtime: < n²/2 + n² log p + n³/p
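To see how the three terms trade off, a small Python model of this bound; the unit costs t_div, t_word, and t_flop are illustrative assumptions, not measured constants.

import math

def ge_1d_block_bound(n, p, t_div=1.0, t_word=1.0, t_flop=1.0):
    # Upper-bound model from this slide: divisions + broadcasts + updates
    divisions = (n * n / 2) * t_div
    broadcasts = (n * n) * math.log2(p) * t_word
    updates = (n ** 3 / p) * t_flop
    return divisions, broadcasts, updates

# e.g. for n = 4096, p = 64 the n³/p update term dominates
print(ge_1d_block_bound(4096, 64))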

Page 9:

2-D block

To speed up the divisions (in the 1-D scheme, each column's divisions are done by a single processor)

[Figure: matrix diagram as in the GE review slides, with the processor grid dimensions P and Q marked]

Page 10:

2D block partitioning - Steps

1. Broadcast of A(k,k): log Q
2. Divisions: n²/Q (approx.)
3. Broadcast of multipliers: x·log(P) + y·log(P-1) + z·log(P-2) + ... ≈ (n²/Q) log P
4. Multiplications and subtractions: n³/(PQ) (approx.)
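Summing these over the n elimination steps (reading the broadcast of A(k,k) as a per-step cost and the other three as totals, which is my interpretation of the slide) gives roughly

    T \approx n \log Q + \frac{n^2}{Q} + \frac{n^2}{Q}\log P + \frac{n^3}{PQ}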

Page 11:

Problem with block partitioning for GE

Once a block is finished, the corresponding processor remains idle for the rest of the execution.

Solution?

Page 12:

Onto cyclic

The block partitioning algorithms waste processor cycles: there is no load balancing throughout the algorithm.

[Figure: the same columns assigned to processors 0 1 2 3 under each of the three schemes]

cyclic: load balance
1-D block-cyclic: load balance and block operations, but column factorization is a bottleneck
2-D block-cyclic: has everything (see the owner-mapping sketch below)
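To make the three distributions concrete, a small sketch of the owner mapping for each; the processor count p, block width b, and the pr x pc grid shape are free parameters of the sketch.

def owner_cyclic(j, p):
    # cyclic: column j belongs to processor j mod p
    return j % p

def owner_1d_block_cyclic(j, p, b):
    # 1-D block-cyclic: deal out blocks of b consecutive columns cyclically
    return (j // b) % p

def owner_2d_block_cyclic(i, j, pr, pc, b):
    # 2-D block-cyclic (the ScaLAPACK-style layout): deal out b x b blocks
    # cyclically over a pr x pc processor grid
    return ((i // b) % pr, (j // b) % pc)

# e.g. with p = 4, b = 2: columns 0..7 go to owners 0 0 1 1 2 2 3 3
print([owner_1d_block_cyclic(j, 4, 2) for j in range(8)])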

Page 13:

Block cyclic

Having whole blocks on a processor enables block-based operations (block matrix multiply, etc.).

Block-based operations lead to high performance.

Page 14:

GE: Miscellaneous
GE with Partial Pivoting

1D block-column partitioning: which is better, column or row pivoting?

• Column pivoting does not involve any extra steps, since the pivot search and exchange are done locally on each processor: O(n-i-1)

• The exchange information is passed to the other processes by piggybacking on the multiplier broadcast

• Row pivoting involves a distributed search and exchange: O(n/P) + O(log P)

2D block partitioning: the pivot search can be restricted to a limited number of columns (a pivoting sketch follows)
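To show the O(n-i-1) pivot search concretely, a sequential sketch extending the earlier Version 4 loop with partial (row) pivoting; the permutation vector piv is my addition.

import numpy as np

def lu_partial_pivot(A):
    # In-place LU with partial pivoting: P A = L U, with P encoded in piv
    n = A.shape[0]
    piv = np.arange(n)
    for i in range(n - 1):
        r = i + np.argmax(np.abs(A[i:, i]))  # pivot search: O(n-i-1) comparisons
        if r != i:                           # row exchange
            A[[i, r]] = A[[r, i]]
            piv[[i, r]] = piv[[r, i]]
        A[i+1:, i] /= A[i, i]
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
    return A, piv

# Usage sketch: the zero pivot in A(1,1) is avoided by a row exchange
A = np.array([[0., 2.], [4., 1.]])
LU, piv = lu_partial_pivot(A.copy())   # piv -> [1, 0]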

Page 15:

Triangular Solve - Unit Upper Triangular Matrix

Sequential complexity: O(n²)

Complexity of the parallel algorithm with 2-D block partitioning (√P x √P processor grid): O(n²/√P)

Thus (parallel GE / parallel TS) < (sequential GE / sequential TS): the solve parallelizes less well than the factorization. Overall (GE + TS): O(n³/P)
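For reference, a sequential sketch of the solve being parallelized, for a unit upper triangular U (implicit 1s on the diagonal, so there are no divisions); the row-by-row dot products give the O(n²) count.

import numpy as np

def unit_upper_solve(U, b):
    # Back substitution for U x = b, U unit upper triangular
    n = len(b)
    x = np.array(b, dtype=float)
    for i in range(n - 1, -1, -1):
        x[i] -= U[i, i+1:] @ x[i+1:]  # subtract contributions of already-solved unknowns
    return x

# e.g. U = [[1, 2], [0, 1]], b = [5, 1]  ->  x = [3, 1]
print(unit_upper_solve(np.array([[1., 2.], [0., 1.]]), [5., 1.]))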

Page 16:

Dense LU on GPUs

Page 17:

LU for Hybrid Multicore + GPU Systems (Tomov et al., Parallel Computing, 2010)

Assume the CPU host has 8 cores, and an N x N matrix divided into blocks of size NB.

The matrix is split so that the first N - 7NB columns reside in GPU memory and the last 7NB columns on the host.

Page 18:

Load Splitting for Hybrid LU

Page 19:

Steps

The current panel is downloaded to the CPU (the dark blue part in the figure).

The panel is factored on the CPU and the result is sent to the GPU to update the trailing submatrix (the red part).

The GPU updates the first NB columns of the trailing submatrix.

The updated panel is sent to the CPU and asynchronously factored on the CPU while the GPU updates the rest of the trailing submatrix (a toy simulation of this overlap follows).

The remaining 7 host cores update the last 7NB host columns.
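A toy Python simulation of the lookahead overlap described above, with sleep() calls standing in for the actual kernels; the relative timings and the single-worker pool (modeling the GPU's in-order queue) are illustrative assumptions, not the MAGMA implementation.

import concurrent.futures as cf
import time

def factor_panel(k):       # stand-in for the CPU panel factorization
    time.sleep(0.01)

def update_next_panel(k):  # stand-in for updating the next NB columns on the GPU
    time.sleep(0.005)

def update_rest(k):        # stand-in for updating the rest of the trailing matrix
    time.sleep(0.03)

with cf.ThreadPoolExecutor(max_workers=1) as gpu:  # one worker serializes "GPU" work
    lookahead = None
    for k in range(8):
        if lookahead is not None:
            lookahead.result()      # panel k's columns must be current before factoring
        factor_panel(k)             # CPU factors panel k while update_rest(k-1)
                                    # may still be running on the GPU worker
        lookahead = gpu.submit(update_next_panel, k)  # bring panel k+1 current first
        gpu.submit(update_rest, k)  # bulk update overlaps the next panel factorization
# exiting the with-block waits for the remaining GPU work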

Page 20:

Dense Cholesky on GPUs

Page 21:

Cholesky Factorization

For a symmetric positive definite matrix A: A = L L^T

Similar to Gaussian elimination: a succession of block column factorizations, each followed by an update of the trailing submatrix.

Can be parallelized using a fork-join model similar to parallel GE: the matrix-matrix multiplications run in parallel (fork), with synchronization needed before the next block column factorization (join). A sketch follows.
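A sequential sketch of that pattern: factor a diagonal block, triangular-solve for the block column below it, then update the trailing submatrix. SciPy routines stand in for the kernels that would run in parallel; the function name and block loop are my additions.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def blocked_cholesky(A, nb):
    # Right-looking blocked Cholesky: returns L with A = L L^T
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # factor the diagonal block (the block column factorization)
        A[k:e, k:e] = cholesky(A[k:e, k:e], lower=True)
        if e < n:
            # triangular solve for the rest of the block column
            A[e:, k:e] = solve_triangular(A[k:e, k:e], A[e:, k:e].T, lower=True).T
            # fork: trailing submatrix update (matrix-matrix multiply);
            # join: the next diagonal block needs this update to be complete
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)

# Usage: B B^T + n I is symmetric positive definite
n = 6
B = np.random.rand(n, n)
M = B @ B.T + n * np.eye(n)
L = blocked_cholesky(M, nb=2)
assert np.allclose(L @ L.T, M)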

Page 22:

Heterogeneous Tile Algorithms (Song et al., ICS 2012)

Divide a matrix into a set of small and large tiles: small tiles for the CPUs and large tiles for the GPUs.

At the top level, divide the matrix into large square tiles of size B x B.

Then subdivide each top-level tile into a number of small rectangular tiles of size B x b and a remaining tile (sketched below).
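A sketch of the two-level split for one top-level tile, in column-index terms; the count m of small tiles and the remainder width B - m·b are my reading of "a number of small rectangular tiles ... and a remaining tile".

def split_top_tile(B, b, m):
    # Split a B-wide top-level tile into m small tiles of width b (for CPUs)
    # plus one remaining tile of width B - m*b (for a GPU); assumed layout
    assert 0 < m * b < B
    small = [(i * b, (i + 1) * b) for i in range(m)]
    remaining = (m * b, B)
    return small, remaining

# e.g. B = 1024, b = 128, m = 2 -> small tiles (0,128), (128,256); remainder (256,1024)
print(split_top_tile(1024, 128, 2))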

Page 23:

Heterogeneous Tile Cholesky Factorization

Factor the first tile to solve for L(1,1).

Apply L(1,1) to update the tile to its right, A(1,2).

Page 24:

Heterogeneous Tile Cholesky Factorization

Factor the two tiles below L(1,1)

Update all tiles to the right of the first tile column.

Page 25:

Heterogeneous Tile Cholesky Factorization

At the second iteration, repeat the above steps on the trailing submatrix starting from the second tile column

Page 26:

Two Level Block-Cyclic Distribution

Divide the matrix A into p x (s·p) rectangular tiles as discussed earlier, i.e., first divide it into p x p large tiles, then partition each large tile into s tiles.

On a hybrid CPU and GPU machine, allocate the tiles to the host and to a number P of GPUs in a 1-D block-cyclic way.

Tile columns whose indices are multiples of s are mapped to the P GPUs in a cyclic way; the remaining columns go to all CPU cores in the host (see the mapping sketch below).
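A sketch of that column-to-device mapping, assuming 1-based tile-column indices (so "multiples of s" are s, 2s, 3s, ...); with 0-based indexing the test would change accordingly.

def device_for_tile_column(j, s, P):
    # Tile columns whose index is a multiple of s go to the P GPUs cyclically;
    # every other tile column is handled by the host CPU cores
    if j % s == 0:
        return ("gpu", (j // s - 1) % P)
    return ("cpu", None)

# e.g. s = 4, P = 2: columns 4, 8, 12, ... alternate between gpu 0 and gpu 1
print([device_for_tile_column(j, 4, 2) for j in range(1, 13)])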

Page 27:

Two Level Block-Cyclic Distribution - Example

Page 28:

Tuning Tile Size

Depends on the relative performance of the CPU and GPU cores.

For example, to determine the tile width Bh for the CPU:

[Equation from the original slide]

Page 29:

References

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, S. Tomov, and W. M. Hu. Faster, Cheaper, Better: a Hybridization Methodology to Develop Linear Algebra Software for GPUs. GPU Computing Gems, vol. 2, 2010.

F. Song, S. Tomov, and J. Dongarra. Enabling and Scaling Matrix Computations on Heterogeneous Multi-core and Multi-GPU Systems. In Proceedings of the ACM International Conference on Supercomputing (ICS), pages 365-376, 2012.