A Condensation-based Low Communication Linear Systems Solver Utilizing Cramer's Rule


Ken Habgood, Itamar Arel
Department of Electrical Engineering & Computer Science
The University of Tennessee

GABRIEL CRAMER (1704-1752)

EECS Department / University of Tennessee http://mil.engr.utk.edu

Outline

Motivation & problem statement
Algorithm review
Numerical accuracy & stability
Parallel Implementation
Communication
Results

Source: http://tridane.faculty.asu.edu


Introduction

Mainstream approach: Gaussian elimination, e.g. LU decomposition

Looking for an efficient parallel solver with lower communication overhead

Targeting an unpopular approach: Cramer’s Rule
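For reference, Cramer's rule gives each unknown as a ratio of determinants, x_i = det(A_i) / det(A), where A_i is A with column i replaced by b. The sketch below is a minimal illustration only: the names `det` and `cramer_solve` are hypothetical, and the cofactor expansion is an assumption chosen for brevity (it is exponential in cost and is not the condensation method proposed in this talk).

```python
from fractions import Fraction

def det(m):
    """Determinant by cofactor expansion along the first row (tiny matrices only)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        # Minor: drop row 0 and column j.
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

def cramer_solve(a, b):
    """Solve a x = b with Cramer's rule, returning exact fractions."""
    d = det(a)
    if d == 0:
        raise ValueError("matrix is singular")
    xs = []
    for i in range(len(a)):
        # A_i: copy of a with column i replaced by the right-hand side b.
        a_i = [row[:i] + [b[r]] + row[i + 1:] for r, row in enumerate(a)]
        xs.append(Fraction(det(a_i), d))
    return xs

A = [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]]
b = [8, -11, -3]
x = cramer_solve(A, b)
```

Integer inputs keep the arithmetic exact here; a floating-point version would need the stability analysis discussed later in the talk.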



LU Communication Pattern

Source: http://www.caam.rice.edu/~timwar/MA471F03/

Communication for distributed LU decomposition

L00 U00   U01   U02
L10       A11   A12
L20       A21   A22

Three sequential steps
1. Top left computes and sends
2. Row and column leads compute and send
3. Remaining processors factorize their blocks

One-to-one communication
Idle time while leads are processing
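The three steps correspond to the standard right-looking block LU update. The block below is a sketch of that textbook formulation for a 3 x 3 block partition, with pivoting omitted as a simplifying assumption; the block indices follow the grid on this slide, and the formulas are not taken from the slide itself.

```latex
A =
\begin{bmatrix}
A_{00} & A_{01} & A_{02}\\
A_{10} & A_{11} & A_{12}\\
A_{20} & A_{21} & A_{22}
\end{bmatrix}

% Step 1 (top-left processor): factor the diagonal block
A_{00} = L_{00} U_{00}

% Step 2 (row and column leads): triangular solves, i, j \in \{1, 2\}
U_{0j} = L_{00}^{-1} A_{0j}, \qquad L_{i0} = A_{i0} U_{00}^{-1}

% Step 3 (remaining processors): update the trailing blocks
A_{ij} \leftarrow A_{ij} - L_{i0} U_{0j}
```

Each step needs the output of the previous one, which is where the idle time noted above comes from.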









Proposed Algorithm Flow


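Matrix “Mirroring”

\[
\begin{bmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4}\\
a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4}\\
a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4}\\
a_{4,1} & a_{4,2} & a_{4,3} & a_{4,4}
\end{bmatrix}
\xrightarrow{\text{mirror}}
\begin{bmatrix}
a_{1,4} & a_{1,3} & a_{1,2} & a_{1,1}\\
a_{2,4} & a_{2,3} & a_{2,2} & a_{2,1}\\
a_{3,4} & a_{3,3} & a_{3,2} & a_{3,1}\\
a_{4,4} & a_{4,3} & a_{4,2} & a_{4,1}
\end{bmatrix}
\]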
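Mirroring is taken here to mean reversing the column order of the matrix; that reading is an assumption based on the index pattern on this slide. A minimal sketch, with `mirror` as a hypothetical helper name:

```python
def mirror(matrix):
    """Return a copy of the matrix with its column order reversed."""
    return [row[::-1] for row in matrix]

# Entry "ij" stands in for a_{i,j}, so the index pattern is easy to read.
A = [[11, 12, 13, 14],
     [21, 22, 23, 24],
     [31, 32, 33, 34],
     [41, 42, 43, 44]]
M = mirror(A)
```

Reversing n columns takes floor(n/2) column swaps, so the determinant of the mirrored matrix equals the original determinant times (-1)^floor(n/2).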

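Mirroring example

Applying Chio’s condensation yields:

\[
\begin{bmatrix}
a'_{3,3} & a'_{3,4}\\
a'_{4,3} & a'_{4,4}
\end{bmatrix}
\qquad
\begin{bmatrix}
a'_{3,2} & a'_{3,1}\\
a'_{4,2} & a'_{4,1}
\end{bmatrix}
\]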
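Chio's condensation reduces an n x n determinant to an (n-1) x (n-1) one. With pivot a_{1,1} != 0, each new entry is the 2 x 2 determinant a_{1,1} a_{i,j} - a_{i,1} a_{1,j}, and det(A) equals the determinant of the condensed matrix divided by a_{1,1}^(n-2). The sketch below applies this repeatedly; `chio_det` is a hypothetical name, exact fractions are used to sidestep rounding, and the row swap on a zero pivot is an added assumption, not something stated on the slides.

```python
from fractions import Fraction

def chio_det(matrix):
    """Determinant via repeated Chio condensation on the top-left pivot."""
    m = [[Fraction(v) for v in row] for row in matrix]
    sign = 1
    scale = Fraction(1)  # accumulated product of pivot^(n-2) factors
    while len(m) > 1:
        n = len(m)
        # Find a row with a nonzero entry in the first column to use as pivot row.
        pivot_row = next((r for r in range(n) if m[r][0] != 0), None)
        if pivot_row is None:
            return Fraction(0)  # first column is all zeros
        if pivot_row != 0:
            m[0], m[pivot_row] = m[pivot_row], m[0]
            sign = -sign  # a row swap flips the determinant's sign
        p = m[0][0]
        scale *= p ** (n - 2)
        # Each condensed entry is a 2 x 2 determinant built around the pivot.
        m = [[p * m[i][j] - m[i][0] * m[0][j] for j in range(1, n)]
             for i in range(1, n)]
    return sign * m[0][0] / scale

A = [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]]
d = chio_det(A)
```

Every condensed entry depends only on the pivot row, the pivot column, and one element, which is what makes the step easy to distribute.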









Accuracy and Numerical Stability

Backward error estimation
Theoretical estimate of rounding error

E matrix depends on two items
  The largest element in A or b
  The growth factor of the algorithm

Same growth factor as LU decomposition with partial pivoting
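A backward error can also be measured after the fact for any computed solution. The sketch below uses the standard normwise relative backward error in the infinity norm, eta = ||b - A x|| / (||A|| ||x|| + ||b||); this is a generic textbook measure offered as an assumption, not the theoretical estimate from the talk, and the function names are hypothetical.

```python
def inf_norm_vec(v):
    """Infinity norm of a vector: largest absolute entry."""
    return max(abs(x) for x in v)

def inf_norm_mat(a):
    """Infinity norm of a matrix: largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in a)

def backward_error(a, x_hat, b):
    """Normwise relative backward error of x_hat as a solution of a x = b."""
    residual = [bi - sum(aij * xj for aij, xj in zip(row, x_hat))
                for row, bi in zip(a, b)]
    denom = inf_norm_mat(a) * inf_norm_vec(x_hat) + inf_norm_vec(b)
    return inf_norm_vec(residual) / denom

A = [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]]
b = [8, -11, -3]
exact = backward_error(A, [2, 3, -1], b)               # exact solution
perturbed = backward_error(A, [2.0, 3.0, -1.0 + 1e-8], b)  # slightly wrong
```

A value near the machine's unit roundoff means the computed solution exactly solves a nearby system, which is the sense in which the E matrix above is small.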



Forward Error Comparisons

Matrix Size    κ(A)        Max Matlab   Max GSL     Avg Matlab   Avg GSL
1000 x 1000    506930      2.39E-09     1.93E-10    1.03E-10     5.38E-12
2000 x 2000    790345      4.52E-09     5.36E-09    1.01E-10     7.27E-12
3000 x 3000    1540152     1.95E-08     1.84E-08    1.12E-10     2.09E-11
4000 x 4000    12760599    4.81E-08     5.62E-08    1.43E-10     7.91E-11
5000 x 5000    765786      2.92E-08     4.39E-08    1.18E-10     3.46E-11
6000 x 6000    1499430     8.67E-08     8.70E-08    1.37E-10     6.04E-11
7000 x 7000    3488010     9.92E-08     8.95E-08    1.27E-10     5.15E-11
8000 x 8000    8154020     9.09E-08     9.43E-08    1.86E-10     7.85E-11



Forward Error - Residual

Matrix Size    κ(A)        Max Residual   Avg Residual
1000 x 1000    506930      3.14E-08       4.46E-09
2000 x 2000    790345      6.72E-09       9.48E-10
3000 x 3000    1540152     2.79E-08       3.28E-09
4000 x 4000    12760599    1.06E-05       1.34E-06
5000 x 5000    765786      2.00E-08       2.65E-09
6000 x 6000    1499430     2.95E-08       3.86E-09
7000 x 7000    3488010     1.99E-08       2.44E-09
8000 x 8000    8154020     1.94E-08       2.32E-09



MATLAB Matrix Gallery

Special Matrix

Avg Matlab Residual

Matlab Residual

clement — Tridiagonal matrix with zero diagonal entries 1.40E-05 7.43E+133 7.85E+144

lehmer — Symmetric positive definite matrix 2.49E-06 7.20E-09 3.89E-06

circul — Circulant matrix 3.23E-08 1.53E-13 1.04E-09

chebspec — Chebyshev spectral differentiation matrix 9.12E-02 3.74E+04 2.0E-01

lesp — Tridiagonal matrix with real, sensitive eigenvalues 9.56E-11 5.11E-16 7.30E-10

minij — Symmetric positive definite matrix 5.14E-10 1.71E-08 6.59E-06

orthog — Orthogonal and nearly orthogonal matrices 1.03E-07 1.09E-14 2.80E-08

randjorth — Random J-orthogonal matrix 1.55E-04 1.68E-00 1.13E-04









Serial Performance

Results support the theoretical ~2.5x complexity ratio



Algorithm Processing Flow


Overview of Parallel Implementation



Parallel Implementation (cont’d)



Communication Complexity

Two phases of parallel communication
  Parallel Chio’s
  Gather columns

Overall bandwidth

[Equations: communication volume of the parallel Chio’s phase, the gather-columns phase, and the overall bandwidth in bytes; the expressions did not survive extraction.]

N: original matrix size, P: number of processors, F: gather-columns size



Communication Overhead


Where’s the Breakeven Point?

Point at which communication “dead time” matches computational workload

[Equation: breakeven condition in terms of N, p and d_C; the derivation did not survive extraction.]

Assuming d_C = 0.05 and N = 1000, the breakeven point would be P ≈ 142 processors



Closing Thoughts …

Proposed O(N^3) Cramer’s Rule method
  Significantly lower communication overhead
  Many more “broadcasts” than “unicasts”
  Communication is a function of problem size, not number of processors

Next steps …
  Optimize parallel implementation
  Sparse matrix version



Thank you
