Recurrences and Tridiagonal solvers:
From Classical Tricks to Parallel Treats
Stratis Gallopoulos
HPCLAB
Computer Engineering & Informatics Dept.
University of Patras, Greece
West Lafayette
January 2016
Outline
1 Performance metrics (recap)
2 Linear recurrences and triangular systems
Linear recurrences: examples and matrix interpretation
Recurrences, triangular systems and parallel prefix
3 Tridiagonal systems and solvers
Solving by Marching
Solving by Recursive Doubling (not covered)
Solving by Cyclic Reduction
Performance metrics (recap)
Performance metrics
Chapter 3 @ GPS.15
Tp : runtime on p processors
Op : arithmetic operations on p processors
Sp : speedup, Sp = T1/Tp
Ep : efficiency, Ep = T1/(pTp)
Cp : parallel cost, Cp = pTp
Rp : arithmetic redundancy, Rp = Op/O1
Linear recurrences and triangular systems
Linear recurrences: examples and matrix interpretation
Terminology
Problem: In many applications we would like to compute the values of
a sequence {ξk, ξk+1, . . .} that is realized as a recurrence
ξn = γn−1ξn−1 + γn−2ξn−2 + · · · + γn−mξn−m + δn,
where the coefficients γj, δj are known.
The recurrence is of order m and is homogeneous when δj ≡ 0.
Recurrences can be about scalars, vectors, matrices ...
In order to compute a recurrence of order m it is necessary to have
available m initial values, e.g. ξ1, . . . , ξm.
Recurrences are computational kernels for many computations in science and
engineering.
The goal is to compute these values quickly and accurately, and to implement them as
efficient primitives on parallel architectures.
Obstacle: Every new value depends on previous ones, which gives the
appearance of an unavoidable sequential computation.
Example: Definite integrals
Compute
ψk = ∫₀¹ ξ^k/(ξ + 8) dξ,  for k = 0, 1, 2, . . . , n.
The following recurrence is easy to derive: ψk + 8ψk−1 = 1/k, for k ≥ 1, where ψ0 = loge(9/8).
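A quick numerical sketch (my own Python, not from the slides) of why accuracy is a concern here: running this recurrence forward in floating point is unstable, since any error in ψ0 is multiplied by −8 at every step.

```python
import math

def psi_forward(n):
    """Forward recurrence: psi_k = 1/k - 8*psi_{k-1}, psi_0 = log(9/8)."""
    psi = [math.log(9.0/8.0)]
    for k in range(1, n + 1):
        psi.append(1.0/k - 8.0*psi[-1])
    return psi

def psi_quad(k, m=200000):
    """Reference value of psi_k by the midpoint rule on [0, 1]."""
    h = 1.0/m
    return sum(((i + 0.5)*h)**k / (((i + 0.5)*h) + 8.0)*h for i in range(m))

# For small k the forward values are still accurate; by k = 20 the
# roundoff in psi_0 has been amplified by about 8**20 and the computed
# value is useless.
err5 = abs(psi_forward(5)[5] - psi_quad(5))
err20 = abs(psi_forward(20)[20] - psi_quad(20))
```

Running the recurrence backwards (solving for ψk−1 in terms of ψk) divides the error by 8 at each step instead, which is the classical cure.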
Example: Orthogonal polynomials
Chebyshev polynomials of the first kind:
Tk(ξ) = cos(k cos−1 ξ), k = 0, 1, 2, . . . ,
satisfy the 2nd order recurrence (3-term):
Tk+1(ξ)− 2ξTk(ξ) + Tk−1(ξ) = 0
where T0(ξ) = 1, T1(ξ) = ξ.
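A quick check of this three-term recurrence against the closed form (my own Python sketch):

```python
import math

def cheb_T(k, x):
    """T_k(x) via the recurrence T_{k+1}(x) = 2x*T_k(x) - T_{k-1}(x)."""
    t_prev, t = 1.0, x          # T_0 and T_1
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t = t, 2.0*x*t - t_prev
    return t

x = 0.3
# compare against the closed form cos(k*acos(x)), valid for |x| <= 1
errs = [abs(cheb_T(k, x) - math.cos(k*math.acos(x))) for k in range(8)]
```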
Example: Horner’s rule
s = a(n);
for i = n-1:-1:0
    s = x*s + a(i);
end
Linear recurrence
If we express a linear recurrence for successive n,
ξ1 = φ1,  ξi = φi − ∑_{j=k}^{i−1} λij ξj,  where i = 2, . . . , n and k = max{1, i − m}.
Key: These linear recurrences can be expressed as a lower triangular system:
x = f − L̂x, where x = (ξ1, ξ2, . . . , ξn)⊤, f = (φ1, φ2, . . . , φn)⊤. (1)
Example: if m = 3, then L̂ is a banded lower triangular matrix:

L̂ = ( 0                               )
    ( λ21  0                          )
    ( λ31  λ32  0                     )
    ( λ41  λ42  λ43  0                )
    (      λ52  λ53  λ54  0           )
    (           ...  ...  ...  ...    )
    (      λn,n−3  λn,n−2  λn,n−1  0  ).
Setting L = I + L̂, (1) becomes
Lx = f . (2)
Matrix L is lower triangular with unit diagonal and lower bandwidth m + 1.
Computing the values with forward substitution requires T1 = 2mn + O(m²) arithmetic operations.
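A sequential sketch of this forward substitution (my own Python; `lam(i, j)` is a hypothetical accessor returning λij, nonzero only for 0 < i − j ≤ m):

```python
def forward_banded(lam, f, m):
    """Solve (I + Lhat) x = f, i.e. run the order-m recurrence
    xi_i = phi_i - sum_{j=max(1,i-m)}^{i-1} lambda_ij * xi_j (1-based)."""
    n = len(f)
    x = [0.0]*n
    for i in range(1, n + 1):
        s = f[i - 1]
        for j in range(max(1, i - m), i):   # at most m previous terms
            s -= lam(i, j)*x[j - 1]
        x[i - 1] = s
    return x

# order-1 example: xi_i = phi_i - 0.5*xi_{i-1}
x = forward_banded(lambda i, j: 0.5, [1.0, 1.0, 1.0, 1.0], 1)
```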
Dense triangular systems and column sweep I
In a "dense recurrence" m ≈ n − 1, so (2) can be expressed as a dense unit lower triangular system:

( 1                                    ) ( ξ1 )   ( φ1 )
( λ21  1                               ) ( ξ2 )   ( φ2 )
( λ31  λ32  1                          ) ( ξ3 )   ( φ3 )
( λ41  λ42  λ43  1                     ) ( ξ4 ) = ( φ4 )
( ...  ...  ...  ...  . . .            ) ( .. )   ( .. )
( λn1  λn2  λn3  λn4  . . .  λn,n−1  1 ) ( ξn )   ( φn )   (3)
Starting from ξ1 = φ1, the right-hand side of (3) is deflated as f − ξ1Le1, in order to form a new
r.h.s. corresponding to a linear system of order (n − 1):

( 1                        ) ( ξ2 )   ( φ(1)_2 )
( λ32  1                   ) ( ξ3 )   ( φ(1)_3 )
( λ42  λ43  1              ) ( ξ4 ) = ( φ(1)_4 )
( ...  ...  . . .          ) ( .. )   (  ..    )
( λn2  λn3  λn4  . . .  1  ) ( ξn )   ( φ(1)_n ),
We set φ(1)_i = φi − φ1λi1, i = 2, 3, . . . , n. This can be applied recursively.
With (n − 1) processors, the algorithm requires 2(n − 1) parallel steps without any arithmetic
redundancy.
The resulting process is called the column-sweep algorithm.
Dense triangular systems and column sweep II
Require: Lower triangular matrix L of order n with unit diagonal, right-hand side f
Ensure: Solution of Lx = f
1: set φ(0)_j = φj, j = 1, . . . , n {that is, f(0) = f}
2: do i = 1 : n
3:   ξi = φ(i−1)_i
4:   doall j = i + 1 : n
5:     φ(i)_j = φ(i−1)_j − φ(i−1)_i λj,i  {compute f(i) = N⁻¹_i f(i−1)}
6:   end
7: end
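A serial model of the column sweep (my own Python; the `doall` loop is written as an ordinary loop, but its iterations over j are independent and could run in parallel):

```python
def column_sweep(L, f):
    """Solve L x = f for unit lower triangular L (full n x n list of lists)."""
    n = len(f)
    x = [0.0]*n
    phi = list(f)                     # phi holds the deflated rhs f^(i)
    for i in range(n):
        x[i] = phi[i]                 # xi_i = phi^{(i-1)}_i
        for j in range(i + 1, n):     # "doall": independent updates
            phi[j] -= x[i]*L[j][i]    # deflate the right-hand side
    return x

L = [[1.0, 0.0, 0.0],
     [2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0]]
f = [1.0, 4.0, 10.0]
x = column_sweep(L, f)
```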
Observations: The method exploits the parallelism in each of the O(n) steps of the
sweep. Thus Tp = O(n).
Questions:
Can we break the sequential dependence and obtain poly-logarithmic
computational complexity, e.g. Tp = O(log^k n)?
Can this be done with little redundancy and overhead?
Can this be done stably?
Can this be implemented on parallel architectures?
Dense triangular systems: The fan-in approach
Basic results
Theorem (Ch. 3 @ GPS.15, from [CK75], [SB77])
The order-n unit lower triangular system Lx = f can be solved in
Tp = (1/2) log² n + (3/2) log n parallel arithmetic steps using at most
p = (15/1024) n³ + O(n²) processors, with redundancy Rp = O(n).
Proof (outline) The basic idea consists of
i) reorganizing the computation as a binary tree of height O(log n);
ii) showing that each tree node requires at most O(log n) operations if
sufficient processing power is available;
iii) bounding the number of processors.
Key to proving (i) and (ii) are properties of the elementary Gauss transforms
used in LU and Gaussian elimination, and the product form of L based on
these transforms.
Proof I
Recall that a unit lower triangular matrix can be factorized as
L = N1 N2 N3 · · · Nn−1,
where

Nj = ( 1                        )
     (    . . .                 )
     (         1                )
     (         λj+1,j  1        )
     (         ...       . . .  )
     (         λn,j          1  )   (4)

Therefore, if Lx = f then
x = N⁻¹_{n−1} N⁻¹_{n−2} · · · N⁻¹_2 N⁻¹_1 f ,   (5)
where the terms N⁻¹_j are readily available (a simple sign reversal of the λij's in (4)).
Proof II
The product (5) is evaluated by fan-in: at every level, adjacent factors are multiplied in pairs while the leading factor is absorbed into the right-hand side:
M(0)_{n−1} M(0)_{n−2} M(0)_{n−3} M(0)_{n−4} · · · M(0)_3 M(0)_2 M(0)_1 f(0)
= M(1)_{n/2−1} · · · M(1)_2 M(1)_1 f(1)
= M(2)_{n/4−1} · · · M(2)_1 f(2)
= · · · = f(log n) ≡ x
Algorithm
Algorithm DTS: Triangular solver based on a fan-in approach
Require: Lower triangular matrix L of order n = 2^µ and right-hand side f
Ensure: Solution x of Lx = f
1: f(0) = f, M(0)_i = N⁻¹_i, i = 1, . . . , n − 1, where Nj is as in Eq. (4)
2: do j = 0 : µ − 1
3:   f(j+1) = M(j)_1 f(j)
4:   doall k = 1 : (n/2^{j+1}) − 1
5:     M(j+1)_k = M(j)_{2k+1} M(j)_{2k}
6:   end
7: end
Numerical issues
Care is needed regarding the numerical stability of recurrence solvers.
Algorithmic stability: masterly survey by Higham [Hig02].
There is an additional issue related to conditioning (saved for later).
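A toy model of Algorithm DTS (my own Python, dense matrices for clarity; a real implementation would exploit the sparsity of the M(j) factors, and the pair products in the `while` loop are the ones a parallel machine would form simultaneously):

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k]*B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matvec(A, v):
    n = len(v)
    return [sum(A[i][k]*v[k] for k in range(n)) for i in range(n)]

def dts_fanin(L, f):
    """Fan-in solve of unit lower triangular L x = f, n a power of 2."""
    n = len(f)
    # M^(0)_i = N_i^{-1}: identity with column i's subdiagonal of L negated
    M = []
    for i in range(1, n):
        Ni_inv = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
        for r in range(i, n):
            Ni_inv[r][i - 1] = -L[r][i - 1]
        M.append(Ni_inv)              # M[i-1] holds M^(0)_i
    while M:
        f = matvec(M[0], f)           # f^(j+1) = M^(j)_1 f^(j)
        # pair the remaining factors: M^(j+1)_k = M^(j)_{2k+1} M^(j)_{2k}
        M = [matmul(M[2*k], M[2*k - 1]) for k in range(1, (len(M) + 1)//2)]
    return f

L = [[1.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]
x = dts_fanin(L, [1.0, 2.0, 3.0, 4.0])
```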
Banded recurrences
Some results (Ch. 3 @ GPS.15)
1 Let L be a banded unit lower triangular matrix of bandwidth
m + 1, where m ≤ n/2, and λik = 0 for i − k > m. Then Lx = f
can be solved in fewer than Tp = (2 + log m) log n parallel steps
using fewer than p = m(m + 1)n/2 processors. [Th. 3.2]
2 An algorithm can be designed using
Op = m²n log(n/2m) + O(mn log n) arithmetic operations, and
arithmetic redundancy Rp = O(m log n). [Cor. 3.1]
3 More refined, lower complexity results can be obtained for lower triangular Toeplitz matrices.
4 If L is a dense Toeplitz unit lower triangular matrix, then Lx = f can be solved in about
Tp = log² n + 2 log n parallel steps, with p ≤ n²/4. [Th. 3.3]
5 If L is also banded with (m + 1) ≥ 3, then the system can be solved in about
Tp = (2 log m + 3) log n with p ≤ 3mn/4. [Th. 3.4]
Recurrences, triangular systems and parallel prefix
Recurrences: with prefixes
Let ξi = αiξi−1 + βi and compute by expanding repeatedly:
ξ1 = α1ξ0 + β1
ξ2 = α2ξ1 + β2 = α2α1ξ0 + α2β1 + β2
ξ3 = α3ξ2 + β3 = α3α2α1ξ0 + α3α2β1 + α3β2 + β3
. . .
ξn = αn · · · α1ξ0 + αn · · · α2β1 + · · · + αnαn−1βn−2 + αnβn−1 + βn

We can write
ξn = (αn · · · α1, αn · · · α2, . . . , αnαn−1, αn, 1) (ξ0, β1, . . . , βn−2, βn−1, βn)⊤

Observe: the important role played by the products
αn
αnαn−1
. . .
αnαn−1 · · · α2
αnαn−1 · · · α2α1
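A small sketch (mine, plain Python) that evaluates ξn exactly in this form: first build the products αn, αnαn−1, . . . , then take the inner product with (ξ0, β1, . . . , βn)⊤, and compare against the plain sequential recurrence:

```python
def xi_n_by_products(alpha, beta, xi0):
    """xi_n for xi_i = alpha_i*xi_{i-1} + beta_i, via the products
    alpha_n, alpha_n*alpha_{n-1}, ..., alpha_n*...*alpha_1."""
    n = len(alpha)
    # prods[i] = alpha_{i+1} * ... * alpha_n  (so prods[n] = 1)
    prods = [1.0]*(n + 1)
    for i in range(n - 1, -1, -1):
        prods[i] = prods[i + 1]*alpha[i]
    # xi_n = prods[0]*xi_0 + sum_i prods[i]*beta_i
    return prods[0]*xi0 + sum(prods[j + 1]*beta[j] for j in range(n))

def xi_n_sequential(alpha, beta, xi0):
    x = xi0
    for a, b in zip(alpha, beta):
        x = a*x + b
    return x

alpha = [2.0, 3.0, 4.0]
beta = [1.0, 1.0, 1.0]
r1 = xi_n_by_products(alpha, beta, 1.0)
r2 = xi_n_sequential(alpha, beta, 1.0)
```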
Prefixes
These terms are called prefixes with respect to multiplication, or simply
prefix products, of the (ordered) sequence
(αn, αn−1, . . . , α2, α1).
How can we do this efficiently in parallel?
Computing prefixes of sequences with respect to some associative
operation is a kernel computation in many applications.
It comes under various names and flavors:
Depending on the operation we can have prefix sums.
Sometimes it is natural to refer to suffixes.
In the literature (and computer manuals) another term is scan.
The sequence could also consist of vectors or matrices of
compatible dimensions.
Parallel prefix problem
Given A = (a1, ..., an) and an associative binary operation ⋆, compute the n prefixes:
a1, a1 ⋆ a2, a1 ⋆ a2 ⋆ a3, . . . , a1 ⋆ a2 ⋆ · · · ⋆ an.
On 1 processor the algorithm is obvious:
s1 ← a1; for i = 2 : n, si = si−1 ⋆ ai, end.
Parallel idea: apply simultaneous reductions to groups of elements.
With n/2 processors compute sn in ⌈log n⌉ parallel operations ⋆;
with (n − 1)/2 processors, sn−1, etc.
⇒ with p = O(n²) processors, compute all prefixes in Tp = O(log n).
Speedup, but this needs O(n² log n) rather than O(n) operations.
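The 1-processor scan in a short Python sketch (mine, not from the slides); any associative operation can be plugged in:

```python
from operator import add, mul

def scan(seq, op):
    """Sequential inclusive scan: s1 = a1, si = op(s_{i-1}, ai)."""
    out = []
    s = None
    for a in seq:
        s = a if s is None else op(s, a)
        out.append(s)
    return out

sums = scan(range(1, 7), add)    # prefix sums of 1..6
prods = scan(range(1, 7), mul)   # prefix products of 1..6
```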
Parallel prefix algorithms
Divide-and-conquer, e.g. for n = 8 prefix sums:
1 Partition A = {a1, ..., a8} into 2 groups of successive elements,
A1 = {a1, ..., a4} and A2 = {a5, ..., a8}.
2 Compute the prefixes of A1 and A2 independently.
3 Select the last prefix term of A1 and combine it with all prefixes of A2.
Costs: Tp(n) = Tp(n/2) + 1 if there are enough processors to compute
sn/2+1 = sn/2 ⋆ s2,1, sn/2+2 = sn/2 ⋆ s2,2, ..., sn = sn/2 ⋆ s2,n/2 in 1 step.
Thus
Tp = O(log n)
Op = O(n log n)
Further ideas: there are plenty of discussions in the literature on parallel
prefix algorithms and the tradeoffs between the metrics.
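The three steps above can be sketched directly (my own Python; serial model of the parallel algorithm, with the combine step being the one "doall" that runs in a single parallel step):

```python
from operator import add

def prefix_dc(a, op):
    """Divide-and-conquer inclusive scan: split, scan both halves, then
    combine the last prefix of the left half into every prefix of the
    right half."""
    n = len(a)
    if n == 1:
        return [a[0]]
    left = prefix_dc(a[:n//2], op)
    right = prefix_dc(a[n//2:], op)
    # combine step: independent over the right half
    return left + [op(left[-1], r) for r in right]

pref = prefix_dc([1, 2, 3, 4, 5, 6, 7, 8], add)
```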
Tridiagonal systems and solvers
Goal
Given nonsingular A ∈ Rn×n and f ∈ Rn, solve Ax = f:

( α1,1   α1,2                         ) ( ξ1 )   ( φ1 )
( α2,1   α2,2   α2,3                  ) ( ξ2 )   ( φ2 )
(        . . .  . . .  . . .          ) ( .. ) = ( .. )
(               . . .  . . .  αn−1,n  ) ( .. )   ( .. )
(                      αn,n−1  αn,n   ) ( ξn )   ( φn )
Tridiagonal matrices and systems are ubiquitous,
the focus of myriad papers in the parallel literature.
Warning 1: Since T1 = O(n) for LU, the matrix must be large enough to make these
approaches worthwhile.
Warning 2: Much of the literature is devoted to solving multiple tridiagonal systems in
parallel (e.g. multiple right-hand sides, multiply-shifted matrices, etc.). Here we
consider the case of a single system.
Parallel tridiagonal solvers
Major and recent parallel players:
Recursive Doubling: Stone, Eğecioğlu et al., Davidson et al.
Cyclic Reduction: Hockney and Golub, Heller, Amodio et al., Arbenz,
Hegland, Gander, Goddeke et al., ...
Divide-and-conquer/partitioning: Sameh, Wang, Johnsson, Wright,
Lopez et al., Dongarra, Arbenz et al., Polizzi et al., Hwu et
al., Venetis et al.
General observation:
RD and CR are mostly appropriate for matrices factorizable by
diagonal pivoting.
Only a few consider implementation and the effects of pivoting.
Solving by Marching
Solving by Marching I
Observation: if we knew ξn, then the remaining values could be computed from the
recurrence
ξn−k−1 = (1/αn−k,n−k−1) (φn−k − αn−k,n−k ξn−k − αn−k,n−k+1 ξn−k+1),
for k = 0, . . . , n − 2, with αn,n+1 = 0.
How to go about first computing ξn?
Trick: Reorder the equations, moving the first to the last. If n ≥ 3, the system becomes
( R̃   b ) ( x̂  )   ( g  )
( a⊤  0 ) ( ξn ) = ( φ1 )
where R̃ = A2:n,1:n−1.
Solving by Marching II
Matrix R̃ is invertible, upper triangular, and banded with upper bandwidth 3;
b = A2:n,n, a⊤ = A1,1:n−1, x̂ = x1:n−1, and g = f2:n. Apply block LU:
( R̃   b )   ( I         0 ) ( R̃   b          )
( a⊤  0 ) = ( a⊤R̃⁻¹   1 ) ( 0   −a⊤R̃⁻¹b )
Thus
(phase 1) ξn = −(a⊤R̃⁻¹b)⁻¹ (φ1 − a⊤R̃⁻¹g),
(phase 2) x̂ = R̃⁻¹g − ξn R̃⁻¹b.
Remarks:
the terms R̃⁻¹[b, g] need be computed only once;
exploit the zero structure:
a⊤ = (α1,1, α1,2, 0, . . . , 0), b = (0, . . . , 0, αn−1,n, αn,n)⊤.
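The two phases can be sketched as follows (my own dense-for-clarity Python; a real code would store only the three diagonals of A and exploit the band structure of R̃):

```python
def solve_marching(A, f):
    """Tridiagonal solve by marching: move equation 1 to the end, then
    block LU with Rt = A[2:n, 1:n-1] upper triangular."""
    n = len(f)
    Rt = [row[:n - 1] for row in A[1:]]       # (n-1) x (n-1), upper triangular
    g = f[1:]
    b = [A[i][n - 1] for i in range(1, n)]
    a = A[0][:n - 1]

    def upper_solve(U, rhs):                  # back substitution
        m = len(rhs)
        y = [0.0]*m
        for i in range(m - 1, -1, -1):
            s = rhs[i] - sum(U[i][j]*y[j] for j in range(i + 1, m))
            y[i] = s/U[i][i]
        return y

    y = upper_solve(Rt, g)                    # Rt^{-1} g
    z = upper_solve(Rt, b)                    # Rt^{-1} b  (computed once)
    aRy = sum(ai*yi for ai, yi in zip(a, y))  # a^T Rt^{-1} g
    aRz = sum(ai*zi for ai, zi in zip(a, z))  # a^T Rt^{-1} b
    xi_n = -(f[0] - aRy)/aRz                  # phase 1
    x = [yi - xi_n*zi for yi, zi in zip(y, z)]  # phase 2
    return x + [xi_n]

A = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 2.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 2.0]]
f = [3.0, 4.0, 4.0, 3.0]    # exact solution is (1, 1, 1, 1)
x = solve_marching(A, f)
```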
Solving by Marching III
Metrics and remarks:
Tp = 3 log n + O(1) with p = 4n + O(1) processors.
Op = 11n + O(1), Ep = 11/(12 log n).
Fastest known algorithm for tridiagonal systems.
Numerically it can be unstable [Gau97]:
the transformed problem works with an upper triangular R̃, which can
suffer from conditioning complications [cf. discussion in GPS.15].
Goal
Given nonsingular A ∈ Rn×n and f ∈ Rn, solve Ax = f, where the i-th equation reads
αi,i−1 ξi−1 + αi,i ξi + αi,i+1 ξi+1 = φi   (with α1,0 = αn,n+1 = 0).
Solving by Recursive Doubling (not covered)
Recursive doubling I
Let the tridiagonal matrix be irreducible; then a single equation can be expressed as a
3-term scalar recurrence:
ξi+1 = −(αi,i/αi,i+1) ξi − (αi,i−1/αi,i+1) ξi−1 + φi/αi,i+1.
Setting x̂i = (ξi, ξi−1, 1)⊤ we obtain a first-order matrix recurrence
x̂i+1 = Mi x̂i ,
with
      ( ρi  σi  τi )         ( ξ1 )
Mi =  ( 1   0   0  ),  x̂1 =  ( 0  )
      ( 0   0   1  )         ( 1  )
and
ρi = −αi,i/αi,i+1,  σi = −αi,i−1/αi,i+1,  τi = φi/αi,i+1.
Recursive doubling II
If ξ1 were available, then the recurrence could be used to compute all of x̂2, ..., x̂n+1:
x̂2 = M1x̂1, x̂3 = M2x̂2 = M2M1x̂1, . . . , x̂n+1 = Mn · · · M1 x̂1 = Pn x̂1.
From the structure of the Mi's and the boundary conditions ξ0 = ξn+1 = 0 we can
obtain ξ1:
( 0  )   ( π1,1  π1,2  π1,3 ) ( ξ1 )
( ξn ) = ( π2,1  π2,2  π2,3 ) ( 0  )
( 1  )   ( 0     0     1    ) ( 1  )
Therefore, once the elements of Pn are available,
0 = π1,1 ξ1 + π1,3 ⇒ ξ1 = −π1,3/π1,1.
We have an algorithm!
Recursive doubling III
[KS73, ECL89] Use matrix parallel prefix on the sequence M1, ..., Mn in order to
compute Pn and then x̂1, . . . , x̂n to recover x. This requires Tp = O(n/p + log p) while
O1 = 15n + O(1).

Require: Irreducible A = [αi,i−1, αi,i, αi,i+1] ∈ Rn×n, and right-hand side f ∈ Rn.
Ensure: Solution of Ax = f.
1: doall i = 1, ..., n
2:   ρi = −αi,i/αi,i+1, σi = −αi,i−1/αi,i+1, τi = φi/αi,i+1
3: end
4: Compute the products P2 = M2M1, . . . , Pn = Mn · · · M1 using a parallel prefix
matrix product algorithm, where Mi = (ρi σi τi ; 1 0 0 ; 0 0 1).
5: Compute ξ1 = −(Pn)1,3/(Pn)1,1
6: doall i = 2, ..., n
7:   x̂i = Pi x̂1, where x̂1 = (ξ1, 0, 1)⊤
8: end
9: Gather the elements of x from {x̂1, . . . , x̂n}
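A compact model of this algorithm (my own Python; the prefix products are formed sequentially exactly where a parallel prefix algorithm would be used, and as a convention of this sketch the fictitious coefficient αn,n+1 is padded with 1 so that the condition ξn+1 = 0 encodes the last equation):

```python
def matmul3(A, B):
    return [[sum(A[i][k]*B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def solve_rd(sub, diag, sup, f):
    """Recursive-doubling sketch; sub, diag, sup are the three diagonals."""
    n = len(diag)
    sup = sup + [1.0]          # pad alpha_{n,n+1} := 1 (see note above)
    sub = [0.0] + sub          # alpha_{1,0} = 0
    Ms = []
    for i in range(n):         # M_i for equation i+1 (1-based)
        r = -diag[i]/sup[i]
        s = -sub[i]/sup[i]
        t = f[i]/sup[i]
        Ms.append([[r, s, t], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    # prefix products P_i = M_i ... M_1 (the parallel-prefix step)
    P, acc = [], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for M in Ms:
        acc = matmul3(M, acc)
        P.append(acc)
    xi1 = -P[-1][0][2]/P[-1][0][0]   # 0 = p11*xi_1 + p13
    x = [xi1]
    for i in range(n - 1):           # xi_{i+2} from x_hat = P_{i+1} x_hat_1
        p = P[i][0]
        x.append(p[0]*xi1 + p[2])
    return x

x = solve_rd([1.0, 1.0, 1.0], [2.0]*4, [1.0, 1.0, 1.0], [3.0, 4.0, 4.0, 3.0])
```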
Solving by Cyclic Reduction
Cyclic Reduction I
Consider three consecutive equations:
( αi−1,i−2  αi−1,i−1  αi−1,i                      ) ( ξi−2 )   ( φi−1 )
(           αi,i−1    αi,i     αi,i+1             ) ( ...  ) = ( φi   )
(                     αi+1,i   αi+1,i+1  αi+1,i+2 ) ( ξi+2 )   ( φi+1 )
(the unknown vector being (ξi−2, ξi−1, ξi, ξi+1, ξi+2)⊤).
If both sides are multiplied by the row vector
( −αi,i−1/αi−1,i−1,  1,  −αi,i+1/αi+1,i+1 ),
we obtain
−(αi,i−1/αi−1,i−1) αi−1,i−2 ξi−2 + (αi,i − (αi,i−1/αi−1,i−1) αi−1,i − (αi,i+1/αi+1,i+1) αi+1,i) ξi − (αi,i+1/αi+1,i+1) αi+1,i+2 ξi+2 = φ̃i ,
where φ̃i = −(αi,i−1/αi−1,i−1) φi−1 + φi − (αi,i+1/αi+1,i+1) φi+1.
Cyclic Reduction II
The new equation involves only the unknowns ξi−2, ξi, ξi+2.
Only 12 ops suffice to implement the transformation.
We can continue the same way, assuming n = 2^k − 1 and that no division by 0 is
encountered.
Observe: The transformations can be applied independently for
i = 2, 4, . . . , 2(2^{k−1} − 1) to obtain a tridiagonal system that involves
only the (even numbered) unknowns ξ2, ξ4, ..., ξ_{2^k−2} and is of size
2^{k−1} − 1 ≈ n/2.
We have an algorithm:
Phase 1: 1 step of CR ⇒ obtain 2 independent, order ≈ n/2 tridiagonal
systems
Phase 2: Back substitution
(Recursive) Cyclic Reduction:
Phase 1: log n steps of CR ⇒ obtain
(2, 2^{k−1} − 1), (2², 2^{k−2} − 1), . . . , (2^{k−1}, 1) tridiagonal systems
Phase 2: Back substitution
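One common way to code this is the recursive odd-even formulation below (my own Python sketch; `sub`, `diag`, `sup` hold the three diagonals, each level eliminates the odd-indexed unknowns and recurses on the half-size even system, then back-substitutes):

```python
def solve_cr(sub, diag, sup, f):
    """Cyclic reduction for n = 2**k - 1 (sub/diag/sup: the 3 diagonals)."""
    n = len(diag)
    if n == 1:
        return [f[0]/diag[0]]
    # reduced system in the even unknowns (0-based indices 1, 3, 5, ...)
    rs, rd, ru, rf = [], [], [], []
    for i in range(1, n, 2):
        p = -sub[i - 1]/diag[i - 1]          # multiplier for equation i-1
        q = -sup[i]/diag[i + 1]              # multiplier for equation i+1
        rd.append(diag[i] + p*sup[i - 1] + q*sub[i])
        rf.append(f[i] + p*f[i - 1] + q*f[i + 1])
        rs.append(p*sub[i - 2] if i > 1 else 0.0)      # coupling to xi_{i-2}
        ru.append(q*sup[i + 1] if i < n - 2 else 0.0)  # coupling to xi_{i+2}
    y = solve_cr(rs[1:], rd, ru[:-1], rf)    # recurse on half-size system
    x = [0.0]*n
    for j, i in enumerate(range(1, n, 2)):
        x[i] = y[j]
    # back substitution for the remaining unknowns
    for i in range(0, n, 2):
        s = f[i]
        if i > 0:
            s -= sub[i - 1]*x[i - 1]
        if i < n - 1:
            s -= sup[i]*x[i + 1]
        x[i] = s/diag[i]
    return x

# n = 7 example with exact solution (1, 2, 3, 4, 3, 2, 1)
x = solve_cr([1.0]*6, [4.0]*7, [1.0]*6,
             [6.0, 12.0, 18.0, 22.0, 18.0, 12.0, 6.0])
```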
CR as Gaussian elimination on odd-even reordered matrix I
[Material from GPS.15 and [GG97]]
Let Q = (e1, e3, ..., e_{2^k−1}, e2, e4, . . . , e_{2^k−2}). Then

Q⊤AQ = ( Do  B  )
        ( C⊤  De ),

where the diagonal blocks are diagonal matrices, specifically
Do = diag(α1,1, α3,3, . . . , α_{2^k−1,2^k−1}),  De = diag(α2,2, α4,4, . . . , α_{2^k−2,2^k−2}),
and the off-diagonal blocks are the rectangular matrices

B = ( α1,2                       )        ( α2,1  α2,3                          )
    ( α3,2  α3,4                 )   C⊤ = (       α4,3  α4,5                    )
    (       . . .   . . .        )        (             . . .   . . .           )
    (       α_{2^k−1,2^k−2}      ),       (       α_{2^k−2,2^k−3}  α_{2^k−2,2^k−1} ).

Note that B and C are of size 2^{k−1} × (2^{k−1} − 1). Then we can write

Q⊤AQ = ( I_{2^{k−1}}      0             ) ( Do   B                )
        ( C⊤ Do⁻¹   I_{2^{k−1}−1}       ) ( 0    De − C⊤ Do⁻¹ B ).
CR as Gaussian elimination on odd-even reordered matrix II
Therefore the system can be solved by computing the subvectors x(o), x(e) containing
the odd and even numbered unknowns:
(De − C⊤ Do⁻¹ B) x(e) = f(e) − C⊤ Do⁻¹ f(o),
Do x(o) = f(o) − B x(e),
where f(o), f(e) are the subvectors containing the odd and even numbered elements
of the right-hand side.
Observe: the Schur complement De − C⊤ Do⁻¹ B is tridiagonal, of size 2^{k−1} − 1,
because the term C⊤ Do⁻¹ B is tridiagonal and De is diagonal.
The tridiagonal structure of C⊤ Do⁻¹ B is due to the fact that C⊤ and B are upper and
lower bidiagonal respectively (albeit not square). It also follows that CR is equivalent to
Gaussian elimination with diagonal pivoting on a reordered matrix.
Costs: 5 ops/unknown, independent for each. On p = O(n) processors, Tp = 17 log n
parallel operations and Op = 17n − 12 log n (about twice that of GE with no pivoting).
Require: A = [αi,i−1, αi,i, αi,i+1] ∈ Rn×n and f ∈ Rn where n = 2^k − 1.
Ensure: Solution x of Ax = f.
{It is assumed that α(0)_{i,i−1} = αi,i−1, α(0)_{i,i} = αi,i, α(0)_{i,i+1} = αi,i+1 for i = 1 : n;
coefficients with out-of-range indices are taken to be zero}
{Reduction stage}
1: do l = 1 : k − 1
2:   doall i = 2^l : 2^l : n + 1 − 2^l
3:     ρi = −α(l−1)_{i,i−2^{l−1}} / α(l−1)_{i−2^{l−1},i−2^{l−1}},  τi = −α(l−1)_{i,i+2^{l−1}} / α(l−1)_{i+2^{l−1},i+2^{l−1}}
4:     α(l)_{i,i−2^l} = ρi α(l−1)_{i−2^{l−1},i−2^l}
5:     α(l)_{i,i+2^l} = τi α(l−1)_{i+2^{l−1},i+2^l}
6:     α(l)_{i,i} = α(l−1)_{i,i} + ρi α(l−1)_{i−2^{l−1},i} + τi α(l−1)_{i+2^{l−1},i}
7:     φ(l)_i = φ(l−1)_i + ρi φ(l−1)_{i−2^{l−1}} + τi φ(l−1)_{i+2^{l−1}}
8:   end
9: end
{Back substitution stage}
10: do l = k : −1 : 1
11:   doall i = 2^{l−1} : 2^l : n
12:     ξi = (φ(l−1)_i − α(l−1)_{i,i−2^{l−1}} ξ_{i−2^{l−1}} − α(l−1)_{i,i+2^{l−1}} ξ_{i+2^{l−1}}) / α(l−1)_{i,i}
13:   end
14: end
Numerical stability
If A is diagonally dominant by rows or columns, then the computed
solution satisfies
(A + δA) x̃ = f ,  ‖δA‖∞ ≤ 10 u log n ‖A‖∞,
and the relative forward error satisfies the bound
‖x̃ − x‖∞ / ‖x̃‖∞ ≤ 10 u log n κ∞(A),
where κ∞(A) = ‖A‖∞ ‖A⁻¹‖∞.
Similar results exist for matrices that are:
spd,
M-matrices,
totally nonnegative,
of the form D1AD2 where |D1| = |D2| = I and A as above.
In those cases it can be proved that Gaussian elimination with diagonal
pivoting to solve Ax = f succeeds, and that the computed solution
satisfies (A + δA) x̂ = f where |δA| ≤ 4u|A|, ignoring second order terms.
Observations
Potential inefficiencies:
Bank conflicts: the same memory banks are repeatedly addressed.
✓ resolved by careful tuning and data layout.
Loss of parallelism: as the reduction proceeds, fewer independent
equations exist, e.g. 1 at the last step, 2 at the next-to-last step.
✓ paracr, a modified CR algorithm: apply CR to all equations,
irrespective of parity.
Matrix splitting-based paracr I
Show that: each iteration of the algorithm can be described in terms of
operations with matrices that are diagonal, strictly lower and strictly
upper triangular.
Basic splitting: the matrix is gradually reduced to diagonal form (just as dense
matrices are transformed to upper triangular). At step j the matrix will have the form
A(j) = D(j) − L(j) − U(j)
where D(j) is diagonal, and L(j), U(j) are strictly lower and strictly upper
triangular respectively, of a very special form.
Matrix splitting-based paracr II
Definition
[Generic (tri-)diagonal matrices]
1 A lower (resp. upper) triangular matrix is called t-lower (resp.
t-upper) diagonal if its only non-zero elements are on the t-th
diagonal below (resp. above) the main diagonal.
2 We name t-tridiagonal a matrix that is the sum of a diagonal, a
t-upper diagonal and a t-lower diagonal matrix.
Note:
t-tridiagonal matrices appear in the study of special Fibonacci
numbers.
They are a special case of ''triadic matrices'' ([FO06]), i.e. matrices for which
there are at most 2 non-zero off-diagonal elements per column.
Matrix splitting-based paracr III
Products of t-diagonal matrices:
Lemma (see proof on p. 135 @ GPS)
1 The product of two t-lower (resp. upper) diagonal matrices of
equal size is 2t-lower (resp. upper) diagonal. Also, if 2t > n then
the product is the zero matrix.
2 If L is t-lower diagonal and U is t-upper diagonal, then LU and UL
are both diagonal. Moreover, the first t elements of the diagonal
of LU are 0, and so are the last t diagonal elements of UL.
Observation: it is possible to establish an arithmetic on the matrix structure
with t-diagonal matrices: e.g., if the symbol T(µ) denotes a µ-upper
diagonal matrix and T(−µ) a µ-lower diagonal matrix, then for
|µ|, |ν| < n,
T(µ)T(ν) = T(µ + ν).
Idea:
1 Write A = D − L − U where D, L, U are the diagonal and (minus)
the strictly lower and strictly upper triangular parts of A.
2 Multiply both sides of Ax = b by (D + L + U)D^{−1}:

(D + L + U) D^{−1} (D − L − U) x = (D + L + U) D^{−1} b

(the middle factor is A), which can be rewritten A^{(1)} x = b^{(1)}, where

b^{(1)} = b + L(D^{−1}b) + U(D^{−1}b)
A^{(1)} = (D − L D^{−1} U − U D^{−1} L) − L D^{−1} L − U D^{−1} U
        = D^{(1)} − L^{(1)} − U^{(1)}

Observe: D^{(1)} = D − L D^{−1} U − U D^{−1} L is diagonal, while
L^{(1)} = L D^{−1} L and U^{(1)} = U D^{−1} U are 2-lower and 2-upper
diagonal, respectively. Hence A^{(1)} is 2-tridiagonal and the process
can be repeated.
We have an algorithm!
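A quick numerical check of one reduction step (a minimal sketch with a dense 8 x 8 representation; helper names are mine). It uses the identity L + U = D − A, so (D + L + U)D^{-1} = 2I − AD^{-1}, and confirms that the ±1 bands cancel while ±2 bands appear:

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def offsets(A, tol=1e-12):
    """Diagonal offsets j - i with an entry larger than tol in magnitude."""
    return {j - i for i, row in enumerate(A) for j, a in enumerate(row) if abs(a) > tol}

# Tridiagonal test matrix A = D - L - U with diagonal 4, off-diagonals -1.
n = 8
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]

# (D + L + U) D^{-1} = (2D - A) D^{-1} = 2I - A D^{-1}
ADinv = [[A[i][j] / A[j][j] for j in range(n)] for i in range(n)]
M = [[(2.0 if i == j else 0.0) - ADinv[i][j] for j in range(n)] for i in range(n)]

A1 = matmul(M, A)           # A^(1): couplings move from offset +-1 to +-2
print(sorted(offsets(A1)))  # [-2, 0, 2]
```

Repeating the step on A^(1) would move the bands to offsets ±4, and so on, exactly as the illustration on the next slide shows.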
Illustration
[Spy plots of the reduced matrices for a 16 x 16 tridiagonal A:
A^(1) (nz = 44), A^(2) (nz = 40), A^(3) (nz = 32), A^(4) (nz = 16);
the bands move outward and the matrix thins until it is diagonal.]
Algorithm paracr
Require: A = [λ_i, δ_i, υ_{i+1}] ∈ R^{n×n} and f ∈ R^n, where n = 2^k.
Ensure: Solution x of Ax = f.
1: l = (λ_2, ..., λ_n)^T, d = (δ_1, ..., δ_n)^T, u = (υ_2, ..., υ_n)^T
2: p = −l; q = −u
3: do j = 1 : k
4:   σ = 2^{j−1}
5:   p = l ⊘ d_{1:n−σ};  q = u ⊘ d_{1+σ:n}
6:   f = f + (0_σ; p ⊙ f_{1:n−σ}) + (q ⊙ f_{σ+1:n}; 0_σ)
7:   d = d − (0_σ; p ⊙ u) − (q ⊙ l; 0_σ)
8:   l = p_{σ+1:n−σ} ⊙ l_{1:n−2σ};  u = q_{1:n−2σ} ⊙ u_{1+σ:n−σ}
9: end
10: x = f ⊘ d
Here ⊙ and ⊘ denote element-wise multiplication and division, 0_σ is
the zero vector of length σ, and (a; b) stacks vectors.
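The vector updates above can be sketched sequentially in plain Python (function and index conventions are mine, written in scalar form rather than the slide's stacked-vector notation): at stride s, every row i eliminates its couplings to rows i − s and i + s simultaneously, and all row updates within one sweep are independent, which is where the parallelism comes from.

```python
def pcr_solve(l, d, u, f):
    """Parallel cyclic reduction, run sequentially.  Row i of the current
    s-tridiagonal system reads l[i]*x[i-s] + d[i]*x[i] + u[i]*x[i+s] = f[i];
    by convention l[0] = u[n-1] = 0.  After about log2(n) sweeps the
    system is diagonal."""
    n = len(d)
    l, d, u, f = list(l), list(d), list(u), list(f)
    s = 1
    while s < n:
        L2, D2, U2, F2 = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
        for i in range(n):                      # every row updated independently
            a = -l[i] / d[i - s] if i - s >= 0 else 0.0
            b = -u[i] / d[i + s] if i + s < n else 0.0
            D2[i] = d[i] + (a * u[i - s] if i - s >= 0 else 0.0) \
                         + (b * l[i + s] if i + s < n else 0.0)
            F2[i] = f[i] + (a * f[i - s] if i - s >= 0 else 0.0) \
                         + (b * f[i + s] if i + s < n else 0.0)
            L2[i] = a * l[i - s] if i - s >= 0 else 0.0   # new coupling at 2s
            U2[i] = b * u[i + s] if i + s < n else 0.0
        l, d, u, f = L2, D2, U2, F2
        s *= 2
    return [f[i] / d[i] for i in range(n)]      # step 10: x = f / d

# Diagonally dominant test system: -x[i-1] + 4 x[i] - x[i+1] = f[i]
n = 8
l = [0.0] + [-1.0] * (n - 1)
u = [-1.0] * (n - 1) + [0.0]
d = [4.0] * n
x_true = [float(i + 1) for i in range(n)]
f = [(l[i] * x_true[i - 1] if i > 0 else 0.0) + d[i] * x_true[i]
     + (u[i] * x_true[i + 1] if i < n - 1 else 0.0) for i in range(n)]
x = pcr_solve(l, d, u, f)
print(max(abs(x[i] - x_true[i]) for i in range(n)) < 1e-10)  # True
```

On p = O(n) processors, each of the k sweeps costs O(1) parallel time, matching the T_p = O(log n) bound on the next slide.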
Observations
Tp = 8 log n + O(1) parallel operations on p = O(n) processors.
Somewhat higher total operation count Op than CR and GE.
Method applicable under the same conditions as CR (diagonally
dominant or SPD type).
In many cases, the dominance of the diagonal terms increases as
the reduction proceeds
⇒ lends itself to implementing an early-stopping strategy.
Challenge I
Design an algorithm that is robust enough to handle
general tridiagonal matrices.
Challenge II
Design an algorithm that is robust enough to handle
singular diagonal blocks while being competitive in
speed with gtsv from CUSPARSE.
THANK YOU - QUESTIONS?
S. C. Chen and D. Kuck.
Time and parallel processor bounds for linear recurrence systems.
IEEE Trans. Comput., C-24(7):701--717, July 1975.
Ö. Eğecioğlu, Ç. K. Koç, and A. J. Laub.
A recursive doubling algorithm for solution of tridiagonal systems on hypercube
multiprocessors.
J. Comput. Appl. Math., 27:95--108, 1989.
H.-R. Fang and D.P. O’Leary.
Stable factorizations of symmetric tridiagonal and triadic matrices.
SIAM J. Mat. Anal. Appl., 28(2):576--595, 2006.
W. Gautschi.
Numerical Analysis: An Introduction.
Birkhäuser, Boston, 1997.
W. Gander and G. H. Golub.
Cyclic reduction: history and applications.
In F.T. Luk and R.J. Plemmons, editors, Proc. Workshop on Scientific Computing, New York,
1997. Springer-Verlag.
N.J. Higham.
Accuracy and Stability of Numerical Algorithms.
SIAM, Philadelphia, 2nd edition, 2002.
P.M. Kogge and H.S. Stone.
A parallel algorithm for the efficient solution of a general class of recurrence equations.
IEEE Trans. Comput., C-22(8):786--793, Aug. 1973.
A.H. Sameh and R. Brent.
Solving triangular systems on a parallel computer.
SIAM J. Numer. Anal., 14(6):1101--1113, December 1977.