Recurrences and Tridiagonal solvers:
From Classical Tricks to Parallel Treats
Stratis Gallopoulos
HPCLAB
Computer Engineering & Informatics Dept.
University of Patras, Greece
West Lafayette
January 2016
Outline
1 Performance metrics (recap)
2 Linear recurrences and triangular systems
Linear recurrences: examples and matrix interpretation
Recurrences, triangular systems and parallel prefix
3 Tridiagonal systems and solvers
Solving by Marching
Solving by Recursive Doubling (not covered)
Solving by Cyclic Reduction
Performance metrics (recap)
Performance metrics
Chapter 3 @ GPS.15
Tp : runtime on p processors
Op : arithmetic operations on p processors
Sp : speedup, Sp = T1/Tp
Ep : efficiency, Ep = T1/(pTp)
Cp : parallel cost, Cp = pTp
Rp : arithmetic redundancy, Rp = Op/O1
Linear recurrences and triangular systems
Linear recurrences: examples and matrix interpretation
Terminology
Problem: In many applications we would like to compute the values of
a sequence {ξk, ξk+1, . . .} that is realized as a recurrence
ξn = γn−1ξn−1 + γn−2ξn−2 + · · · + γn−mξn−m + δn,
where the coefficients γj, δj are known.
The recurrence is of order m and is homogeneous when δj ≡ 0.
Recurrences can be about scalars, vectors, matrices ...
In order to compute a recurrence of order m it is necessary to have
available m initial values, e.g. ξ1, . . . , ξm.
Recurrences are computational kernels for many computations in science and
engineering.
The goal is to compute these values quickly and accurately, and to implement them as
efficient primitives on parallel architectures.
Obstacle: Every new value depends on previous ones, which gives the
appearance of an unavoidable sequential computation.
Example: Definite integrals
Compute
ψk = ∫₀¹ ξ^k/(ξ + 8) dξ,  for k = 0, 1, 2, . . . , n.
The following recurrence is easy to derive: ψk + 8ψk−1 = 1/k, for k ≥ 1, where ψ0 = loge(9/8).
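A quick numerical sketch (my own Python, not from the slides) of why accuracy is a concern here: running this recurrence forward in floating point is unstable, since any error in ψ0 is multiplied by −8 at every step.

```python
import math

def psi_forward(n):
    """Forward recurrence: psi_k = 1/k - 8*psi_{k-1}, psi_0 = log(9/8)."""
    psi = [math.log(9.0/8.0)]
    for k in range(1, n + 1):
        psi.append(1.0/k - 8.0*psi[-1])
    return psi

def psi_quad(k, m=200000):
    """Reference value of psi_k by the midpoint rule on [0, 1]."""
    h = 1.0/m
    return sum(((i + 0.5)*h)**k / (((i + 0.5)*h) + 8.0)*h for i in range(m))

# For small k the forward values are still accurate; by k = 20 the
# roundoff in psi_0 has been amplified by about 8**20 and the computed
# value is useless.
err5 = abs(psi_forward(5)[5] - psi_quad(5))
err20 = abs(psi_forward(20)[20] - psi_quad(20))
```

Running the recurrence backwards (solving for ψk−1 in terms of ψk) divides the error by 8 at each step instead, which is the classical cure.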
Example: Orthogonal polynomials
Chebyshev polynomials of the first kind:
Tk(ξ) = cos(k cos−1 ξ), k = 0, 1, 2, . . . ,
satisfy the 2nd order recurrence (3-term):
Tk+1(ξ)− 2ξTk(ξ) + Tk−1(ξ) = 0
where T0(ξ) = 1, T1(ξ) = ξ.
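A quick check of this three-term recurrence against the closed form (my own Python sketch):

```python
import math

def cheb_T(k, x):
    """T_k(x) via the recurrence T_{k+1}(x) = 2x*T_k(x) - T_{k-1}(x)."""
    t_prev, t = 1.0, x          # T_0 and T_1
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t = t, 2.0*x*t - t_prev
    return t

x = 0.3
# compare against the closed form cos(k*acos(x)), valid for |x| <= 1
errs = [abs(cheb_T(k, x) - math.cos(k*math.acos(x))) for k in range(8)]
```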
Example: Horner’s rule
s = a(n);
for i = n-1:-1:0
    s = x*s + a(i);
end
Linear recurrence
If we express a linear recurrence for successive n,
ξ1 = φ1,  ξi = φi − ∑_{j=k}^{i−1} λij ξj,  where i = 2, . . . , n and k = max{1, i − m}.
Key: These linear recurrences can be expressed as a lower triangular system:
x = f − L̂x, where x = (ξ1, ξ2, . . . , ξn)⊤, f = (φ1, φ2, . . . , φn)⊤. (1)
Example: if m = 3, then L̂ is a banded lower triangular matrix:

L̂ = ( 0                               )
    ( λ21  0                          )
    ( λ31  λ32  0                     )
    ( λ41  λ42  λ43  0                )
    (      λ52  λ53  λ54  0           )
    (           ...  ...  ...  ...    )
    (      λn,n−3  λn,n−2  λn,n−1  0  ).
Setting L = I + L̂, (1) becomes
Lx = f . (2)
Matrix L is lower triangular with unit diagonal and lower bandwidth m + 1.
Computing the values with forward substitution requires T1 = 2mn + O(m²) arithmetic operations.
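A sequential sketch of this forward substitution (my own Python; `lam(i, j)` is a hypothetical accessor returning λij, nonzero only for 0 < i − j ≤ m):

```python
def forward_banded(lam, f, m):
    """Solve (I + Lhat) x = f, i.e. run the order-m recurrence
    xi_i = phi_i - sum_{j=max(1,i-m)}^{i-1} lambda_ij * xi_j (1-based)."""
    n = len(f)
    x = [0.0]*n
    for i in range(1, n + 1):
        s = f[i - 1]
        for j in range(max(1, i - m), i):   # at most m previous terms
            s -= lam(i, j)*x[j - 1]
        x[i - 1] = s
    return x

# order-1 example: xi_i = phi_i - 0.5*xi_{i-1}
x = forward_banded(lambda i, j: 0.5, [1.0, 1.0, 1.0, 1.0], 1)
```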
Dense triangular systems and column sweep I
In a "dense recurrence" m ≈ n − 1, so (2) can be expressed as a dense unit lower triangular system:

( 1                                    ) ( ξ1 )   ( φ1 )
( λ21  1                               ) ( ξ2 )   ( φ2 )
( λ31  λ32  1                          ) ( ξ3 )   ( φ3 )
( λ41  λ42  λ43  1                     ) ( ξ4 ) = ( φ4 )
( ...  ...  ...  ...  . . .            ) ( .. )   ( .. )
( λn1  λn2  λn3  λn4  . . .  λn,n−1  1 ) ( ξn )   ( φn )   (3)
Starting from ξ1 = φ1, the right-hand side of (3) is deflated as f − ξ1Le1, in order to form a new
r.h.s. corresponding to a linear system of order (n − 1):

( 1                        ) ( ξ2 )   ( φ(1)_2 )
( λ32  1                   ) ( ξ3 )   ( φ(1)_3 )
( λ42  λ43  1              ) ( ξ4 ) = ( φ(1)_4 )
( ...  ...  . . .          ) ( .. )   (  ..    )
( λn2  λn3  λn4  . . .  1  ) ( ξn )   ( φ(1)_n ),
We set φ(1)_i = φi − φ1λi1, i = 2, 3, . . . , n. This can be applied recursively.
With (n − 1) processors, the algorithm requires 2(n − 1) parallel steps without any arithmetic
redundancy.
The resulting process is called the column-sweep algorithm.
Dense triangular systems and column sweep II
Require: Lower triangular matrix L of order n with unit diagonal, right-hand side f
Ensure: Solution of Lx = f
1: set φ(0)_j = φj, j = 1, . . . , n {that is, f(0) = f}
2: do i = 1 : n
3:   ξi = φ(i−1)_i
4:   doall j = i + 1 : n
5:     φ(i)_j = φ(i−1)_j − φ(i−1)_i λj,i  {compute f(i) = N⁻¹_i f(i−1)}
6:   end
7: end
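A serial model of the column sweep (my own Python; the `doall` loop is written as an ordinary loop, but its iterations over j are independent and could run in parallel):

```python
def column_sweep(L, f):
    """Solve L x = f for unit lower triangular L (full n x n list of lists)."""
    n = len(f)
    x = [0.0]*n
    phi = list(f)                     # phi holds the deflated rhs f^(i)
    for i in range(n):
        x[i] = phi[i]                 # xi_i = phi^{(i-1)}_i
        for j in range(i + 1, n):     # "doall": independent updates
            phi[j] -= x[i]*L[j][i]    # deflate the right-hand side
    return x

L = [[1.0, 0.0, 0.0],
     [2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0]]
f = [1.0, 4.0, 10.0]
x = column_sweep(L, f)
```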
Observations: The method exploits the parallelism in each of the O(n) steps of the
sweep. Thus Tp = O(n).
Questions:
Can we break the sequential dependence and obtain poly-logarithmic
computational complexity, e.g. Tp = O(log^k n)?
Can this be done with little redundancy and overhead?
Can this be done stably?
Can this be implemented on parallel architectures?
Dense triangular systems: The fan-in approach
Basic results
Theorem (Ch. 3 @ GPS.15, from [CK75], [SB77])
The order-n unit lower triangular system Lx = f can be solved in
Tp = (1/2) log² n + (3/2) log n parallel arithmetic steps using at most
p = (15/1024) n³ + O(n²) processors, with redundancy Rp = O(n).
Proof (outline) The basic idea consists of
i) reorganizing the computation as a binary tree of height O(log n);
ii) showing that each tree node requires at most O(log n) operations if
sufficient processing power is available;
iii) bounding the number of processors.
Key to proving (i) and (ii) are properties of the elementary Gauss transforms
used in LU and Gaussian elimination, and the product form of L based on
these transforms.
Proof I
Recall that a unit lower triangular matrix can be factorized as
L = N1 N2 N3 · · · Nn−1,
where

Nj = ( 1                        )
     (    . . .                 )
     (         1                )
     (         λj+1,j  1        )
     (         ...       . . .  )
     (         λn,j          1  )   (4)

Therefore, if Lx = f then
x = N⁻¹_{n−1} N⁻¹_{n−2} · · · N⁻¹_2 N⁻¹_1 f ,   (5)
where the terms N⁻¹_j are readily available (a simple sign reversal of the λij's in (4)).
Proof II
The product (5) is evaluated by fan-in: at every level, adjacent factors are multiplied in pairs while the leading factor is absorbed into the right-hand side:
M(0)_{n−1} M(0)_{n−2} M(0)_{n−3} M(0)_{n−4} · · · M(0)_3 M(0)_2 M(0)_1 f(0)
= M(1)_{n/2−1} · · · M(1)_2 M(1)_1 f(1)
= M(2)_{n/4−1} · · · M(2)_1 f(2)
= · · · = f(log n) ≡ x
Algorithm
Algorithm DTS: Triangular solver based on a fan-in approach
Require: Lower triangular matrix L of order n = 2^µ and right-hand side f
Ensure: Solution x of Lx = f
1: f(0) = f, M(0)_i = N⁻¹_i, i = 1, . . . , n − 1, where Nj is as in Eq. (4)
2: do j = 0 : µ − 1
3:   f(j+1) = M(j)_1 f(j)
4:   doall k = 1 : (n/2^{j+1}) − 1
5:     M(j+1)_k = M(j)_{2k+1} M(j)_{2k}
6:   end
7: end
Numerical issues
Care is needed regarding the numerical stability of recurrence solvers.
Algorithmic stability: masterly survey by Higham [Hig02].
There is an additional issue related to conditioning (saved for later).
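A toy model of Algorithm DTS (my own Python, dense matrices for clarity; a real implementation would exploit the sparsity of the M(j) factors, and the pair products in the `while` loop are the ones a parallel machine would form simultaneously):

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k]*B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matvec(A, v):
    n = len(v)
    return [sum(A[i][k]*v[k] for k in range(n)) for i in range(n)]

def dts_fanin(L, f):
    """Fan-in solve of unit lower triangular L x = f, n a power of 2."""
    n = len(f)
    # M^(0)_i = N_i^{-1}: identity with column i's subdiagonal of L negated
    M = []
    for i in range(1, n):
        Ni_inv = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
        for r in range(i, n):
            Ni_inv[r][i - 1] = -L[r][i - 1]
        M.append(Ni_inv)              # M[i-1] holds M^(0)_i
    while M:
        f = matvec(M[0], f)           # f^(j+1) = M^(j)_1 f^(j)
        # pair the remaining factors: M^(j+1)_k = M^(j)_{2k+1} M^(j)_{2k}
        M = [matmul(M[2*k], M[2*k - 1]) for k in range(1, (len(M) + 1)//2)]
    return f

L = [[1.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]
x = dts_fanin(L, [1.0, 2.0, 3.0, 4.0])
```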
Banded recurrences
Some results (Ch. 3 @ GPS.15)
1 Let L be a banded unit lower triangular matrix of bandwidth
m + 1, where m ≤ n/2, and λik = 0 for i − k > m. Then Lx = f
can be solved in fewer than Tp = (2 + log m) log n parallel steps
using fewer than p = m(m + 1)n/2 processors. [Th. 3.2]
2 An algorithm can be designed using
Op = m²n log(n/2m) + O(mn log n) arithmetic operations, and
arithmetic redundancy Rp = O(m log n). [Cor. 3.1]
3 More refined, lower complexity results can be obtained for lower triangular Toeplitz matrices.
4 If L is a dense Toeplitz unit lower triangular matrix, then Lx = f can be solved in about
Tp = log² n + 2 log n parallel steps, with p ≤ n²/4. [Th. 3.3]
5 If L is also banded with (m + 1) ≥ 3, then the system can be solved in about
Tp = (2 log m + 3) log n with p ≤ 3mn/4. [Th. 3.4]
Recurrences, triangular systems and parallel prefix
Recurrences: with prefixes
Let ξi = αiξi−1 + βi and compute by expanding repeatedly:
ξ1 = α1ξ0 + β1
ξ2 = α2ξ1 + β2 = α2α1ξ0 + α2β1 + β2
ξ3 = α3ξ2 + β3 = α3α2α1ξ0 + α3α2β1 + α3β2 + β3
. . .
ξn = αn · · · α1ξ0 + αn · · · α2β1 + · · · + αnαn−1βn−2 + αnβn−1 + βn

We can write
ξn = (αn · · · α1, αn · · · α2, . . . , αnαn−1, αn, 1) (ξ0, β1, . . . , βn−2, βn−1, βn)⊤

Observe: the important role played by the products
αn
αnαn−1
. . .
αnαn−1 · · · α2
αnαn−1 · · · α2α1
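A small sketch (mine, plain Python) that evaluates ξn exactly in this form: first build the products αn, αnαn−1, . . . , then take the inner product with (ξ0, β1, . . . , βn)⊤, and compare against the plain sequential recurrence:

```python
def xi_n_by_products(alpha, beta, xi0):
    """xi_n for xi_i = alpha_i*xi_{i-1} + beta_i, via the products
    alpha_n, alpha_n*alpha_{n-1}, ..., alpha_n*...*alpha_1."""
    n = len(alpha)
    # prods[i] = alpha_{i+1} * ... * alpha_n  (so prods[n] = 1)
    prods = [1.0]*(n + 1)
    for i in range(n - 1, -1, -1):
        prods[i] = prods[i + 1]*alpha[i]
    # xi_n = prods[0]*xi_0 + sum_i prods[i]*beta_i
    return prods[0]*xi0 + sum(prods[j + 1]*beta[j] for j in range(n))

def xi_n_sequential(alpha, beta, xi0):
    x = xi0
    for a, b in zip(alpha, beta):
        x = a*x + b
    return x

alpha = [2.0, 3.0, 4.0]
beta = [1.0, 1.0, 1.0]
r1 = xi_n_by_products(alpha, beta, 1.0)
r2 = xi_n_sequential(alpha, beta, 1.0)
```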
Prefixes
These terms are called prefixes with respect to multiplication, or simply
prefix products, of the (ordered) sequence
(αn, αn−1, . . . , α2, α1).
How can we do this efficiently in parallel?
Computing prefixes of sequences with respect to some associative
operation is a kernel computation in many applications.
It comes under various names and flavors:
Depending on the operation we can have prefix sums.
Sometimes it is natural to refer to suffixes.
In the literature (and computer manuals) another term is scan.
The sequence could also consist of vectors or matrices of
compatible dimensions.
Parallel prefix problem
Given A = (a1, ..., an) and an associative binary operation ⋆, compute the n prefixes:
a1, a1 ⋆ a2, a1 ⋆ a2 ⋆ a3, . . . , a1 ⋆ a2 ⋆ · · · ⋆ an.
On 1 processor the algorithm is obvious:
s1 ← a1; for i = 2 : n, si = si−1 ⋆ ai, end.
Parallel idea: apply simultaneous reductions to groups of elements.
With n/2 processors compute sn in ⌈log n⌉ parallel operations ⋆;
with (n − 1)/2 processors, sn−1, etc.
⇒ with p = O(n²) processors, compute all prefixes in Tp = O(log n).
Speedup, but this needs O(n² log n) rather than O(n) operations.
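The 1-processor scan in a short Python sketch (mine, not from the slides); any associative operation can be plugged in:

```python
from operator import add, mul

def scan(seq, op):
    """Sequential inclusive scan: s1 = a1, si = op(s_{i-1}, ai)."""
    out = []
    s = None
    for a in seq:
        s = a if s is None else op(s, a)
        out.append(s)
    return out

sums = scan(range(1, 7), add)    # prefix sums of 1..6
prods = scan(range(1, 7), mul)   # prefix products of 1..6
```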
Parallel prefix algorithms
Divide-and-conquer, e.g. for n = 8 prefix sums:
1 Partition A = {a1, ..., a8} into 2 groups of successive elements,
A1 = {a1, ..., a4} and A2 = {a5, ..., a8}.
2 Compute the prefixes of A1 and A2 independently.
3 Select the last prefix term of A1 and combine it with all prefixes of A2.
Costs: Tp(n) = Tp(n/2) + 1 if there are enough processors to compute
sn/2+1 = sn/2 ⋆ s2,1, sn/2+2 = sn/2 ⋆ s2,2, ..., sn = sn/2 ⋆ s2,n/2 in 1 step.
Thus
Tp = O(log n)
Op = O(n log n)
Further ideas: there are plenty of discussions in the literature on parallel
prefix algorithms and the tradeoffs between the metrics.
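The three steps above can be sketched directly (my own Python; serial model of the parallel algorithm, with the combine step being the one "doall" that runs in a single parallel step):

```python
from operator import add

def prefix_dc(a, op):
    """Divide-and-conquer inclusive scan: split, scan both halves, then
    combine the last prefix of the left half into every prefix of the
    right half."""
    n = len(a)
    if n == 1:
        return [a[0]]
    left = prefix_dc(a[:n//2], op)
    right = prefix_dc(a[n//2:], op)
    # combine step: independent over the right half
    return left + [op(left[-1], r) for r in right]

pref = prefix_dc([1, 2, 3, 4, 5, 6, 7, 8], add)
```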
Tridiagonal systems and solvers
Goal
Given nonsingular A ∈ Rn×n and f ∈ Rn, solve Ax = f:

( α1,1   α1,2                         ) ( ξ1 )   ( φ1 )
( α2,1   α2,2   α2,3                  ) ( ξ2 )   ( φ2 )
(        . . .  . . .  . . .          ) ( .. ) = ( .. )
(               . . .  . . .  αn−1,n  ) ( .. )   ( .. )
(                      αn,n−1  αn,n   ) ( ξn )   ( φn )
Tridiagonal matrices and systems are ubiquitous,
the focus of myriad papers in the parallel literature.
Warning 1: Since T1 = O(n) for LU, the matrix must be large enough to make these
approaches worthwhile.
Warning 2: Much of the literature is devoted to solving multiple tridiagonal systems in
parallel (e.g. multiple right-hand sides, multiply-shifted matrices, etc.). Here we
consider the case of a single system.
Parallel tridiagonal solvers
Major and recent parallel players:
Recursive Doubling: Stone, Eğecioğlu et al., Davidson et al.
Cyclic Reduction: Hockney and Golub, Heller, Amodio et al., Arbenz,
Hegland, Gander, Goddeke et al., ...
Divide-and-conquer/partitioning: Sameh, Wang, Johnsson, Wright,
Lopez et al., Dongarra, Arbenz et al., Polizzi et al., Hwu et
al., Venetis et al.
General observation:
RD and CR are mostly appropriate for matrices factorizable by
diagonal pivoting.
Only a few consider implementation and the effects of pivoting.
Solving by Marching
Solving by Marching I
Observation: if we knew ξn, then the remaining values could be computed from the
recurrence
ξn−k−1 = (1/αn−k,n−k−1) (φn−k − αn−k,n−k ξn−k − αn−k,n−k+1 ξn−k+1),
for k = 0, . . . , n − 2, with αn,n+1 = 0.
How to go about first computing ξn?
Trick: Reorder the equations, moving the first to the last. If n ≥ 3, the system becomes
( R̃   b ) ( x̂  )   ( g  )
( a⊤  0 ) ( ξn ) = ( φ1 )
where R̃ = A2:n,1:n−1.
Solving by Marching II
Matrix R̃ is invertible, upper triangular, and banded with upper bandwidth 3;
b = A2:n,n, a⊤ = A1,1:n−1, x̂ = x1:n−1, and g = f2:n. Apply block LU:
( R̃   b )   ( I         0 ) ( R̃   b          )
( a⊤  0 ) = ( a⊤R̃⁻¹   1 ) ( 0   −a⊤R̃⁻¹b )
Thus
(phase 1) ξn = −(a⊤R̃⁻¹b)⁻¹ (φ1 − a⊤R̃⁻¹g),
(phase 2) x̂ = R̃⁻¹g − ξn R̃⁻¹b.
Remarks:
the terms R̃⁻¹[b, g] need be computed only once;
exploit the zero structure:
a⊤ = (α1,1, α1,2, 0, . . . , 0), b = (0, . . . , 0, αn−1,n, αn,n)⊤.
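The two phases can be sketched as follows (my own dense-for-clarity Python; a real code would store only the three diagonals of A and exploit the band structure of R̃):

```python
def solve_marching(A, f):
    """Tridiagonal solve by marching: move equation 1 to the end, then
    block LU with Rt = A[2:n, 1:n-1] upper triangular."""
    n = len(f)
    Rt = [row[:n - 1] for row in A[1:]]       # (n-1) x (n-1), upper triangular
    g = f[1:]
    b = [A[i][n - 1] for i in range(1, n)]
    a = A[0][:n - 1]

    def upper_solve(U, rhs):                  # back substitution
        m = len(rhs)
        y = [0.0]*m
        for i in range(m - 1, -1, -1):
            s = rhs[i] - sum(U[i][j]*y[j] for j in range(i + 1, m))
            y[i] = s/U[i][i]
        return y

    y = upper_solve(Rt, g)                    # Rt^{-1} g
    z = upper_solve(Rt, b)                    # Rt^{-1} b  (computed once)
    aRy = sum(ai*yi for ai, yi in zip(a, y))  # a^T Rt^{-1} g
    aRz = sum(ai*zi for ai, zi in zip(a, z))  # a^T Rt^{-1} b
    xi_n = -(f[0] - aRy)/aRz                  # phase 1
    x = [yi - xi_n*zi for yi, zi in zip(y, z)]  # phase 2
    return x + [xi_n]

A = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 2.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 2.0]]
f = [3.0, 4.0, 4.0, 3.0]    # exact solution is (1, 1, 1, 1)
x = solve_marching(A, f)
```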
Solving by Marching III
Metrics and remarks:
Tp = 3 log n + O(1) with p = 4n + O(1) processors.
Op = 11n + O(1), Ep = 11/(12 log n).
Fastest known algorithm for tridiagonal systems.
Numerically it can be unstable [Gau97]:
the transformed problem works with an upper triangular R̃, which can
suffer from conditioning complications [cf. discussion in GPS.15].
Goal
Given nonsingular A ∈ Rn×n and f ∈ Rn, solve Ax = f, where the i-th equation reads
αi,i−1 ξi−1 + αi,i ξi + αi,i+1 ξi+1 = φi   (with α1,0 = αn,n+1 = 0).
Solving by Recursive Doubling (not covered)
Recursive doubling I
Let the tridiagonal matrix be irreducible; then a single equation can be expressed as a
3-term scalar recurrence:
ξi+1 = −(αi,i/αi,i+1) ξi − (αi,i−1/αi,i+1) ξi−1 + φi/αi,i+1.
Setting x̂i = (ξi, ξi−1, 1)⊤ we obtain a first-order matrix recurrence
x̂i+1 = Mi x̂i ,
with
      ( ρi  σi  τi )         ( ξ1 )
Mi =  ( 1   0   0  ),  x̂1 =  ( 0  )
      ( 0   0   1  )         ( 1  )
and
ρi = −αi,i/αi,i+1,  σi = −αi,i−1/αi,i+1,  τi = φi/αi,i+1.
Recursive doubling II
If ξ1 were available, then the recurrence could be used to compute all of x̂2, ..., x̂n+1:
x̂2 = M1x̂1, x̂3 = M2x̂2 = M2M1x̂1, . . . , x̂n+1 = Mn · · · M1 x̂1 = Pn x̂1.
From the structure of the Mi's and the boundary conditions ξ0 = ξn+1 = 0 we can
obtain ξ1:
( 0  )   ( π1,1  π1,2  π1,3 ) ( ξ1 )
( ξn ) = ( π2,1  π2,2  π2,3 ) ( 0  )
( 1  )   ( 0     0     1    ) ( 1  )
Therefore, once the elements of Pn are available,
0 = π1,1 ξ1 + π1,3 ⇒ ξ1 = −π1,3/π1,1.
We have an algorithm!
Recursive doubling III
[KS73, ECL89] Use matrix parallel prefix on the sequence M1, ..., Mn in order to
compute Pn and then x̂1, . . . , x̂n to recover x. This requires Tp = O(n/p + log p) while
O1 = 15n + O(1).

Require: Irreducible A = [αi,i−1, αi,i, αi,i+1] ∈ Rn×n, and right-hand side f ∈ Rn.
Ensure: Solution of Ax = f.
1: doall i = 1, ..., n
2:   ρi = −αi,i/αi,i+1, σi = −αi,i−1/αi,i+1, τi = φi/αi,i+1
3: end
4: Compute the products P2 = M2M1, . . . , Pn = Mn · · · M1 using a parallel prefix
matrix product algorithm, where Mi = (ρi σi τi ; 1 0 0 ; 0 0 1).
5: Compute ξ1 = −(Pn)1,3/(Pn)1,1
6: doall i = 2, ..., n
7:   x̂i = Pi x̂1, where x̂1 = (ξ1, 0, 1)⊤
8: end
9: Gather the elements of x from {x̂1, . . . , x̂n}
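A compact model of this algorithm (my own Python; the prefix products are formed sequentially exactly where a parallel prefix algorithm would be used, and as a convention of this sketch the fictitious coefficient αn,n+1 is padded with 1 so that the condition ξn+1 = 0 encodes the last equation):

```python
def matmul3(A, B):
    return [[sum(A[i][k]*B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def solve_rd(sub, diag, sup, f):
    """Recursive-doubling sketch; sub, diag, sup are the three diagonals."""
    n = len(diag)
    sup = sup + [1.0]          # pad alpha_{n,n+1} := 1 (see note above)
    sub = [0.0] + sub          # alpha_{1,0} = 0
    Ms = []
    for i in range(n):         # M_i for equation i+1 (1-based)
        r = -diag[i]/sup[i]
        s = -sub[i]/sup[i]
        t = f[i]/sup[i]
        Ms.append([[r, s, t], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    # prefix products P_i = M_i ... M_1 (the parallel-prefix step)
    P, acc = [], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for M in Ms:
        acc = matmul3(M, acc)
        P.append(acc)
    xi1 = -P[-1][0][2]/P[-1][0][0]   # 0 = p11*xi_1 + p13
    x = [xi1]
    for i in range(n - 1):           # xi_{i+2} from x_hat = P_{i+1} x_hat_1
        p = P[i][0]
        x.append(p[0]*xi1 + p[2])
    return x

x = solve_rd([1.0, 1.0, 1.0], [2.0]*4, [1.0, 1.0, 1.0], [3.0, 4.0, 4.0, 3.0])
```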
Solving by Cyclic Reduction
Cyclic Reduction I
Consider three consecutive equations:
( αi−1,i−2  αi−1,i−1  αi−1,i                      ) ( ξi−2 )   ( φi−1 )
(           αi,i−1    αi,i     αi,i+1             ) ( ...  ) = ( φi   )
(                     αi+1,i   αi+1,i+1  αi+1,i+2 ) ( ξi+2 )   ( φi+1 )
(the unknown vector being (ξi−2, ξi−1, ξi, ξi+1, ξi+2)⊤).
If both sides are multiplied by the row vector
( −αi,i−1/αi−1,i−1,  1,  −αi,i+1/αi+1,i+1 ),
we obtain
−(αi,i−1/αi−1,i−1) αi−1,i−2 ξi−2 + (αi,i − (αi,i−1/αi−1,i−1) αi−1,i − (αi,i+1/αi+1,i+1) αi+1,i) ξi − (αi,i+1/αi+1,i+1) αi+1,i+2 ξi+2 = φ̃i ,
where φ̃i = −(αi,i−1/αi−1,i−1) φi−1 + φi − (αi,i+1/αi+1,i+1) φi+1.
Cyclic Reduction II
The new equation involves only the unknowns ξi−2, ξi, ξi+2.
Only 12 ops suffice to implement the transformation.
We can continue the same way, assuming n = 2^k − 1 and that no division by 0 is
encountered.
Observe: The transformations can be applied independently for
i = 2, 4, . . . , 2(2^{k−1} − 1) to obtain a tridiagonal system that involves
only the (even numbered) unknowns ξ2, ξ4, ..., ξ_{2^k−2} and is of size
2^{k−1} − 1 ≈ n/2.
We have an algorithm:
Phase 1: 1 step of CR ⇒ obtain 2 independent, order ≈ n/2 tridiagonal
systems
Phase 2: Back substitution
(Recursive) Cyclic Reduction:
Phase 1: log n steps of CR ⇒ obtain
(2, 2^{k−1} − 1), (2², 2^{k−2} − 1), . . . , (2^{k−1}, 1) tridiagonal systems
Phase 2: Back substitution
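One common way to code this is the recursive odd-even formulation below (my own Python sketch; `sub`, `diag`, `sup` hold the three diagonals, each level eliminates the odd-indexed unknowns and recurses on the half-size even system, then back-substitutes):

```python
def solve_cr(sub, diag, sup, f):
    """Cyclic reduction for n = 2**k - 1 (sub/diag/sup: the 3 diagonals)."""
    n = len(diag)
    if n == 1:
        return [f[0]/diag[0]]
    # reduced system in the even unknowns (0-based indices 1, 3, 5, ...)
    rs, rd, ru, rf = [], [], [], []
    for i in range(1, n, 2):
        p = -sub[i - 1]/diag[i - 1]          # multiplier for equation i-1
        q = -sup[i]/diag[i + 1]              # multiplier for equation i+1
        rd.append(diag[i] + p*sup[i - 1] + q*sub[i])
        rf.append(f[i] + p*f[i - 1] + q*f[i + 1])
        rs.append(p*sub[i - 2] if i > 1 else 0.0)      # coupling to xi_{i-2}
        ru.append(q*sup[i + 1] if i < n - 2 else 0.0)  # coupling to xi_{i+2}
    y = solve_cr(rs[1:], rd, ru[:-1], rf)    # recurse on half-size system
    x = [0.0]*n
    for j, i in enumerate(range(1, n, 2)):
        x[i] = y[j]
    # back substitution for the remaining unknowns
    for i in range(0, n, 2):
        s = f[i]
        if i > 0:
            s -= sub[i - 1]*x[i - 1]
        if i < n - 1:
            s -= sup[i]*x[i + 1]
        x[i] = s/diag[i]
    return x

# n = 7 example with exact solution (1, 2, 3, 4, 3, 2, 1)
x = solve_cr([1.0]*6, [4.0]*7, [1.0]*6,
             [6.0, 12.0, 18.0, 22.0, 18.0, 12.0, 6.0])
```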
CR as Gaussian elimination on odd-even reordered matrix I
[Material from GPS.15 and [GG97]]
Let Q = (e1, e3, ..., e_{2^k−1}, e2, e4, . . . , e_{2^k−2}). Then

Q⊤AQ = ( Do  B  )
        ( C⊤  De ),

where the diagonal blocks are diagonal matrices, specifically
Do = diag(α1,1, α3,3, . . . , α_{2^k−1,2^k−1}),  De = diag(α2,2, α4,4, . . . , α_{2^k−2,2^k−2}),
and the off-diagonal blocks are the rectangular matrices

B = ( α1,2                       )        ( α2,1  α2,3                          )
    ( α3,2  α3,4                 )   C⊤ = (       α4,3  α4,5                    )
    (       . . .   . . .        )        (             . . .   . . .           )
    (       α_{2^k−1,2^k−2}      ),       (       α_{2^k−2,2^k−3}  α_{2^k−2,2^k−1} ).

Note that B and C are of size 2^{k−1} × (2^{k−1} − 1). Then we can write

Q⊤AQ = ( I_{2^{k−1}}      0             ) ( Do   B                )
        ( C⊤ Do⁻¹   I_{2^{k−1}−1}       ) ( 0    De − C⊤ Do⁻¹ B ).
CR as Gaussian elimination on odd-even reordered matrix II
Therefore the system can be solved by computing the subvectors x(o), x(e) containing
the odd and even numbered unknowns:
(De − C⊤ Do⁻¹ B) x(e) = f(e) − C⊤ Do⁻¹ f(o),
Do x(o) = f(o) − B x(e),
where f(o), f(e) are the subvectors containing the odd and even numbered elements
of the right-hand side.
Observe: the Schur complement De − C⊤ Do⁻¹ B is tridiagonal, of size 2^{k−1} − 1,
because the term C⊤ Do⁻¹ B is tridiagonal and De is diagonal.
The tridiagonal structure of C⊤ Do⁻¹ B is due to the fact that C⊤ and B are upper and
lower bidiagonal respectively (albeit not square). It also follows that CR is equivalent to
Gaussian elimination with diagonal pivoting on a reordered matrix.
Costs: 5 ops/unknown, independent for each. On p = O(n) processors, Tp = 17 log n
parallel operations and Op = 17n − 12 log n (about twice that of GE with no pivoting).
Require: A = [αi,i−1, αi,i, αi,i+1] ∈ Rn×n and f ∈ Rn where n = 2^k − 1.
Ensure: Solution x of Ax = f.
{It is assumed that α(0)_{i,i−1} = αi,i−1, α(0)_{i,i} = αi,i, α(0)_{i,i+1} = αi,i+1 for i = 1 : n;
coefficients with out-of-range indices are taken to be zero}
{Reduction stage}
1: do l = 1 : k − 1
2:   doall i = 2^l : 2^l : n + 1 − 2^l
3:     ρi = −α(l−1)_{i,i−2^{l−1}} / α(l−1)_{i−2^{l−1},i−2^{l−1}},  τi = −α(l−1)_{i,i+2^{l−1}} / α(l−1)_{i+2^{l−1},i+2^{l−1}}
4:     α(l)_{i,i−2^l} = ρi α(l−1)_{i−2^{l−1},i−2^l}
5:     α(l)_{i,i+2^l} = τi α(l−1)_{i+2^{l−1},i+2^l}
6:     α(l)_{i,i} = α(l−1)_{i,i} + ρi α(l−1)_{i−2^{l−1},i} + τi α(l−1)_{i+2^{l−1},i}
7:     φ(l)_i = φ(l−1)_i + ρi φ(l−1)_{i−2^{l−1}} + τi φ(l−1)_{i+2^{l−1}}
8:   end
9: end
{Back substitution stage}
10: do l = k : −1 : 1
11:   doall i = 2^{l−1} : 2^l : n
12:     ξi = (φ(l−1)_i − α(l−1)_{i,i−2^{l−1}} ξ_{i−2^{l−1}} − α(l−1)_{i,i+2^{l−1}} ξ_{i+2^{l−1}}) / α(l−1)_{i,i}
13:   end
14: end
Numerical stability
If A is diagonally dominant by rows or columns, then the computed
solution satisfies
(A + δA) x̃ = f ,  ‖δA‖∞ ≤ 10 u log n ‖A‖∞,
and the relative forward error satisfies the bound
‖x̃ − x‖∞ / ‖x̃‖∞ ≤ 10 u log n κ∞(A),
where κ∞(A) = ‖A‖∞ ‖A⁻¹‖∞.
Similar results exist for matrices that are:
spd,
M-matrices,
totally nonnegative,
of the form D1AD2 where |D1| = |D2| = I and A as above.
In those cases it can be proved that Gaussian elimination with diagonal
pivoting to solve Ax = f succeeds, and that the computed solution
satisfies (A + δA) x̂ = f where |δA| ≤ 4u|A|, ignoring second order terms.
Observations
Potential inefficiencies:
Bank conflicts: the same memory banks are repeatedly addressed.
✓ resolved by careful tuning and data layout.
Loss of parallelism: as the reduction proceeds, fewer independent
equations exist, e.g. 1 at the last step, 2 at the next-to-last step.
✓ paracr, a modified CR algorithm: apply CR to all equations,
irrespective of parity.
Matrix splitting-based paracr I
Show that: each iteration of the algorithm can be described in terms of
operations with matrices that are diagonal, strictly lower and strictly
upper triangular.
Basic splitting: the matrix is gradually reduced to diagonal form (just as dense
matrices are transformed to upper triangular). At step j the matrix will have the form
A(j) = D(j) − L(j) − U(j)
where D(j) is diagonal, and L(j), U(j) are strictly lower and strictly upper
triangular respectively, of a very special form.
Matrix splitting-based paracr II
Definition
[Generic (tri-)diagonal matrices]
1 A lower (resp. upper) triangular matrix is called t-lower (resp.
t-upper) diagonal if its only non-zero elements are on the t-th
diagonal below (resp. above) the main diagonal.
2 We name t-tridiagonal a matrix that is the sum of a diagonal, a
t-upper diagonal and a t-lower diagonal matrix.
Note:
t-tridiagonal matrices appear in the study of special Fibonacci
numbers.
They are a special case of ''triadic matrices'' ([FO06]), i.e. matrices for which
there are at most 2 non-zero off-diagonal elements per column.
Matrix splitting-based paracr III
Products of t-diagonal matrices:
Lemma (see proof on p. 135 @ GPS)
1 The product of two t-lower (resp. upper) diagonal matrices of
equal size is 2t-lower (resp. upper) diagonal. Also, if 2t > n then
the product is the zero matrix.
2 If L is t-lower diagonal and U is t-upper diagonal, then LU and UL
are both diagonal. Moreover, the first t elements of the diagonal
of LU are 0, and so are the last t diagonal elements of UL.
Observation: it is possible to establish an arithmetic on the matrix structure
with t-diagonal matrices: e.g., if the symbol T(µ) denotes a µ-upper
diagonal matrix and T(−µ) a µ-lower diagonal matrix, then for
|µ|, |ν| < n,
T(µ)T(ν) = T(µ + ν).
Idea:
1 Write A = D − L − U where D, L, U are the diagonal and (minus)
the strictly lower and strictly upper triangular parts of A.
2 Multiply both sides of Ax = b by (D + L + U)D^{−1}:

(D + L + U) D^{−1} (D − L − U) x = (D + L + U) D^{−1} b

(the middle factor is A), which can be rewritten A^{(1)} x = b^{(1)}, where

b^{(1)} = b + L(D^{−1}b) + U(D^{−1}b)
A^{(1)} = (D − L D^{−1} U − U D^{−1} L) − L D^{−1} L − U D^{−1} U
        = D^{(1)} − L^{(1)} − U^{(1)}

Observe: D^{(1)} = D − L D^{−1} U − U D^{−1} L is diagonal, while
L^{(1)} = L D^{−1} L and U^{(1)} = U D^{−1} U are 2-lower and 2-upper
diagonal, respectively. Hence A^{(1)} is 2-tridiagonal and the process
can be repeated.
We have an algorithm!
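A quick numerical check of one reduction step (a minimal sketch with a dense 8 x 8 representation; helper names are mine). It uses the identity L + U = D − A, so (D + L + U)D^{-1} = 2I − AD^{-1}, and confirms that the ±1 bands cancel while ±2 bands appear:

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def offsets(A, tol=1e-12):
    """Diagonal offsets j - i with an entry larger than tol in magnitude."""
    return {j - i for i, row in enumerate(A) for j, a in enumerate(row) if abs(a) > tol}

# Tridiagonal test matrix A = D - L - U with diagonal 4, off-diagonals -1.
n = 8
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]

# (D + L + U) D^{-1} = (2D - A) D^{-1} = 2I - A D^{-1}
ADinv = [[A[i][j] / A[j][j] for j in range(n)] for i in range(n)]
M = [[(2.0 if i == j else 0.0) - ADinv[i][j] for j in range(n)] for i in range(n)]

A1 = matmul(M, A)           # A^(1): couplings move from offset +-1 to +-2
print(sorted(offsets(A1)))  # [-2, 0, 2]
```

Repeating the step on A^(1) would move the bands to offsets ±4, and so on, exactly as the illustration on the next slide shows.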
Illustration
[Spy plots of the reduced matrices for a 16 x 16 tridiagonal A:
A^(1) (nz = 44), A^(2) (nz = 40), A^(3) (nz = 32), A^(4) (nz = 16);
the bands move outward and the matrix thins until it is diagonal.]
Algorithm paracr
Require: A = [λ_i, δ_i, υ_{i+1}] ∈ R^{n×n} and f ∈ R^n, where n = 2^k.
Ensure: Solution x of Ax = f.
1: l = (λ_2, ..., λ_n)^T, d = (δ_1, ..., δ_n)^T, u = (υ_2, ..., υ_n)^T
2: p = −l; q = −u
3: do j = 1 : k
4:   σ = 2^{j−1}
5:   p = l ⊘ d_{1:n−σ};  q = u ⊘ d_{1+σ:n}
6:   f = f + (0_σ; p ⊙ f_{1:n−σ}) + (q ⊙ f_{σ+1:n}; 0_σ)
7:   d = d − (0_σ; p ⊙ u) − (q ⊙ l; 0_σ)
8:   l = p_{σ+1:n−σ} ⊙ l_{1:n−2σ};  u = q_{1:n−2σ} ⊙ u_{1+σ:n−σ}
9: end
10: x = f ⊘ d
Here ⊙ and ⊘ denote element-wise multiplication and division, 0_σ is
the zero vector of length σ, and (a; b) stacks vectors.
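The vector updates above can be sketched sequentially in plain Python (function and index conventions are mine, written in scalar form rather than the slide's stacked-vector notation): at stride s, every row i eliminates its couplings to rows i − s and i + s simultaneously, and all row updates within one sweep are independent, which is where the parallelism comes from.

```python
def pcr_solve(l, d, u, f):
    """Parallel cyclic reduction, run sequentially.  Row i of the current
    s-tridiagonal system reads l[i]*x[i-s] + d[i]*x[i] + u[i]*x[i+s] = f[i];
    by convention l[0] = u[n-1] = 0.  After about log2(n) sweeps the
    system is diagonal."""
    n = len(d)
    l, d, u, f = list(l), list(d), list(u), list(f)
    s = 1
    while s < n:
        L2, D2, U2, F2 = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
        for i in range(n):                      # every row updated independently
            a = -l[i] / d[i - s] if i - s >= 0 else 0.0
            b = -u[i] / d[i + s] if i + s < n else 0.0
            D2[i] = d[i] + (a * u[i - s] if i - s >= 0 else 0.0) \
                         + (b * l[i + s] if i + s < n else 0.0)
            F2[i] = f[i] + (a * f[i - s] if i - s >= 0 else 0.0) \
                         + (b * f[i + s] if i + s < n else 0.0)
            L2[i] = a * l[i - s] if i - s >= 0 else 0.0   # new coupling at 2s
            U2[i] = b * u[i + s] if i + s < n else 0.0
        l, d, u, f = L2, D2, U2, F2
        s *= 2
    return [f[i] / d[i] for i in range(n)]      # step 10: x = f / d

# Diagonally dominant test system: -x[i-1] + 4 x[i] - x[i+1] = f[i]
n = 8
l = [0.0] + [-1.0] * (n - 1)
u = [-1.0] * (n - 1) + [0.0]
d = [4.0] * n
x_true = [float(i + 1) for i in range(n)]
f = [(l[i] * x_true[i - 1] if i > 0 else 0.0) + d[i] * x_true[i]
     + (u[i] * x_true[i + 1] if i < n - 1 else 0.0) for i in range(n)]
x = pcr_solve(l, d, u, f)
print(max(abs(x[i] - x_true[i]) for i in range(n)) < 1e-10)  # True
```

On p = O(n) processors, each of the k sweeps costs O(1) parallel time, matching the T_p = O(log n) bound on the next slide.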
Observations
Tp = 8 log n + O(1) parallel operations on p = O(n) processors.
Somewhat higher total operation count Op than CR and GE.
Method applicable under the same conditions as CR (diagonally
dominant or SPD type).
In many cases, the dominance of the diagonal terms increases as
the reduction proceeds
⇒ lends itself to implementing an early-stopping strategy.
Challenge I
Design an algorithm that is robust enough to handle
general tridiagonal matrices.
Challenge II
Design an algorithm that is robust enough to handle
singular diagonal blocks while being competitive in
speed with gtsv from CUSPARSE.
THANK YOU - QUESTIONS?
S. C. Chen and D. Kuck.
Time and parallel processor bounds for linear recurrence systems.
IEEE Trans. Comput., C-24(7):701--717, July 1975.
Ö. Eğecioğlu, Ç. K. Koç, and A. J. Laub.
A recursive doubling algorithm for solution of tridiagonal systems on hypercube
multiprocessors.
J. Comput. Appl. Math., 27:95--108, 1989.
H.-R. Fang and D.P. O’Leary.
Stable factorizations of symmetric tridiagonal and triadic matrices.
SIAM J. Mat. Anal. Appl., 28(2):576--595, 2006.
W. Gautschi.
Numerical Analysis: An Introduction.
Birkhäuser, Boston, 1997.
W. Gander and G. H. Golub.
Cyclic reduction: history and applications.
In F.T. Luk and R.J. Plemmons, editors, Proc. Workshop on Scientific Computing, New York,
1997. Springer-Verlag.
N.J. Higham.
Accuracy and Stability of Numerical Algorithms.
SIAM, Philadelphia, 2nd edition, 2002.
P.M. Kogge and H.S. Stone.
A parallel algorithm for the efficient solution of a general class of recurrence equations.
IEEE Trans. Comput., C-22(8):786--793, Aug. 1973.
A.H. Sameh and R. Brent.
Solving triangular systems on a parallel computer.
SIAM J. Numer. Anal., 14(6):1101--1113, December 1977.