

Page 1: Parallel reduction of banded matrices to bidiagonal form


Parallel Computing 22 (1996) 1-18

Parallel reduction of banded matrices to bidiagonal form †

Bruno Lang *

Bergische Universität GH Wuppertal, Fachbereich Mathematik, Gaußstr. 20,
D-42097 Wuppertal, Germany

Received 7 April 1995

Abstract

A parallel algorithm for reducing banded matrices to bidiagonal form is presented. In contrast to the rotation-based "standard approach", our algorithm is based on Householder transforms, therefore exhibiting considerably higher data locality (BLAS level 2 instead of level 1). The update of the transformation matrices, which involves the vast majority of the operations, can even be blocked to allow the use of level 3 BLAS. Thus, our algorithm will outperform the standard method on a serial computer with a distinct memory hierarchy. In addition, the algorithm can be efficiently implemented in a distributed memory environment, as is demonstrated by numerical results on the Intel Paragon.

Keywords: Linear algebra; Singular value decomposition; BLAS hierarchy; Distributed memory multiprocessors; Banded matrices; Parallel reduction

1. Introduction

To compute the singular value decomposition (SVD) A = UΣV^T of a matrix A ∈ R^{m×n}, one proceeds in two major steps. First, the matrix is reduced to bidiagonal form, A = U_1 B V_1^T, and then the Golub/Kahan procedure [7] is used to compute the SVD B = U_2 Σ V_2^T of the bidiagonal matrix B. (Thus, U = U_1 U_2 and V = V_1 V_2.)

If the matrix A is full, then the usual procedure for bidiagonal reduction is to apply 'full length' Householder transformations alternately from the left and from the right, as described in [7], [6].
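As a point of reference, the full-matrix procedure can be sketched in a few lines of NumPy. The following is an illustrative, unoptimized rendering of the alternating left/right Householder reduction written for this text (it is not the author's code; the helper householder is an assumption of this sketch):

import numpy as np

def householder(x):
    # Return (v, beta) with (I - beta*v*v^T) x = alpha*e_1, alpha = -+||x||_2.
    v = np.asarray(x, dtype=float).copy()
    normx = np.linalg.norm(v)
    if normx == 0.0:
        return v, 0.0
    alpha = -np.copysign(normx, v[0])
    v[0] -= alpha
    return v, 2.0 / np.dot(v, v)

def bidiagonalize(A):
    # Reduce A (m >= n) to upper bidiagonal form B with 'full length' Householder
    # transforms applied alternately from the left and from the right;
    # returns U1, B, V1 with A = U1 @ B @ V1.T.
    A = A.astype(float).copy()
    m, n = A.shape
    U1, V1 = np.eye(m), np.eye(n)
    for k in range(n):
        v, beta = householder(A[k:, k])                 # zero column k below the diagonal
        A[k:, k:] -= beta * np.outer(v, v @ A[k:, k:])
        U1[:, k:] -= beta * np.outer(U1[:, k:] @ v, v)
        if k < n - 2:
            v, beta = householder(A[k, k + 1:])         # zero row k right of the superdiagonal
            A[k:, k + 1:] -= beta * np.outer(A[k:, k + 1:] @ v, v)
            V1[:, k + 1:] -= beta * np.outer(V1[:, k + 1:] @ v, v)
    return U1, A, V1

# quick check on a small random matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
U1, B, V1 = bidiagonalize(A)
assert np.allclose(U1 @ B @ V1.T, A)

Each transform here has 'full length', i.e. it operates on complete rows or columns of the trailing submatrix; this is exactly what makes the procedure wasteful when A is banded, as discussed next.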

† The numerical experiments were performed on the Intel Paragon parallel computers at the Eidgenössische Technische Hochschule, Zürich, and at the Zentralinstitut für Angewandte Mathematik, Forschungszentrum Jülich GmbH.
* Email: [email protected]

0167-8191/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved. SSDI 0167-8191(95)00064-X


For banded matrices, however, this algorithm is not optimal because it does not make use of the banded structure. After a few steps, the sparsity of A is completely lost. Therefore, the number of operations and the memory requirements almost equal the costs for reducing a full matrix.

The 'chasing algorithm' DGBBRD included in the LAPACK 2.0 library [1] preserves the banded structure of A by immediately removing any fill-in. In each step of this algorithm, a first rotation eliminates one element of the band (thereby generating one new off-band element), and then a sequence of rotations is used to chase the off-band element down along the band.

The reduction algorithm HRed presented in this paper is based on Householder transformations [10], [8]. It also implements a chasing strategy, but it does not remove all fill-in. Thus, the memory requirements and the number of floating point operations for the reduction of A are somewhat higher than with DGBBRD.

On the other hand, Householder transforms can be implemented using level 2 BLAS, which usually perform significantly better than the level 1 rotations [9], [5].

If the transformation matrices U and V are also needed, then the overall number of operations in HRed is reduced by almost one third as compared to DGBBRD. In addition, the update of these matrices (involving the vast majority of the operations) can be done in a blocked fashion, therefore enabling the use of the level 3 BLAS [4].

Furthermore, HRed is well suited to parallel execution, as is demonstrated by numerical results on the Intel Paragon. If the matrices are 'reasonably large' then nearly full speedup can be obtained.

The remainder of the paper is organized as follows. In Section 2, a block decomposition for the banded matrix and the serial version of the algorithm HRed are presented. Section 3 shows how to efficiently parallelize the reduction algorithm by mapping the blocks to different processors. The update of the matrices U and V is treated in some detail in Section 4. In Section 5 we summarize the numerical experiments on the Intel Paragon.

2. The serial reduction algorithm

Throughout the paper, A = (a_ij) always denotes a banded matrix of size m × n with lower bandwidth b_l and upper bandwidth b_u (i.e., a_ij = 0 for i < j − b_u or i > j + b_l). Let b = b_l + b_u be the number of off-diagonals and μ = min(m, n).

We will first consider the case m ≥ n, where A is reduced to upper bidiagonal form. Then, the reduction proceeds in μ = n sweeps down the band, each of which produces one row of the resulting bidiagonal matrix B.

For the first sweep, the band is partitioned into blocks as depicted in Fig. 1a. The first block D_1 contains just the first column of the band. The remaining columns of the band are partitioned into block columns A_j of width b, and each block column is again subdivided into an upper block E_j and a lower block D_j. Thus, except for D_1 (size (1 + b_l) × 1), E_2 (size (1 + b_l) × b) and two blocks at the tail of the band, all the D_j and E_j are of size b × b.


Fig. 1. Partitioning of the banded matrix into blocks during the first, second and eighth sweep of the algorithm (pictures a, b, and c, resp.) for m = 27, n = 25, b_l = 3, and b_u = 2.

To keep the following description more concise, let "H zeroes x" be a shorthand for "the Householder transform H zeroes all but the first element of x". We first determine a left Householder transform U_1^1 that zeroes D_1. Then, the remaining block columns A_j, j = 2, 3, …, are modified as follows. The old left transform U_{j−1}^1 is applied to E_j. This completely destroys the zeros that were present in E_j. Then we determine a right Householder transform V_j^1 that zeroes the first row of E_j and apply it to E_j and D_j, thereby destroying the zeros in D_j. Finally, a new left transform U_j^1 zeroes the first column of D_j.
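In standard notation (cf. [7]), this shorthand corresponds to the usual elementary reflector; the following display is a textbook formula added here for reference, not taken from the paper:

    H = I - \beta\, v v^T, \qquad \beta = \frac{2}{v^T v}, \qquad
    v = x - \alpha e_1, \qquad \alpha = \mp \|x\|_2,
    \qquad\text{so that}\qquad H x = \alpha e_1 .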

In Fig. 1, the areas shaded medium grey represent elements that were nonzero before the respective sweep, whereas light grey stands for fill-in produced during the sweep. The elements enclosed in dashed lines are made zero during the sweep.

For the second sweep we shift the block partitioning by one row to the bottom and by one column to the right, cf. Fig. 1(b). Note that no nonzero element is lost in the shifted block decomposition, as all the elements that are not included in the new D_j or E_j were made zero during the first sweep. Then we determine Householder transforms U_1^2, V_2^2, U_2^2, V_3^2, U_3^2, …, analogously to the first sweep.

In matrix notation, the kth sweep replaces A^{k−1} (with A^0 := A) by A^k = U^k A^{k−1} V^k, where U^k = diag(I_{k−1}, U_1^k, U_2^k, U_3^k, …) and V^k = diag(I_k, V_2^k, V_3^k, V_4^k, …). After the kth sweep, the leading k × (k + 1) block B^k of A contains the first k rows of the resulting upper bidiagonal matrix B.

The reduction procedure is summarized in the following algorithm.

Algorithm 1. Serial reduction to upper bidiagonal form.

for k = 1 to n
    {adopt the block decomposition for the kth sweep}
    D_1 := U_1^k · D_1        where U_1^k zeroes D_1
    for j = 2, 3, …
        E_j := U_{j−1}^k · E_j
        A_j := A_j · V_j^k    where V_j^k zeroes the first row of E_j
        D_j := U_j^k · D_j    where U_j^k zeroes the first column of D_j
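The following NumPy sketch mimics Algorithm 1 on a small banded matrix stored as a dense array (case m ≥ n). It is an illustration written for this summary, not the author's implementation; for brevity each transform is applied to whole rows or columns, so the O(b)-sized block updates of the real algorithm are not reflected.

import numpy as np

def householder(x):
    # (v, beta) such that (I - beta*v*v^T) x has zeros below its first entry
    v = np.asarray(x, dtype=float).copy()
    normx = np.linalg.norm(v)
    if normx == 0.0:
        return v, 0.0
    alpha = -np.copysign(normx, v[0])
    v[0] -= alpha
    return v, 2.0 / np.dot(v, v)

def band_bidiagonalize(A, bl, bu):
    # Reduce a banded matrix A (m >= n, lower/upper bandwidths bl/bu) to
    # upper bidiagonal form, sweep by sweep as in Algorithm 1.
    A = A.astype(float).copy()
    m, n = A.shape
    b = bl + bu                                   # number of off-diagonals
    for k in range(n):                            # sweep k produces row k of B
        r0, r1 = k, min(k + bl + 1, m)            # rows of D_1 (column k of the band)
        v, beta = householder(A[r0:r1, k])        # U_1^k zeroes D_1
        A[r0:r1, k:] -= beta * np.outer(v, v @ A[r0:r1, k:])
        c0 = k + 1
        while c0 < n:                             # block columns A_2, A_3, ... of width <= b
            c1 = min(c0 + b, n)
            # E_j = A[r0:r1, c0:c1] has just been filled by the previous left transform;
            # V_j^k zeroes the first row of E_j beyond its first element
            v, beta = householder(A[r0, c0:c1])
            A[:, c0:c1] -= beta * np.outer(A[:, c0:c1] @ v, v)
            r0, r1 = r1, min(r1 + b, m)           # D_j = the next (at most) b rows
            if r0 >= m:
                break
            v, beta = householder(A[r0:r1, c0])   # U_j^k re-zeroes the first column of D_j
            A[r0:r1, c0:] -= beta * np.outer(v, v @ A[r0:r1, c0:])
            c0 = c1
    return A

# quick check: singular values are preserved and the result is upper bidiagonal
m, n, bl, bu = 12, 10, 3, 2
rng = np.random.default_rng(1)
A = np.tril(np.triu(rng.standard_normal((m, n)), -bl), bu)
B = band_bidiagonalize(A, bl, bu)
assert np.allclose(np.linalg.svd(B, compute_uv=False), np.linalg.svd(A, compute_uv=False))
assert np.allclose(B, np.triu(np.tril(B, 1)), atol=1e-12)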


Fig. 2. Partitioning of the banded matrix for reduction to lower bidiagonal form (m = 20, n = 21, b_l = 3, b_u = 2).

For m < n we reduce A to lower bidiagonal form. In this case, a slightly modified block decomposition is employed, see Fig. 2. The first block column contains b_u + 1 columns and is partitioned into the blocks E_1 of size 1 × (b_u + 1) and D_1 of size b × (b_u + 1). The other blocks are b × b. Now we first apply a right transform V_1^1 to (E_1; D_1) such that E_1 is zeroed. Then the first column of D_1 is zeroed by a left transform U_1^1. The remaining block columns are transformed as in Algorithm 1. There is a total of μ = m sweeps; the kth of them produces the kth column of B and overwrites A^{k−1} with A^k = U^k A^{k−1} V^k, where U^k = diag(I_k, U_1^k, U_2^k, U_3^k, …) and V^k = diag(I_{k−1}, V_1^k, V_2^k, V_3^k, …).

As the reductions to upper and lower bidiagonal form merely differ in the work for the first block column, the following sections will only treat the case m ≥ n. The modifications necessary for reduction to lower bidiagonal form are minor. In either case, the reduction requires approximately 8bμ² floating point operations.

3. The parallel reduction algorithm

To change the serial algorithm into an efficient parallel program we proceed in three steps [8]. First, we will distribute the data and the computational work to multiple processors. Then, a 'local view' of the block decomposition will lead to a pipelined parallel algorithm. Last, we will optimize the data layout and the algorithmic flow to improve the performance.

For the first two steps we assume that the number p of processors exactly matches the number of block columns in the initial decomposition of A and that the processors are linearly connected to form a chain. Then, the πth block column A_π, π = 1, …, p, as well as the work associated with that block column, is assigned to processor P_π.


To initiate the kth sweep, P_1 zeroes D_1 via a suitable left transform U_1^k. Then, U_1^k is sent to P_2, which transforms its block column A_2 and thereby generates a right transform V_2^k and a new left transform U_2^k. Now P_2 passes U_2^k to P_3, which in turn updates A_3 and determines two new transforms. Continuing this way, the activity moves along the chain of processors until all the block columns are processed. At this moment, the block decomposition is shifted for the next sweep. This implies that - except for P_1 - each processor P_π must send the first column of the transformed block column A_π to its predecessor P_{π−1}.

In the above scheme, all the processors contribute to the reduction of A, but at any given time only one of them is active. A simple modification, however, allows some of the processors to work simultaneously. The key to exploiting parallelism lies in the observation that the first column of the transformed second block column A_2 (which will become the D_1 in the following sweep) is not affected by any operation on the block columns A_j, j = 3, 4, …. Therefore, we need not wait for completion of the kth sweep in order to initiate the (k + 1)st sweep. Instead, processor P_2 sends the first column of A_2 back to P_1 as soon as the work on this block column is completed. This enables P_1 to locally shift its block and to zero the (k + 1)st column of the band, while P_3 is still transforming its block column according to the kth sweep. The same argument also applies to the other processors P_π. As soon as they receive the leading column of A_{π+1} from their successor P_{π+1}, they can locally shift their own block column and apply the next transformation to it.

The ‘start up’ of the resulting parallel Algorithm 2 is sketched in Fig. 3. Of course, we do not send Householder matrices as such but only the associated Householder vectors.

Algorithm 2. Parallel reduction to upper bidiagonal form, raw version.

π := my processor number
for k = 1 to n
    if A_π is not empty
        if π > 1
            receive(U_{π−1}^k) from P_{π−1}
            E_π := U_{π−1}^k · E_π
            A_π := A_π · V_π^k    where V_π^k zeroes the first row of E_π
        D_π := U_π^k · D_π        where U_π^k zeroes the first column of D_π
        if π > 1
            send(first column of A_π) to P_{π−1}
        shift A_π to the right/bottom by one column/row
        if A_{π+1} is not empty
            send(U_π^k) to P_{π+1}
            receive(new last column for A_π) from P_{π+1}
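A schematic message-passing rendering of this loop is sketched below with mpi4py (the paper's implementation is Fortran 77 with BLACS and Intel communication, not this). Everything supplied through the kernels object is a hypothetical stub; only the pipelined send/receive pattern along the processor chain is meant to be illustrative.

from mpi4py import MPI

def pipelined_sweeps(comm, my_block, n_sweeps, kernels):
    # Communication skeleton of Algorithm 2. `kernels` is a hypothetical object
    # supplying the numerical work: kernels.is_empty(block), kernels.transform(block,
    # u_prev) -> (u_new, first_column), kernels.shift(block), kernels.append(block,
    # column), and kernels.right_neighbour_active(k) mirroring "A_{pi+1} is not empty".
    rank, p = comm.Get_rank(), comm.Get_size()      # rank r holds block column A_{r+1}
    for k in range(1, n_sweeps + 1):
        if kernels.is_empty(my_block):
            continue
        u_prev = comm.recv(source=rank - 1) if rank > 0 else None   # U_{pi-1}^k
        u_new, first_col = kernels.transform(my_block, u_prev)      # E, V, new U, update
        if rank > 0:
            comm.send(first_col, dest=rank - 1)     # leading column back to the left
        kernels.shift(my_block)                     # local shift by one column/row
        if rank < p - 1 and kernels.right_neighbour_active(k):
            comm.send(u_new, dest=rank + 1)         # only the Householder vector travels
            kernels.append(my_block, comm.recv(source=rank + 1))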

As the first sweep progresses along the band, more and more processors become active, cf. Fig. 3. Later on, as the remaining band becomes shorter, an increasing number of processors falls back to inactivity as their block columns become 'empty'.


Fig. 3. Computational activity and communication during the 'start up' of Algorithm 2. The shaded blocks are modified during the respective time step. Light grey is used for transformations that result from eliminating the first column of the band, medium grey refers to the second and dark grey to the third sweep. U_j^k stands for sending a Householder transform, C for the transport of a column. The communication takes place at the end of each time step, after the computations.

On the average, only one half of the p processors that shared the band at the start of the algorithm will hold a nonempty block column. Thus, one half of the available computational power is lost by load imbalance.

In addition, only one-half of the processors with nonempty block columns can work at any given time, as may be observed from Fig. 3. The other processors are idle because they are waiting for transformations from their left neighbors and for columns from their right neighbors. This data dependence reduces the attainable speedup by another factor of 2.

Fortunately, both obstacles can be circumvented by suitably changing the data layout, at the price of reducing the number of processors in use. First, we do not distribute the band to the processors by single block columns but by logical blocks L_j, each L_j consisting of two block columns A_{2j−1} and A_{2j}. As we will see, this


completely eliminates the 'idle waiting' time which was caused by the fact that adjacent block columns cannot be transformed at the same time. In addition, the logical blocks are spread cyclically over the processors: if p processors P_1, …, P_p are available, then P_π holds the logical blocks L_j with j ≡ π (mod p). This wrapped mapping requires the processors to be connected as a ring. If each processor holds w = n/(2pb) logical blocks of the initial band (w is called the wrap of the distribution), then the loss by load imbalance is reduced from a factor of 2 to approximately 1 + 1/w. Following this rule, one might expect a speedup of almost 53 for reducing a matrix with n = 24000 and b = 40 on p = 64 processors. Due to communication and synchronization overhead, the speedup obtained in practice was slightly lower (48.6, see Section 5).
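These numbers can be reproduced with a few lines of Python (the formula speedup ≈ p/(1 + 1/w) is a paraphrase of the statement above, not an explicit formula from the paper):

n, b, p = 24000, 40, 64
w = n / (2 * p * b)            # wrap: logical blocks (two block columns each) per processor
print(w)                       # 4.6875
print(p / (1 + 1 / w))         # 52.75...  -> "a speedup of almost 53"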

Now, in each time step of the algorithm, a processor will have to process up to ⌈w⌉ logical blocks, and each logical block requires modifying two consecutive block columns. To this end we subdivide each time step into two phases φ = 1, 2. Before the first phase starts, processor P_π receives ν transformations from its left neighbor. (ν is the number of P_π's 'active' logical blocks. This number varies as the algorithm progresses. As long as the activity has not yet reached the processor, it remains 0, then slowly increases to the number of logical blocks that are stored in P_π, and finally decreases with the number of the nonempty logical blocks.) During the first phase the processor transforms the first block column of (the first ν of) its logical blocks. Between the first and the second phase, the first columns of the transformed block columns are sent back to the left neighbor, where they are appended to the corresponding second block columns. In the second phase the processor transforms the second block column of each logical block, using the left transforms generated in the first phase and generating new ones. These are passed to the right neighbor at the end of the second phase.

Algorithm 3. Parallel reduction to upper bidiagonal form, optimized data layout.

π := my processor number
ν := 0    {number of 'active' (transformable) block columns}
for k = 1 to n
    { - first phase of time step k - }
    if π = 1
        zero D_1 and generate U_1^k    {initiate the kth sweep}
        transform A_{1,λ,1}, λ = 2, …, ν + 1, using the Us from step k − 1 and generating new Us
    else
        transform A_{π,λ,1}, λ = 1, …, ν, using the Us from step k − 1 and generating new Us
    send(the first columns of these ν block columns) to the left neighbor
    receive(last columns for local logical blocks) from the right neighbor
    ν := number of Us generated in phase 1
    { - second phase of time step k - }
    transform A_{π,λ,2}, λ = 1, …, ν, using the Us from phase 1 and generating new Us
    send(the new Us) to the right neighbor
    receive(another set of Us) from the left neighbor
    ν := number of Us received

In Algorithm 3 we used a 'local' notation for the block columns: A_{π,λ,φ} is the φth block column in the λth logical block of processor P_π, i.e. the block column 2·((λ − 1)p + π − 1) + φ of the band.

Fig. 4. Distribution of the logical blocks to the processors (for p = 3) and 'start up' of Algorithm 3. The shaded block columns are modified during the respective phase. The Us indicate communication of Householder vectors, C and 2C stand for the transport of one and two columns, respectively, and ∅ denotes empty communication. The communication always takes place at the end of the phase, after the computations.


Each transformation of a block column also involves a right Householder transform and requires shifting the elements of the block column by one column and one row. These facts are omitted in the algorithm. The 'start up' of the algorithm is sketched in Fig. 4.
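For illustration, the local and global numberings can be converted into each other by two small helper functions (a sketch of the mapping just described, using 1-based indices as in the paper):

def global_block_column(pi, lam, phi, p):
    # Band block column stored as A_{pi,lam,phi}: the phi-th (phi = 1, 2) block column
    # of the lam-th logical block held by processor P_pi (p processors in the ring).
    return 2 * ((lam - 1) * p + pi - 1) + phi

def owner(c, p):
    # Inverse mapping: which P_pi, logical block lam and position phi hold block column c.
    j = (c + 1) // 2               # logical block L_j consists of A_{2j-1} and A_{2j}
    phi = 2 - (c % 2)
    pi = (j - 1) % p + 1           # wrapped mapping: P_pi holds the L_j with j = pi (mod p)
    lam = (j - 1) // p + 1
    return pi, lam, phi

assert global_block_column(2, 1, 1, p=3) == 3   # cf. Fig. 4: A_3 is the first block column of P_2
assert owner(15, p=3) == (2, 3, 1)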

The communication pattern generated by this algorithm is very regular, see Fig. 4. After the first phase of each time step, the leading columns of the block columns A_{π,λ,1} are cyclically shifted to the left, while the transforms are exchanged after the second phase in a cyclic shift to the right. As one half of the leading columns and of the transforms can be passed within the same processor, the overall communication volume is halved as compared to Algorithm 2. (It could be further reduced by aggregating more than two block columns in the logical blocks. But then the load imbalance would become more pronounced due to a decreasing wrap w.) The algorithm is 'self-regulating' in the sense that the number of block columns to transform is determined from the number of Us that were received or generated in the previous phase.

4. Updating the matrices U and V

In this section we will discuss several techniques for overwriting two matrices U and V with U·U_1 and V·V_1, resp., where A = U_1 B V_1^T. Usually, but not necessarily, U and V will be unit matrices. If the banded matrix A itself results from another reduction (e.g., from dense to banded), then U and V may be arbitrary orthogonal matrices.

It will suffice to have a detailed look at handling U in the case m ≥ n (reduction to upper bidiagonal form); the same techniques also apply to updating V and to the case m < n. Let f_j^k and

    l_j^k := min(k + b_l + (j − 1)b, m)

be the indices of the first and last rows of A that are affected by U_j^k (cf. Fig. 5). Updating U requires applying all the U_j^k to the corresponding columns of U. The total number of operations for all these transformations U(:, f_j^k : l_j^k) := U(:, f_j^k : l_j^k) · U_j^k is approximately 2μ²m, where μ = min(m, n).

In the serial reduction process, it is quite simple to do this update 'on the fly', i.e., immediately following the computation of U_j^k and its application to D_j. This update technique employs level 2 BLAS routines for the work in U.
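In NumPy terms, the 'on the fly' update of one transform U_j^k, given by a Householder vector v and scalar beta, amounts to a single rank-1 update of a narrow column panel of U (a sketch; f and l are 0-based column indices here):

import numpy as np

def update_U(U, v, beta, f, l):
    # U(:, f:l) := U(:, f:l) * (I - beta*v*v^T)   -- level 2 BLAS style (GEMV + GER)
    panel = U[:, f:l + 1]
    panel -= beta * np.outer(panel @ v, v)
    return U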

We can even use level 3 BLAS for the update if we decouple the work on U from the reduction of A [2]. During each sweep of the reduction, the Householder transforms must be determined and applied to A in the canonical order U_1^k, U_2^k, U_3^k, …, because each U_j^k depends on data modified by U_{j−1}^k. Once the transforms are known this dependence no longer exists: as the transforms from one sweep involve disjoint sets of columns of U, they may be applied to U in any


Fig. 5. Rows of A affected by each transform U_j^k (only f_j^k is given) of the first four reduction sweeps. In this example, m = 27, b_l = 3, and b_u = 2. The transforms that can be aggregated into a block transform are shaded in the same grey tone.

order, see Fig. 6. We are, however, not entirely free to mix transforms from different sweeps: U_j^k must be preceded by U_j^{k−1} and U_{j+1}^{k−1}, since it affects columns that are modified by these two transforms in sweep k − 1 (inter-sweep dependence).

To make use of the additional freedom, we delay the work on U until a certain number n_b of reduction sweeps k_1, k_1 + 1, …, k_2 = k_1 + n_b − 1 are completed. Then the update of U is done "bottom up" by applying the transforms in the order U_{jmax}^{k_1}, U_{jmax}^{k_1+1}, …, U_{jmax}^{k_2}, U_{jmax−1}^{k_1}, …, U_{jmax−1}^{k_2}, …, U_1^{k_1}, U_1^{k_1+1}, …, U_1^{k_2}. This order preserves the inter-sweep dependence mentioned above. In addition, the transforms U_j^k with the same index j can be aggregated into a WY block transform [7]. (In Fig. 5, the transforms contributing to the same block transform are shaded with the same grey level. In Fig. 6, they belong to the same column of the diagrams.)

For the 'regular' block transforms (i.e. those comprising n_b transforms of length b), the matrices W and Y are (b + n_b − 1) × n_b. The nonzero entries of Y form a parallelogram while W contains a nonzero upper trapezoid, see Fig. 7.

Fig. 6. Interdependence of the "left" Householder transformations U_j^k for the work on A (left picture) and on U (right picture). "α → β" indicates that β cannot be determined and applied to A (cannot be applied to U) until α has been applied to A (to U).


Fig. 7. Nonzero structure of the matrices W and Y for a 'regular' block transform (n_b = 4, b = 5).

Multiplying W to the corresponding columns of U takes approximately n_b(2b + n_b − 1)m floating point operations, and the multiplication with Y^T costs another 2·n_b·b·m operations, even if the multiplication routine GEMM is able to take full advantage of zeros. Thus, in the banded context, using the WY representation increases the overall operations count in two ways. First, the W and Y factors must be generated. Second, applying a rank-n_b block transform requires more than the 4·n_b·b·m operations that would be needed for the n_b single Householder transforms comprised in W and Y. On machines with a distinct memory hierarchy, however, the higher performance of the level 3 BLAS will usually more than compensate for the overhead induced by blocking the transforms.
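The aggregation itself can be sketched in NumPy as follows: n_b transforms with the same block index j, each given by a Householder vector of length b that acts one row further down than its predecessor, are accumulated into Q = I − W·Yᵀ and then applied to a panel of U with two matrix-matrix products. This is an illustration of the standard WY representation [7], not the author's code; the shifted embedding of the vectors is what produces the parallelogram/trapezoid structure of Fig. 7.

import numpy as np

def build_WY(vs, betas):
    # vs[i] (length b) is the vector of the i-th transform; it acts on rows
    # i .. i+b-1 of a window of height b + nb - 1.
    nb, b = len(vs), len(vs[0])
    h = b + nb - 1
    W, Y = np.zeros((h, 0)), np.zeros((h, 0))
    for i, (v, beta) in enumerate(zip(vs, betas)):
        y = np.zeros(h)
        y[i:i + b] = v                         # shifted Householder vector
        w = beta * (y - W @ (Y.T @ y))         # new column: beta * (I - W Y^T) y
        W, Y = np.column_stack([W, w]), np.column_stack([Y, y])
    return W, Y

def apply_WY(U, W, Y, f):
    # U(:, f:f+h) := U(:, f:f+h) (I - W Y^T)   -- two GEMM-like products (level 3 BLAS)
    panel = U[:, f:f + W.shape[0]]
    panel -= (panel @ W) @ Y.T
    return U

# check: the blocked update agrees with applying the nb transforms one after another
rng = np.random.default_rng(0)
b, nb = 5, 4
vs = [rng.standard_normal(b) for _ in range(nb)]
betas = [2.0 / (v @ v) for v in vs]
U = rng.standard_normal((9, b + nb - 1))
W, Y = build_WY(vs, betas)
U_seq = U.copy()
for i, (v, beta) in enumerate(zip(vs, betas)):
    y = np.zeros(b + nb - 1); y[i:i + b] = v
    U_seq -= beta * np.outer(U_seq @ y, y)
assert np.allclose(apply_WY(U.copy(), W, Y, 0), U_seq)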

To incorporate the update in the parallel reduction algorithm, we must first decide on the data layout for the matrices U and V. In view of the blocked update techniques discussed below, these matrices should be distributed in such a way that each processor holds a (contiguous or scattered) set of whole rows. Then each processor must apply all the U_j^k to its local part U_loc(:, f_j^k : l_j^k) of U, and all the V_j^k to its share of V.

Including the update into Algorithm 3 is slightly more complex than in the serial case because the reduction does not proceed sweep by sweep, cf. Fig. 8. Therefore, blocking cannot be achieved by delaying the update until some sweeps are completed. In addition, each processor knows only a part of the transforms (those it generates itself and the ones it receives from its left neighbor). As all the transforms are needed to update U_loc, they must be exchanged across the processors. We will discuss four methods for embedding the update into the parallel reduction algorithm. They differ in the number of communication steps and in the amount of data to send and to buffer.

Technique I is an adaptation of the serial 'on the fly' update. The Householder transforms are applied in a non-blocked fashion; therefore, only level 2 BLAS can be used. During each phase of the reduction, the processors store their local U_j^k in a buffer. At the end of the phase, the local transforms are exchanged across the processors. Then, each processor applies all the transforms to the appropriate columns of its U_loc. As the transforms from one phase are independent, these updates may be done in an arbitrary order. This method requires a buffer of size ≈ m/2 for holding the U_j^k from one phase.


Fig. 8. "Left" transforms generated in each phase of the first eight time steps. White colour indicates transforms generated in P_1, light grey stands for P_2, and dark grey for P_3 (p = 3). The transforms that contribute to the same WY block transform are connected by dotted lines (n_b = 4).

For using block transforms it is necessary to delay the update until some number n_b of time steps have been completed. In Fig. 8, the U_j^k that can be aggregated into the same block transform are connected with dotted lines. Note that these transforms reside in the same processor. As in the serial case, the block transforms must be applied in 'bottom up' order to preserve the inter-sweep data dependence.

Technique II: During n_b time steps, each processor stores its own U_j^k in a buffer. Then, the block transforms are generated and applied in 'bottom up' order:

for j = jmax downto 1
    if I hold the U_j^k for block transform j
        send(these transforms) to my left neighbor
    else
        receive(these transforms) from my right neighbor
        if the transforms did not originate from my left neighbor
            send(them) to the left neighbor
    build the W and Y factors for the jth block transform
    apply the block transform to the appropriate columns of U_loc

Note that the transforms are broadcast by circulating them through the processor ring. This technique requires a buffer of size ≈ n_b·m/p for holding the local transforms from n_b time steps and two buffers of size (b + n_b − 1) × n_b for W and Y.


Technique III tries to minimize the number of communication steps and the time that is spent waiting for transforms, without using excessive buffer storage. To this end, the block transforms are partitioned into 'exchange groups' of 2p consecutive block transforms each. Thus, the gth group comprises the block transforms j = 2(g − 1)p + 1, …, 2gp; processor P_π holds the U_j^k for two of them: j = 2(g − 1)p + 2π − 1 and j = 2(g − 1)p + 2π. For each exchange group, all the respective transforms are exchanged in a single communication step; then the processors can generate and apply the 2p block transforms:

for g = ⌈jmax/(2p)⌉ downto 1
    send(the U_j^k for 'my' block transforms of group g) to all the other processors
    receive(the U_j^k for the remaining 2(p − 1) block transforms of the group) from the other processors
    for j = 2gp downto 2(g − 1)p + 1
        generate the W and Y factors for the jth block transform
        apply the block transform to the appropriate columns of U_loc

In addition to the memory requirements of Technique II, another buffer of size 2p·n_b·b is needed to hold the transforms of each exchange group.

Technique IV minimizes the redundant computations. Instead of generating all the Ws and Ys, each processor computes only its 'local' block transforms; then the WY factors are exchanged:

for g = ⌈jmax/(2p)⌉ downto 1
    generate the W and Y factors for 'my' block transforms of group g
    send(these two Ws and two Ys) to all the other processors
    receive(2(p − 1) WY block transforms) from the other processors
    for j = 2gp downto 2(g − 1)p + 1
        apply the block transform j to the appropriate columns of U_loc

Now, two buffers of size 2p·n_b·(b + n_b − 1) replace the transform buffer required in Technique III. Also, two matrices of size (b + n_b − 1) × n_b must be sent instead of the n_b Householder vectors of length b. Thus, the overall communication volume has more than doubled as compared to Technique III.

If communication is not too slow, then the fourth method will perform best; otherwise Technique III may be better. These two variants, however, require workspace proportional to the number of processors. In case of tight memory, one of the first two methods should be used.
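The workspace and communication trade-offs stated above can be tabulated directly; the formulas are taken from the text, while the concrete numbers below (p = 32, b = 20, n_b = 6, m = 2000) are merely an example chosen to match the setting of Section 5:

p, b, nb, m = 32, 20, 6, 2000

buf_local = nb * m / p                           # Techniques II-IV: own transforms of nb time steps
buf_group_III = 2 * p * nb * b                   # Technique III: Householder vectors of one exchange group
buf_group_IV = 2 * (2 * p * nb * (b + nb - 1))   # Technique IV: two WY buffers replacing it
print(buf_local, buf_group_III, buf_group_IV)

sent_per_block_III = nb * b                      # nb Householder vectors of length b
sent_per_block_IV = 2 * nb * (b + nb - 1)        # the W and Y factors
print(sent_per_block_IV / sent_per_block_III)    # 2.5 -> "more than doubled"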

5. Numerical results

The numerical experiments were performed on the Intel Paragon parallel computers at the Eidgenössische Technische Hochschule, Zürich, and at the Zentralinstitut für Angewandte Mathematik, Forschungszentrum Jülich GmbH. All computations were done in double precision.


Fig. 9. Execution times for the LAPACK routine DGBBRD and the parallel algorithm HRed for bandwidths b = 10, 20, 40. The left picture gives the timings for reduction alone, whereas the timings of the right picture include the update of U and V.


The parallel algorithm HRed and the corresponding serial algorithm DGBBRD from LAPACK 2.0 are both implemented in Fortran 77. Calls to the BLAS were directed to the optimized kmath library by Kuck and Associates. As the timings are almost insensitive to the values of the matrix entries, we used random matrices for the timings presented here. Also, the performance only depends on the total number b of off-diagonals and not on the particular values of b_l and b_u. Therefore, we chose b_l = b_u = b/2. All matrices were square. For updating U and V, Technique IV from Section 4 with n_b = 6 (denoted by HRed(6) in this section) was optimal for most problem sizes.

We first compared the performance of HRed, run on a single processor, with the LAPACK routine DGBBRD.

Fig. 10. Deviation ‖UUᵀ − I‖ from orthogonality and residual ‖UBVᵀ − A‖ (in multiples of the machine precision ε = 2.22·10⁻¹⁶) for the routines DGBBRD and HRed(6).


Table 1
Comparison of the different update techniques on 32 processors

Update technique                 Execution time in seconds
                                 n = 1000, b = 10    n = 2000, b = 20
I,   BLACS communication         15.4                62.3
II,  BLACS communication         14.6                58.9
III, BLACS communication         11.1                52.0
IV,  BLACS communication         10.5                46.9
IV,  Intel communication          9.3                45.7

As can be seen in the left diagram of Fig. 9, the execution time of DGBBRD increases with b, due to the increasing number of operations necessary to reduce the band. In contrast, HRed takes less time to reduce a matrix with bandwidth b = 20 than for bandwidth b = 10, although about twice as many operations are needed! This phenomenon is explained by the varying performance of the BLAS. For bandwidth b = 20, the number of blocks in the band is halved as compared to b = 10. Therefore, the reduction requires fewer calls to the BLAS while each call involves a larger block, thus more than doubling the MFlops rate. From b = 20 to b = 40, the performance of the BLAS again increases, but it can no longer fully compensate for the larger overall number of operations. For n = 2000, the speedups of HRed over DGBBRD range from 1.6 (b = 10) to 3.8 (b = 40), even though HRed needs more operations (see Table 2).

If the update of U and V is included (right diagram of Fig. 9), then the overall number of operations only slightly increases with b. For moderate bandwidths, the execution time of HRed(6) even decreases. This again is an effect of the varying BLAS performance. As HRed(6) needs fewer operations than DGBBRD, the speedups for n = 800 now range from 4.1 (b = 10) to 8.0 (b = 40).

The rounding errors introduced by both methods are of the same magnitude, cf. Fig. 10. As a rule, the errors from DGBBRD increase with b, whereas those from HRed(6) tend to decrease.

Fig. 11. Overall performance of the parallel algorithm HRed for reduction alone (left diagram) and for reduction with update of U and V (right diagram).


Table 2
Approximate operation counts and memory requirements for the parallel algorithm HRed and for the LAPACK routine DGBBRD. b is the total number of off-diagonals, μ = min(m, n)

                                 HRed          DGBBRD
Operations for reducing A        8bμ²          6bμ²
Operations for updating U        2mμ²          3mμ²
Operations for updating V        2nμ²          3nμ²
Storage for A                    (2b + 1)n     (b + 1)n

The data of the diagrams were obtained for the 'median' bandwidth b = 20.

For almost all configurations (problem size and number of processors), Technique IV for updating U and V (cf. Section 4) performed best. Table 1 gives some representative timings for runs on 32 processors. The timings reported in the first rows of the table were obtained with a program that is entirely based on the BLACS communication routines [3] available from netlib. For the last row, the program was modified to also call some routines from the Intel communication library. (These modifications, however, were limited to only two crucial data distribution steps.) As the timing data suggest, this optimization can significantly improve the performance for "small" matrices, whereas the gain for larger matrices remains limited. The following performance data were obtained with the modified program (BLACS and Intel communication).

Fig. 11 gives the overall performance that can be attained by reducing matrices of various sizes on different numbers of processors. The bandwidth was kept constant at b = 40. The MFlops rates are based on the operation counts for the non-blocked update technique, as given in Table 2. They do not include the additional operations for generating the WY factors, nor do they account for the fact that - in the banded context - applying a rank-n_b block reflector requires more operations than n_b single Householder transforms (see Section 4).
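In other words, the rates are computed from the nominal counts of Table 2, e.g. as in the following small helper (my own rendering of the bookkeeping, not code from the paper):

def mflops(n, b, seconds, with_update=True, m=None):
    # Nominal MFlops rate based on the non-blocked operation counts of Table 2.
    m = n if m is None else m
    mu = min(m, n)
    flops = 8 * b * mu**2                       # reduction of A
    if with_update:
        flops += 2 * m * mu**2 + 2 * n * mu**2  # updating U and V
    return flops / seconds / 1e6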

Fig. 12. Performance per processor as a function of n/p, the number of columns per processor (left), and of n^{3/2}/p (right), resp.


To demonstrate that the algorithm scales well if the problem size n is increased with the number p of processors, we redraw the data from Fig. 11 in a slightly different way, cf. Fig. 12. The performance per processor remains almost unchanged with respect to the number of processors if some critical parameter is kept constant. For reduction without updating U or V, this parameter is n/p, the number of columns per processor (or equivalently, the number of elements of A per processor). If the update is included, then the parameter is n^{3/2}/p, which lies between the standard "scalability measures" n/p (number of columns of U and/or V per processor) and n²/p (number of matrix elements per processor). Thus, considering the fact that the reduction problem is sparse, the algorithm scales quite well.

6. Conclusions

We presented a medium to large grained parallel algorithm for reducing banded matrices to bidiagonal form and, optionally, accumulating the transformations in two orthogonal matrices U and V. Without updating these matrices, our algorithm needs slightly more operations than the rotation-based 'standard' method, as implemented in LAPACK, see Table 2. Nevertheless it may be superior even on a serial computer because it relies mainly on the level 2 BLAS. If U and/or V are updated, then our algorithm has a lower operations count; in addition, most of the operations can be done in matrix-matrix multiplications (BLAS 3). Our approach has just one major drawback: it needs about twice as much memory to hold and reduce the banded matrix.

When run on multiple processors, nearly full speedup can be obtained if the matrix is not too small. Depending on the matrix size and on the total bandwidth, the reduction alone runs at up to 15 MFlops per node on the Intel Paragon; if U and/or V are updated, then the performance may exceed 30 MFlops per node.

Acknowledgments

The numerical experiments were performed on the Intel Paragon parallel computers at the Eidgenössische Technische Hochschule, Zürich, and at the Zentralinstitut für Angewandte Mathematik, Forschungszentrum Jülich GmbH. The author wants to thank both institutions for providing him access to these machines and for the generous help from the respective support teams.

References

[l] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users’ Guide, SIAM Publications, Philadelphia, PA, 2nd edition (1995).

[2] C. Bischof, X. Sun, and B. Lang, Parallel tridiagonalization through two-step band reduction, in: Proceedings of the Scalable High-Performance Computing Conference, Knoxville, Tennessee (IEEE Computer Society, 1994) 23-27.

[3] J. Dongarra and R. Van de Geijn, Two dimensional basic linear algebra communication subprograms, Technical report, University of Tennessee, 1991.

[4] J.J. Dongarra, J. Du Croz, I.S. Duff, and S. Hammarling, A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Soft. 16 (1990) 1-17.

[5] J.J. Dongarra, J. Du Croz, S. Hammarling and R.J. Hanson, An extended set of FORTRAN basic linear algebra subprograms, ACM Trans. Math. Soft. 14 (1988) 1-17.

[6] J.J. Dongarra, D.C. Sorensen and S. Hammarling, Block reduction of matrices to condensed forms for eigenvalue computations, J. Computational and Applied Math. 27 (1989) 215-227.

[7] G.H. Golub and C.F. Van Loan, Matrix Computations (Johns Hopkins University Press, Baltimore, MD, 2nd edition, 1989).

[8] B. Lang, A parallel algorithm for reducing symmetric banded matrices to tridiagonal form, SIAM J. Sci. Stat. Comput. 14(6) (Nov. 1993) 1320-1338.

[9] C.L. Lawson, R.J. Hanson, D. Kincaid and F.T. Krogh, Basic linear algebra subprograms for FORTRAN usage, ACM Trans. Math. Soft. 5 (1979) 308-323.

[10] K. Murata and K. Horikoshi, A new method for the tridiagonalization of the symmetric band matrix, Information Processing in Japan 15 (1975) 108-112.