Efficient diagonalization of oversized matrices on a distributed-memory multiprocessor

Annals of Operations Research 22(1990)253-269 253

EFFICIENT DIAGONALIZATION OF OVERSIZED MAT~CES ON A DISTRIBUTED-MEMORY MULTIPROCESSOR

Haesun PARK*

Department of Computer Science, EE/CSci Bldg., University of Minnesota, 200 Union Street SE, Minneapolis, Minnesota 55455, USA

Abstract

On parallel architectures, Jacobi methods for computing the singular value decomposition (SVD) and the symmetric eigenvalue decomposition (EVD) have been established as one of the most popular algorithms due to their excellent parallelism. Most of the Jacobi algorithms for distributed-memory architectures have been developed under the assumption that matrices can be distributed over the processors by square blocks of an even order or column blocks with an even number of columns. Obviously, there is a limit on the number of processors while we need to deal with problems of various sizes. We propose algorithms to diagonalize oversized matrices on a given distributed-memory multiprocessor with good load balancing and minimal message passing. Performances of the proposed algorithms vary greatly, depending on the relation between the problem size and the number of available processors. We give theoretical performance analyses which suggest the faster algorithm for a given problem size on a given distributed-memory multiprocessor. Finally, we present a new implementation for the convergence test of the algorithms on a distributed-memory multiprocessor and the implementation results of the algorithms on the NCUBE/seven hypercube architecture.

1. Introduction

The Jacobi method has been one of the fastest algorithms for parallel computation of the singular value decomposition (SVD) and the symmetric eigenvalue decomposition (EVD). In this paper, we consider an important practical problem concerning the application of the parallel Jacobi methods to matrices that are over-

*This work was supported by National Science Foundation grant CCR-8813493. This work was partly done during the author's visit to the Mathematical Science Section, Engineering Physics and Mathematics Division, Oak Ridge National Laboratory, while participating in the Special Year on Num~ical Linear Algebra, 1988, sponsored by the UTK Departments of Computer Science and Mathematics, and the ORNL Algebra sponsored by the UTK Departments of Computer Science and Mathematics, and the ORNL Mathematical Sciences Section, Engineering Physics and Mathematics Division.

© J.C. Baltzer AG, Scientific Publishing Company

254 H. Park, Efficient diagonalization o f oversized matrices

sized relative to the number of available processors. In the ideal environment for developing Jacobi methods on a distributed-memory architecture, we have enough processors to distribute the matrix in a certain way: for a matrix of even order n there are n[2 x n[2 processors so that the matrix can be distributed by 2 x 2 submatrices or n/2 processors so that the matrix is distributed by column pairs over the processors. When the matrix order n is larger than twice the number of available processors p, then we say that the matrix is oversized. The problem of processing oversized matrices on a given fixed-sized array has been studied previously in the literature [ 2 , 1 1 - 1 3 ] . However, most of the investigation has been confined to matrices of which the orders are such that we can distribute them over the processors by square blocks having an even order or column blocks containing an even number of columns. When the above assumptions are not satisfied, the matrix is padded with zero columns to increase the column dimension in order to satisfy the assumptions. We first review the one- sided Jacobi methods [ 3 - 5 ] tailored for the efficient implementation of Jacobi algorithms on distributed-memory architectures. Then, we propose solutions to oversized problems and analyze the behavior of the solutions theoretically. The performance of each method varies greatly, depending on the relation between the matrix order n and the number of available processors p. The performance analyses suggest the faster technique for the given n and p. In section 5.2, we show the implementation results on the NCUBE/seven hypercube, which confirm the validity of the analysis and demonstrate that choosing a good scheme accelerates the performance greatly.

2. One - s ided Jacob i a lgor i thm

Computing the EVD of a symmetric n x n matrix A = (aij) via Jacobi methods, we repeatedly apply plane rotations to make the matrix converge to a diagonal form with eigenvalues of the matrix A on its diagonal. If we denote the Jacobi rotation in a plane (k, m) as J(k, m, 0), then the cosine and sine pair c = cos 0 and s = sin 0 annihilating the akr n and amk elements satisfy the following relation:

i . e •

( ) c o

- amk a m m / \ - s c ) = d2 '

akin (c 2 -- s 2) + (akk -- a m m ) C S = O . (2.1)

Many Jacobi algorithms differ essentially in the method of choosing the sequence of planes where the transformations take place. This sequence is called an ordering. Parallel Jacobi algorithms for the symmetric EVD are summarized in algorithm 1.

H. Park, Efficient diagonalization o f oversized matrices 255

ALGORITHM 1

Two-sided Jacobi: Given a symmetric n x n matrix A, compute its EVD.

(1) Choose an ordering and initialize V as the identity matrix I of order n.

(2) Repeat until convergence:

Sweep through the n(n - 1)/2 planes according to the chosen ordering:

2.1. Determine Jvia(2.1) 2.2. A p p l y J t o obtainA := j T A j 2.3. Accumulate J into V := VJ. []

The columns of the resulting matrix V are the computed eigenvectors and the diagonal elements of A are the computed eigenvalues.

For computing the SVD of a square matrix A, we can choose two Jacobi rotations J and K that rotate through the angles 01 and 02, respectively, satisfying the following relation to annihilate both akin and amk,

( °/ -Sl cl \ amk atom -s2 c2 d2

i.e. the relation

tan(01 + 02) = (amk + akin)[(atom - akk)

t an( -01 + 02) = (am k - akin)[(atom + a k k ) '

(2.2)

wherec i = cosO l a n d s i = sinO i ,1<~ i <~ 2. To implement the above algorithms on distributed-memory architectures, we

initially distribute the matrix by columns over the processors. The major cost of communication will be incurred if we explicitly update the rows in stage 2.2 by applying ,IT from the left of the matrix A, which requires rotation parameter passing to all the other processors since the matrix is distributed by columns. We can reduce this communication cost by implementing a one-sided variation of the Jacobi scheme [ 3 - 5 , 7 ] . In the one-sided Jacobi methods for the EVD, instead of applying the rotations from the left side of the matrix explicitly, we accumulate the rotation parameters and retrieve the matrix elements only when necessary.

ALGORITHM 2

One-sided Jacobi: Given a symmetric n x n matrix ,4 = A, compute its EVD.


(1) Choose an orderilag and initialize V as the identity matrix I of order n.

(2) Repeat until convergence:

Sweep through the n (n - 1)/2 planes according to the chosen ordering:

2.1. Determine Jvia (2.1) 2.2. Apply J to obtain .4 := A J 2.3. Accumulate J into V : = VJ.

(3) Compute the diagonal elements of A. []

In stage 2.1, we need to choose the rotation J so that the off-diagonal elements in the (k, m) and (m, k) positions of the matrix A are annihilated after the transforma- tion. At any stage, we do not have the matrix A, but the matrices .4 and V only. However, the relation (2.1) for computing the rotation parameters c and s shows that only three elements of A are necessary for computing J, which we can compute through inner products within each processor: akk = (O k, ~k), atom = (0 m, am ), and akin = (Ok, ~m )" Note that although three extra inner products are introduced, this scheme is much faster than the two-sided algorithm on a distributed-memory architecture like a hypercube since it avoids message passing for row updating, and is particularly effective if the nodes have vectorization capability.

The schemes we introduce to handle oversized problems for the EVD work exactly in the same way for the computation of the SVD. For a one-sided SVD computation, there is the well-known method by Hestenes [7]. We would like to emphasize that the rotation matrix J computed in the one-sided algorithm 2 is identical to J of the two-sided algorithm 1. If we use Hestenes' rotation parameters for computing the symmetric EVD, the method does not converge to a diagonal form for certain classes of matrices [4]. Readers interested in the parallel SVD computation can refer to [2].

3. A ring order ing

We first present a ring ordering developed by Eberlein [3]. In the next section, we extend this ordering to deal with the oversized problem. The ring ordering we present is optimal on a ring connected architecture in that it f'mishes one sweep in a minimum number of stages (n - 1 stages if n is even and n stages if n is odd), and each processor requires minimal communication in one sweep. Also, it requires only nearest-neighbor communication and is easily generalized for any order n. There exists other orderings with all these properties [5,8,9]. Although the schemes presented in this paper can be applied to other orderings, the details discussed here are for the ring ordering presented in this section.

We assume that for a matrix of an even order n we have n[2 ring connected processors so that two columns can be stored in each processor. For a matrix of an

H. Park, Efficient diagonalization of oversized matrices 257

odd order n, we assume (n - 1)/2 processors. In the following figures, each rectangle represents a processor and two numbers inside a rectangle represent the indices o f two columns allocated to the processors. We illustrate the ring ordering scheme for even n with the case n = 8 and p = 4 in fig. 1. Two schemes, I and II in fig. 1, are used alternately to generate the index set of each stage.

Fig. 1. Ring ordering scheme for n = 8 and p = n/2.

The ring ordering requires only one send and one receive per stage and finishes one sweep in n - 1 stages. The pattern returns to its original state after 2(n - 1) stages.

stage 1. (1,5) (2, 6) stage 2. (1,4) (5,6) stage 3. (1,6) (5, 7) stage 4. (1,3) (6, 7) stage 5. (1,7) (6, 8) stage 6. (1,2) (7,8) stage 7. (1,8) (7,4)

Fig. 2. One sweep of the ring

(3, 7) (4, 8) (2, 7) (3, 8) (2, 8) (3,4) (5, 8) (2, 4) (5, 4) (2, 3) (6, 4) (5, 3) (6, 3) (5, 2)

ordering for n = 8.

When n is odd, we can store one extra column in the left-most processor and apply the same scheme as in the even case assuming that the first two processors in fig. 1 are combined. We illustrate the scheme for n = 9 and p = 4 in fig. 3, where the index in the dotted box of the first processor represents the column that is not involved with any computat ion and the dashed arrow represents the column rearrange- ment within the first processor.

Fig. 3. Ring ordering scheme for n = 9 and p = (n - 1)/2.

258 H. Park, Efficient diagonalization of oversized matrices

When n is odd, one sweep is finished in n stages.

stage 1. 9 (1,5) (2,6) (3,7) (4,8) stage 2. 4 (9,5) (1,6) (2, 7) (3, 8) stage 3. 5 (9, 6) (1,7) (2,8) (3, 4) stage 4. 3 (5,6) (9,7) (1,8) (2, 4) stage 5. 6 (5,7) (9, 8) (1,4) (2, 3) stage 6. 2 (6,7) (5,8) (9, 4) (1,3) stage 7. 7 (6, 8) (5,4) (9, 3) (1,2) stage 8. 1 (7,8) (6,4) (5, 3) (9, 2) stage 9. 8 (7,4) (6, 3) (5,2) (9, 1)

Fig. 4. One sweep of the ring ordering for n = 9.

4. A l g o r i t h m s fo r overs ized p r o b l e m s

Using some examples, we will explain the notations used in the rest of the paper. The notation [1 23 4] * [5 6] denotes the set of all possible pairs between the indices in the set {1 2 3 4} and the set {5 6}; it denotes the set of pairs 1(1,5) (1,6) (2,5) (2,6) (3,5) (3,6) (4,5) (4,6)} of size 4 x 2 . The notation [1-2-3] * [4-5-6] denotes the union of the set [12 3] * [45 6] and a set ofaU possible index pairs out of the set {1 23} and out of the set {456}. Thus, it is the set of all possible index pairs out of the set { 1 2 3 4 56}, which is the set of pairs

{ (1,4) (1,5) (1,6) (2,4) (2,5) (2,6) (3,4) (3,5) (3,6) (1,2) (1,3) (2,3) (4,5) (4,6) (5,6)}

$ of size 6 x ~. The sets [1 23 4] * [56] and [1-2-3] * [4-5-6] are ordered sets, where the order is as given in the above examples. We will assume that the number of processors p is an even number since we are particularly interested in the hypercube implementation of the algorithm and it is possible to find a ring of even length in a hypercube. However, the schemes we introduce in this paper are not restricted to the case where p is even; they can be applied to any processor size p. Let Teom(n ) be the time for completing stages 2.1 to 2.3 of algorithm 2 for one pair of columns of length n, Tmsg(n) be the time for sending one vector of length n, and T s be the start- up time for communication.

In this section, we introduce two methods for applying parallel Jacobi algorithms to the oversized matrices. Three criteria we use in developing the methods are: (1) good load balancing should be achieved, (2) one sweep should finish without any repetition of any index pairs in as few stages as possible, and (3) the method should be easily applied for any given n and p with a systematic rule to generate the index pairs in each stage and reorganize them from one stage to the next. The first is the extension method, where we extend the column dimension by padding

H. Park, EfS~cient diagonalization of oversized matrices 259

the matrix with zero columns so that the matrix can be distributed by an equal number of columns over the processors, then apply the ring ordering scheme for even n on column blocks. The extension method can be subdivided according to the possible ways to distribute the extended matrix over the processors. The second is the residue method. In this method, instead of extending the column dimension as in the extension method, we distribute the columns over the processors by an equal number and store the rest in the first processor. We then apply the ring ordering scheme for odd n. The performance of each scheme varies greatly, depending on the relation between the matrix order n and the number of available processors p, which we will show at the end of this section.

4.1. EXTENSION METHOD

In this method, we pad the matrix with zero columns so that we can distribute the matrix over the processors by an equal number of columns. Let us first consider the simplest case where the matrix size is divisible by twice the number of processors. This assumption was made in most of the approaches for the oversized problems [ 1 1 - t 3 ] . After the matrix is distributed by an equal even number of columns onto the processors, we divide the columns within each processor into two bins and move the columns around by applying the ring ordering scheme for even n of section 3 .to column blocks. Figure 5 illustrates the case of n = 12 and p = 2 °

stage 1. [ - ~ ~ stage stage 3.

stage processor 1 processor 2 1 11-2-31"14-5-6 ] 17- -9J*[10-11-121 2 11 2 3]*[7 8 91 f44 6]*[z0 11 121 3 [I 2 31"I10 I1 12] Sl*[7 S 91

Fig. 5. One sweep of block ring ordering scheme for n = 12 and p = 2.

It is clear that every possible column pair is in the same processor at least once throughout 2p - 1 stages. Within each processor, if every possible column pair (i, ]), where i is from one bin and ] is from the other, is processed in each stage, then no pair will be repeated within 2p - 1 stages. The only missing pairs are the ones that can be formed within each bin. They can be recovered if they are processed in one stage of every sweep. Note that one sweep is finished in 2p - 1 stages with minimal communication.

When the matrix size is not divisible by twice the number of processors 2p, we pad the matrix with zero columns to make its column dimension divisible by 2p.

260 1-1. Park, Efficient diagonalization of oversized matrices

There are many ways to map the properly extended matrix onto the available processors. We consider two mappings: the strip mapping and the wrap mapping [10]. In the strip mapping, the matrix is partitioned into column blocks of the same size and distributed over the processors. In the wrap mapping, the columns are distributed over the processors in wrap around fashion. It is shown that the wrap mapping gives better load balancing in many other computational tasks [10]. Figure 6 shows the case where p = 2 and n = 9 for the strip mapping, where each 0 represents the column that is padded.

stagel ste s.3 Fig. 6. One sweep of the extension method with strip mapping for n = 9 and p = 2.

In stage2, while processor 1 spends 32 Teom(9 ) time for computation, processor 2 is idle, which gives poor load balancing. We can see that the problem may become worse as the matrix size becomes larger. The total computation time required for one sweep of the above example is

strip 5 Zeo m (9) = (6 x ] + 3 x 3 + 3x 3) Teom(9 ) = 33 Teom(9 ) .

In the wrap mapping, columns are distributed over the first bin of every processor and then the second bin of every processor, repeatedly until every column is stored. The wrap mapping gives better load balancing in Jacobi algorithms with the pairing scheme of fig. 5, which is shown in fig. 7.

stsge 1 . ~ stage 2 . ~ stage 3.

stsge processor 1 processor 2 21 [1-s'g]*la-7] 12-61"[4-81

[z s 9]*[2 el IS 7]*[4 8] s [z s 91"[4 s I [3 7]*[2 6]

Fig. 7. One sweep of the extension method with wrap mapping for n = 9 and p = 2.

The total computation time required for one sweep with the wrap mapping for the above example is

H. Pa r k , E f f i c i e n t d iagona l i za t i on o f o v e r s i z e d m a t r i c e s 261

wrap 4 Tco m (9) = (5 x ] + 3 x 2 + 3 x 2) Tco m (9) = 22 Tco m (9) .

Therefore, we choose to use the wrap mapping for better load balancing. The extension method can be summarized as follows. Given n and p, we

find the smallest nonnegative integer e that satisfies the relation

n + e = p m e (4.1)

for some positive integer m e. Then, e is the number of zero columns to pad the matrix with so that the matrix is distributed over the processors by an equal number of columns me . When m e is an even number, i.e. m e = 2 k e for some integer ke ,

columns are divided into two bins of the same size ke within each processor. When m e is an odd number, i.e. m e = 2 k e - 1 for some integer k e, columns are divided into two bins of sizes k e and Ice - 1. Then, the ring ordering scheme for even n of section 3 is applied to column blocks. In fig. 8, we show how the column blocks

stage I. ~ stage 2. ~ stage 3.

Fig. 8. Column distribution for one sweep for p = 2 and n = p ( k e l + ke2).

are shuffled when we apply the ring ordering scheme of section 3 for even n on the column blocks. The number inside each pair of parentheses represents the number

of columns in the bin, where kex = ke2 = Ice when me = 2ke and kel = Ice, ke2 = ke - 1

when me = 2 k e - 1. The maximum work load for each stage is summarized in table 1.

Table 1

Work load for one sweep of the extension method on p processors with n = p (kel + ke2)

Maximum work load

Stage Tcom(n) Tmsg(n) T s

1 ke2(2k e - 1) k e 1 2 1

2p - 2 k e 1

2p - 1 eke2 ke2 1


The maximum load for computation in stage i, 2 ~< i ~< 2p - 2, is ke 2 Too m (n) for even as well as odd me since there is at least one processor with 2ke columns, which dominates the work load. Thus, the total time for one sweep of the extension method for n = P(kel + ke2) on p processors, T[swe~(n) , can be measured as

T( eep(n) = ( k e 2 ( 3 k e - 1) + k2e(2P - 3 ) ) T c o m ( n )

+ (ke(2 p - 2) + ke2 ) Tmsg(n) + (2p - 1)T s. (4 .2)

4.2. RESIDUE METHOD

The residue method is different from the extension method in the fact that we do not pad the matrix with zero columns, i.e. we do not extend the column dimension. Instead, we distribute the columns over the processors by an equal number and then place the remainder in the first processor. In general, given n and p, we find a nonnegative integer r and mr satisfying the relation

n = p m r + r. (4.3)

Then, we distribute mr columns to each processor, except that the first processor gets th6 remaining r columns in addition to mr columns. In each processor, mr columns are divided into two bins as in the extension method, and then we apply the ring ordering scheme of section 3 for odd n on the column blocks. In the first stage o f every sweep, all possible pairs from two bins within each processor, thus m r ( m r - 1)/2 pairs, are processed. Additionally, in the first processor, there is one extra bin where r extra columns are stored initially, and in the first stage of every sweep, all possible pairs from the extra bin are also processed. In the subsequent stages, all possible pairs from different bins are processed in each processor. Note that one sweep takes 2p + 1 stages. In fig. 9, we show the case of n = 10 and p = 2, where mr = 4 and r = 2.

In fig. 10, we show how the column blocks are shuffled. The number inside each pair o f parentheses represents the number of columns in the bin, where k r l = k r 2 = k r when mr = 2kr and krl = kr, kr2 = k r - 1 when mr = 2 k r - 1, for some integer kr. The first stage is badly balanced unless r = 0 because of the r(r - 1)./2 extra pairs that the first processor has to process. In spite of poor load balancing in the first stage of every sweep, the residue method can perform much faster than the extension method since the load balancing in the extension method can be worse, depending on the relation between n and p. The maximum work load for each stage is summarized in table 2, from which we can see that better load balancing can be achieved if m r and r are chosen so that r is close to k r = mr./2. Thus, the total time for one sweep of the residue method for n = p ( k r l + kr2 ) + r on p processors, 7~sweep(n), can be estimated as

H. Park, Efficient diagonalization o f oversized matrices 263

r - ; ~ o : ; f - - ~ 7 - • • [-5-6"--~--34 ~ 78 I f -3 -4 " '~ " I '~ I--~--11 ~ - - x - ' ' ' ' I L - T - ' ' ~ " - - T - ' , ' ' '1

I 1 2 - ~ - T s H 58 I L--T--, , ~ - - ~

stage processor 1 i 1.2].13.4, 9-10 l

9 10]* 3 4 9 lO]'17 81 3 4]* 7 8 I 3 4]* 5 81

stage 5 . ~

I- -I I " - - T - ' ' ' '1

processor 2 5-6 * 7-8] 12 *l 7 S] I 2 "15 ~1 9 1 0 ' 5 6 9 1 0 " 1 2

Fig. 9. One sweep of the residue method for n = 10 and p = 2.

stage 3 . ~

Fig. 10. Column distribution for one sweep for p = 2 and n = P(krl + kr2) + r.

Tlrswe~p (n) = (k r2 (2k r - 1 + m a x ( k r, r)) + 2 k , max(kr2 , r) + r(r - 1)[2

+ (p - 2 )krmax(kr , r) + (t9 - 1)max(k~, kr2, r)) Teo m (n)

+ (2p + 1 )max(k r, r)Tmsg(n ) + (2p + 1)T~. (4.4)

264 1t. Park, Efficient diagonalization o f oversized matrices

Table 2

Work load for one sweep of the residue method on p processors

for n = P(kr l + kr2) +r

Maximum work load

Stage Teom(n) Tmsg(n) T 3

1 kr2(2k r - 1) + r ( r - 1)/2 max(kr, r ) 1

2 kr2 max (kr, r) max (k r, r) 1

3 m a x ( k r a, kr2r) m a x ( k r, r) 1

p + 1 max (k~, kr2 r) max (kr, r) 1

p + 2 k rmax(k r, r) max (k r, r) 1

2p - 1 k r m a x ( k r, r) m a x ( k r, r) 1

2p krmaX(kr2, r) m a x ( k r, r) 1

2p + 1 krmaX(kr2, r) max(kr2, r) l

4.3. COMPARISON OF THE TWO METHODS

The maximum work loads for one sweep of the two methods are compared in table 3, which can be used to predict the performances of the two methods for given n and p. For the residue method, (4.4) is simplified by taking the case kr2 = k r.

Table 3

Maximum work for one sweep of the block schemes

Maximum work load for one sweep

Tcom ( n ) Tmsg(n) T s

Extension ke2(3k e - 1) +k2 (2p - 3) ke(2 p - 2) +ke2 2p -- 1

Residue 2PkrmaX (k r, r) (2p + 1)max (kr, r) 2p + 1

+kr(2k r - 1 ) + r ( r - 1)/2

Table 4 shows the coefficients for Tcom(n), Tmsg(n), and T s for various matrix orders n and numbers of processors p, where kl = kel and k 2 = ke2 for the extension method and k I = krl and k2 = kr2 for the residue method. Note that the performance predictions for the two methods vary greatly, depending on the

11. Park, Efficient diagonalization of oversized matrices 265

Table 4

Performance prediction of the two methods

p n Method k 1 k 2 r T c o m ( n ) Tmsg(n) T s

4 79 Extension 10 10 790 70 7 Residue 9 9 7 822 81 9







relation between n and p. For instance, when p = 8 and n = 34, the residue method is expected to work much faster since it achieves perfect load balancing except for the first stage of every sweep, while the extension method requires column dimension expansion with six zero columns and does not achieve good load balancing in most of the stages since k~ :~ k2. However, in the case p = 16 and n = 70, the residue method would perform worse since the large number of extra columns assigned to the first processor would give poor load balancing in all stages.

5. Implementation on a distributed-memory multiprocessor

5.1. CONVERGENCE TEST

There are a number of schemes that can be used for testing convergence of the Jacobi algorithms for the SVD and the symmetric EVD computations on distributed-memory multiprocessors [6]. We introduce one of the more efficient among them. In sequential computation, the terminating criterion of

off(A (0) < tol off(A (1)) (5.1)


is often used, where

off(A(t)) ~ (0 2 10-1e = (a i j ) and t o l =

give good approximations of the eigenvalues and eigenvectors when double precision arithmetic is used [1]. We show how to implement (5.1), which is seemingly not easy to implement on a distributed-memory multiprocessor since it requires global information. The advantage of our implementation is that the computat ion and communication overhead is very small. The scheme is based on the following relations. The first is that the Frobenious norm of the matrix is preserved under orthogonal transformations, thus

ii A(1)IIF = II Z(OIIF (5.2)

for any l, where A (t) = f i t - 1)TA(t- 1)j(t- 1). The second is the relation

n n

off(A(O) = ~ (0 z (t) 2 h(1)ll~ (0 2 ( a l l ) = IIA(/)II~ - ~ = I1 (aii ) - Y (a . ) i~] i=1 i=1

(5.3)

In the first stage, every processor computes the sum of squares mysum of the columns of A (1) assigned to it:

n

rnysum = ~ ~ (uq'-(1))2 , t = 1

where the first summation is taken over all the columns of A (x) in the processor. In stage k, 2 ~< k ~< 2 p - I, each processor gets its neighbor's rnysum, accumulates it, and updates mysum so that within one sweep it has the value II AO)II~. We can simply augment the column vector size by one, attach the information, and send it around as the columns of the processors are interchanged, as we see in fig. 11. In fig. 12, for simplicity, we show only the migration of mysum for each processor i denoted as ci.

L o. I I

~sum I

Fig. 11. One typical vector in one processor


stage 1.

Fig. 12. Computation of off(A(t)).

By stage 6, every processor has received all ci's and so can have

4

ci, i.e. IIa(1)tl~. i=1

Repeating the process with the sum of squares of the diagonal elements assigned to each processor instead of the sum of squares of all the elements, each processor can have off(A 0)) by the end of the second sweep. Then from the third sweep on, by the end of the k th sweep, every processor has the value off of the matrix of the previous sweep in addition to off(A(1)). Thus, all the processors have the necessary information to stop the computation one sweep after actual convergence. However, iterating for one more sweep will only help the results.

5.2. RESULTS ON THE NCUBE/SEVEN HYPERCUBE

Our implementation results are obtained for the symmetric EVD on the NCUBE/seven hypercube. The code was written in FORTRAN 77. We performed a series of tests on up to 16 processors. The test matrices are generated randomly with a uniform distribution on the interval [ - 1 , I ] . We summarize our results in table 5. Speedup is computed based on timing of the Jacobi algorithm with the cyclic- by-rows ordering on one node. The speedup and the efficiency in percent, which is the speedup divided by the number of processors, are computed only for the faster of the two methods. Note that the theoretical and experimental timings agree very well. For example, when n = 98 and p = 16, the residue method took only about

268 H. Park, Efficient diagonalization of oversized matrices

Table 5

Performance of two methods on the NCUBE/seven

p n Method Time ( sec ) Speedup Efficiency (%)

4 79 Extension 137.740 3.525 88.12 Residue 143.690

4 81 Extension 174.988 Residue 151.310 3.459 86.48

8 33 Extension 7.678 Residue 5.609 5 A94 68.87



16 70 Extension 41.393 8.157 50.98 Residue 69.719 -

16 98 Extension 112.018 - Residue 75.978 13.837 86.48

68% of the time required for the extension method. Thus, given n and p, we can use the results of the theoretical analyses to predict the faster method.

Remarks

We have developed methods for solving the oversized problems, which finish one sweep in a minimum number of stages without any repetition of pairs. However, since good load balancing is often not achieved, we can relax this condition and make use of the idle time by repeating some pairs. For example, in stage 2 of fig. 7, since processor 2 has only four column pairs to process while processor 1 has six pairs, processor 2 may repeat two pairs, e.g. (3,4) and (3,8) after [3,7] * [4 ,8] , which would not require any extra time. The repeated pairs may change the convergence behavior of the methods greatly, especially when neither of the two methods give good load balancing, e.g. when p = 16 and n = 70.

Acknowledgement

The author is grateful to Professor P.J. Eberlein, Department of Computer Science and to Professor J.W. Mclver, Department of Chemistry, SUNY at Buffalo, for valuable discussions.


References

[1] M. Berry and A. Sarneh, Parallel algorithms for the singular value and dense symmetric eigenvalue problems, CSRD No. 761, University of Illinois (1988).

[2] R.P. Brent, F.T. Luk and C.F. Van Loan, Computation of the singular value decomposition using mesh-connected processors, L VLSI Computer Systems 1(1985)242- 270.

[3 ] P.J. Eberlein, On using the Jacobi method on the hypercube, in: Prec. 2rid Conf. on Hyper- cube Multiproce~sors, ed. M.T. Heath (1987 ) pp. 605 - 611.

[4] P.J. Eberlein, On one-sided Jacobi methods for parallel computation, SIAM J. Alg. Disc. Meth. 8(1987)790-796.

[5] P.J. Eberlein and H. Park, Efficient implementation of Jacobi algorithms and Jacobi sets on distributed memory architectures, J. Parall. and Dist. Comput., special issue on algorithms for hypercube computers, to appear.

[6] P.J. Eberlein and H. Park, Eigensystem computation on hypercube architectures, in: Pfoc. 4th Conf. on Hypercubes, Concurrent Computers, and Applications, to appear.

[7] M.R. Hestenes, Inversion of matrices by biorthogonalization and related results, J. Soc. Indust. Appl. Math. 6(1958)51 - 90.

[8] F.T. Luk and H. Park, On parallel Jacobi orderings, SIAM J. Sci. Statist. Comput. 10 (1989) 18 - 2 6 .

[9] F.T. Luk and H. Park, A proof of convergence for two parallel Jacobi SVD algorithms, IEEE Trans. on Computers 38(1989)806 - 811.

[10] D.P. O'Leary and G.W. Stewart, Data-flow algorithms for parallel matrix computations, Comm. ACM 28(1985)841- 853.

[11] R. Schreiber, Solving eigenvalue and singular value problems on an undersized systolic array, SIAM J. Sci. Stat. Comput. 7(1986)441-451.

[12] D.S. Scott, M.T. Heath and R.C. Ward, Parallel block Jacobi eigenvalue algorithms using systolic arrays, Lin. Alg. and its Appl. 77(1986)345- 355.

[13] C. Van Loan, The block Jacobi method for computing the singular value decomposition, Report TR 85-680, Department of Computer Science, CorneU University (1985).

Documents

Efficient diagonalization of oversized matrices on a distributed-memory multiprocessor