
Parallel Computing 17 (1991) 913-923, North-Holland

A parallel algorithm for the unbalanced orthogonal Procrustes problem *

Haesun Park

Computer Science Department, University of Minnesota, Minneapolis, MN 55455, USA

Received March 1990; revised December 1990

Abstract

Park, H., A parallel algorithm for the unbalanced orthogonal Procrustes problem, Parallel Computing 17 (1991) 913-923.

In this paper, a new iterative algorithm is presented for approximating a solution for the orthogonal Procrustes problem min ||AQ − B||_F with Q^T Q = I, where A ∈ R^{l×m}, B ∈ R^{l×n} and m > n. The new algorithm constructs a transformation Q by computing a sequence of plane rotations. Thus, it is simpler than the existing algorithms that use recursive singular value decomposition updates, and it can be efficiently implemented in a parallel environment. We give an analysis that shows how the new algorithm approximates the solution matrix Q and present an example for which the existing algorithm does not give the correct answer. Numerical experimental results show that the new algorithm has favorable convergence behavior. Finally, we discuss the parallel implementation of the algorithm on a parallel architecture.

Keywords. Least squares; orthogonal Procrustes problem; singular value decomposition; iterative algorithm; parallel implementation; linear array of processors.

1. Introduction

Suppose two matrices A ∈ R^{l×m} and B ∈ R^{l×n} are given, and we want to find the best least squares approximation of the target matrix B with AQ among all possible matrices Q ∈ R^{m×n} with orthonormal columns. In other words, we want to find Q ∈ R^{m×n} to achieve

    min ||AQ − B||_F                                                    (1.1)

where Q^T Q = I. This is called the orthogonal Procrustes problem in factor analysis [8]. When the two matrices A and B have the same number of columns, m = n, then since Q is orthogonal and

    ||AQ − B||_F^2 = trace(A^T A) + trace(B^T B) − 2 trace(Q^T A^T B),  (1.2)

we have only to find an orthogonal matrix Q that maximizes the value trace(Q^T A^T B). It is known that the solution is Q = UV^T when A^T B has the singular value decomposition (SVD) UΣV^T [4]. In this paper, we consider the case when the two matrices A and B have different numbers of columns, m > n, which we refer to as the unbalanced orthogonal Procrustes problem.

* This work was partly supported by National Science Foundation grant CCR-8813493.



Our interest in developing a new algorithm for the unbalanced orthogonal Procrustes problem is motivated by the problem of the direct filter design for wideband source localization in signal processing [10]. Full details of the problem can be found in [8,10].
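As a point of reference for the balanced case (m = n) recalled above, the solution Q = UV^T can be written in a few lines of NumPy. The sketch below is illustrative and not part of the paper; it forms A^T B explicitly, whereas Section 2 describes a more accurate implicit scheme.

```python
import numpy as np

def balanced_procrustes(A, B):
    """Balanced case (m = n): minimize ||A Q - B||_F over orthogonal Q.
    The minimizer is Q = U V^T, where U Sigma V^T is the SVD of A^T B [4]."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# small usage example with an exactly attainable target
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 4))
Q_true, _ = np.linalg.qr(rng.standard_normal((4, 4)))
B = A @ Q_true
Q = balanced_procrustes(A, B)
print(np.linalg.norm(A @ Q - B))   # ~0 up to roundoff
```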

A matrix Q ∈ R^{m×n} with m > n and orthonormal columns does not satisfy the relation trace(Q^T A^T A Q) = trace(A^T A) in general. Thus, in the unbalanced case, we have

    ||AQ − B||_F^2 = trace(B^T B) + trace(Q^T A^T A Q) − 2 trace(Q^T A^T B)    (1.3)

and we need to find a transformation Q that minimizes trace(Q^T A^T A Q) − 2 trace(Q^T A^T B). The existing algorithms [5,11] approximate the solution of the unbalanced problem by repeatedly solving balanced problems. Initially the matrix B is extended to an l × m matrix B̃ by padding m − n columns onto it. Then the following two steps are taken iteratively:
1. The balanced orthogonal Procrustes problem for the matrices A and B̃ is solved to yield an orthogonal matrix Q = (Q1 Q2), where Q1 ∈ R^{m×n} and Q2 ∈ R^{m×(m−n)}.

2. The last m − n columns of B̃ are replaced by AQ2.
Green and Gower [5,11] originally proposed that the matrix B be extended by padding it with m − n zero or random columns. Ten Berge and Knol [11] showed that padding the matrix B with AE, where the columns of the matrix E are the m − n eigenvectors associated with the m − n smallest eigenvalues of A^T A (which they called a conditional start), gives superior results in terms of sensitivity to local minima as well as in computation time. In any case, the above two steps involve recursive singular value decomposition updates, which are costly and not easy to restructure into a parallel algorithm with high efficiency. Another drawback of this approach is the larger dimension of B̃, which may cause great inefficiency, especially when m >> n. In Section 4, we present an example for which the existing algorithm with the zero padding or the conditional start does not converge while our algorithm does. We summarize the algorithm due to Green and Gower with the conditional start of Ten Berge and Knol in Algorithm 1, which has been the best existing algorithm.

Algorithm 1. Given A ∈ R^{l×m} and B ∈ R^{l×n} with m ≥ n, find Q ∈ R^{m×n} with Q^T Q = I that minimizes ||AQ − B||_F.
1. Find a matrix E = (w_m, w_{m−1}, ..., w_{n+1}) ∈ R^{m×(m−n)} whose columns are the right singular vectors of A associated with the m − n smallest singular values of A.
2. B̃ := (B  AE)
3. Repeat steps 4 and 5 until convergence:
4.   Compute the SVD of A^T B̃ = UΣV^T;
     Q := UV^T = (Q1  Q2), where Q1 ∈ R^{m×n} and Q2 ∈ R^{m×(m−n)}
5.   B̃ := (B  AQ2)
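The following NumPy sketch outlines Algorithm 1 under the conditional start. The function name, the residual-based stopping test, and the iteration cap are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def padded_procrustes(A, B, max_iter=100, tol=1e-10):
    """Sketch of Algorithm 1 (Green-Gower iteration with the conditional
    start of Ten Berge and Knol).  A is l-by-m, B is l-by-n, m >= n.
    The stopping rule (change in residual) is an illustrative choice."""
    n = B.shape[1]
    # Step 1: right singular vectors of A for the m-n smallest singular values.
    _, _, Vt = np.linalg.svd(A, full_matrices=True)
    E = Vt[n:, :].T                          # m x (m-n)
    # Step 2: conditional start, pad B with A E.
    B_ext = np.hstack([B, A @ E])
    prev = np.inf
    for _ in range(max_iter):
        # Step 4: balanced Procrustes problem for A and the extended target.
        U, _, Wt = np.linalg.svd(A.T @ B_ext)
        Qfull = U @ Wt                       # m x m orthogonal
        Q1, Q2 = Qfull[:, :n], Qfull[:, n:]
        # Step 5: replace the last m-n columns by A Q2.
        B_ext = np.hstack([B, A @ Q2])
        res = np.linalg.norm(A @ Q1 - B)
        if abs(prev - res) < tol:
            break
        prev = res
    return Q1
```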

In our approach, we construct the matrix Q ∈ R^{m×n} from a sequence of plane rotations. The rest of the paper is organized as follows. In Section 2, in order to develop an algorithm for the unbalanced orthogonal Procrustes problem, we first discuss an algorithm for computing the SVD of a product of two matrices due to Heath et al. [6], which gives the solution for the balanced Procrustes problem. This algorithm is based on the Jacobi algorithm for the SVD of a matrix and achieves high numerical accuracy by avoiding the explicit product. In Section 3, we show how we transform the unbalanced problem into an equivalent balanced problem where a new target matrix is only partially specified, and we develop a new efficient algorithm to approximate the solution for the unbalanced orthogonal Procrustes problem. The convergence of the new iterative algorithm is discussed in Section 4, where an example that illustrates the nonconvergence of Algorithm 1 is also given. Numerical experimental results using MATLAB on a SUN 3/80 are given in Section 5, and we illustrate an efficient parallel implementation on a linearly connected array of processors in Section 6.


2. The SVD of a product of two matrices

When a 2 × 2 rotation matrix X ∈ R^{2×2}, X = ( cos(θ)  sin(θ) ; −sin(θ)  cos(θ) ), is given, a plane rotation matrix J = R(m, i, j, X) of order m through the angle θ in the (i, j) plane differs from the identity matrix in the four elements at the intersections of rows and columns i and j:

    J_ii = cos(θ),    J_ij = sin(θ),
    J_ji = −sin(θ),   J_jj = cos(θ).    (2.1)
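As a small illustration (not part of the paper's presentation), a plane rotation R(m, i, j, X) as in (2.1) can be built by embedding the 2 × 2 block X into an identity matrix:

```python
import numpy as np

def plane_rotation(m, i, j, X):
    """Return R(m, i, j, X): an m x m identity with the 2 x 2 block X placed
    at the intersections of rows and columns i and j (0-based indices here)."""
    J = np.eye(m)
    J[np.ix_([i, j], [i, j])] = X
    return J
```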

Suppose we have two matrices A ∈ R^{l×m} and B ∈ R^{l×m} and wish to compute the SVD of the product P = A^T B ∈ R^{m×m}. The algorithm due to Heath et al. for computing the SVD of P [6] is an elegant application of Jacobi's algorithm for computing the SVD of a single matrix. The gist of their algorithm is that the product matrix P is never explicitly formed, and thus the algorithm achieves high numerical accuracy [6]. In their algorithm, we start with A^(1) = A and B^(1) = B, and at each step k, the SVD of a 2 × 2 submatrix in the (i, j) plane of P^(k) = A^(k)T B^(k) is computed. The four elements of P^(k) necessary to compute the rotations X1 and X2 that achieve

    X1^T ( p_ii^(k)  p_ij^(k) ) X2 = ( p_ii^(k+1)      0       )
         ( p_ji^(k)  p_jj^(k) )      (     0       p_jj^(k+1)  )

are computed via

    p_ij^(k) = a_i^(k)T b_j^(k),

where a_i^(k) and b_j^(k) denote the ith and jth columns of A^(k) and B^(k), respectively. The resulting rotations are then applied separately to A^(k) and B^(k) to obtain

    A^(k+1) = A^(k) R(m, i, j, X1)   and   B^(k+1) = B^(k) R(m, i, j, X2).
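A sketch of one (i, j) step of this implicit product-SVD update follows. It is a simplified reading of the scheme in [6], with 0-based indices and without the refinements a careful implementation would include.

```python
import numpy as np

def product_svd_step(A, B, i, j):
    """One Jacobi-like step on P = A^T B without forming P:
    build the 2x2 submatrix from column inner products, diagonalize it,
    and apply the two orthogonal factors to columns i and j of A and B.
    (X1, X2 are orthogonal; sign conventions aside, they play the role of
    the 2x2 rotations in the text.)"""
    P2 = np.array([[A[:, i] @ B[:, i], A[:, i] @ B[:, j]],
                   [A[:, j] @ B[:, i], A[:, j] @ B[:, j]]])
    X1, _, X2t = np.linalg.svd(P2)        # P2 = X1 diag(s) X2^T
    A[:, [i, j]] = A[:, [i, j]] @ X1      # A := A R(m, i, j, X1)
    B[:, [i, j]] = B[:, [i, j]] @ X2t.T   # B := B R(m, i, j, X2)
    return X1, X2t.T
```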

The above algorithm gives the solution Q for the orthogonal Procrustes problem when the two matrices A and B have the same number of columns, since the solution matrix Q is UV^T when the matrix P = A^T B has the singular value decomposition UΣV^T. For our new algorithm development, we explain the above algorithm from a different point of view. Suppose we have the balanced orthogonal Procrustes problem

    min ||A^(1) Q − B^(1)||_F,   where A^(1) = A, B^(1) = B ∈ R^{l×m}.    (2.2)

We can solve it by repeatedly choosing two indices (i, j) and finding the rotation that best matches columns i and j of the matrix A^(k) to the corresponding columns of the matrix B^(k). At each step k, the answer is the 2 × 2 rotation matrix X = X1 X2^T when the SVD of (a_i^(k)  a_j^(k))^T (b_i^(k)  b_j^(k)) is X1 D X2^T. The matrices A^(k) and B^(k) are updated as

    A^(k+1) = A^(k) J1^(k)   and   B^(k+1) = B^(k) J2^(k),

where J1^(k) = R(m, i, j, X1) and J2^(k) = R(m, i, j, X2), and the above process is repeated with the updated matrices A^(k+1) and B^(k+1). After convergence is achieved, the solution Q to any subproblem of min ||A^(k) Q − B^(k)||_F will be an identity matrix, and since

    ||A^(k) Q − B^(k)||_F = ||A^(1) J1^(1) J1^(2) ··· J1^(k) Q − B^(1) J2^(1) J2^(2) ··· J2^(k)||_F
                          = ||A^(1) J1^(1) ··· J1^(k) Q (J2^(1) ··· J2^(k))^T − B^(1)||_F,    (2.3)

we can take J1^(1) ··· J1^(k) (J2^(1) ··· J2^(k))^T as an approximation to the solution after k iterations. We point out that if we update only the matrix A^(k) from the right by J1^(k) J2^(k)T at each step

k, then we may fail to find the Q that minimizes ||A^(1) Q − B^(1)||_F, even though the two columns


of A^(k) are still best matched to the corresponding columns of B^(k) at each step k. The following example illustrates this. Let

    A = (  4  -3  -3 )        B = ( 1  0  0 )
        ( -3   4  -3 )            ( 0  1  0 )
        ( -3  -3   4 )            ( 0  0  1 ).

Then any relevant 2 × 2 submatrix of A^T B is symmetric positive definite, and it has the singular value decomposition X1 D X2^T with X1 = X2. Accordingly, only identity transformations X1 X1^T are generated, while the SVD of A^T B = UΣV^T does not satisfy the relation U = V, since A is not symmetric positive semi-definite. Thus the computed solution is the identity matrix, which gives the residual ||A − B||_F = 9, while the actual solution Q gives ||AQ − B||_F ≈ 8.544.
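A quick numerical check of this example (illustrative, not from the paper):

```python
import numpy as np

A = np.array([[ 4., -3., -3.],
              [-3.,  4., -3.],
              [-3., -3.,  4.]])
B = np.eye(3)

# Identity "solution" produced when only A is updated:
print(np.linalg.norm(A - B))             # 9.0

# True balanced Procrustes solution Q = U V^T from the SVD of A^T B:
U, s, Vt = np.linalg.svd(A.T @ B)
Q = U @ Vt
print(np.linalg.norm(A @ Q - B))         # about 8.544 (= sqrt(73))
```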

3. New algorithm

We develop a new algorithm for finding a solution to the unbalanced orthogonal Procrustes problem

    min ||AQ − B||_F   with Q^T Q = I,    (3.1)

where A ∈ R^{l×m}, B ∈ R^{l×n}, and m > n. We compute the solution Q by constructing a sequence of plane rotations. The use of plane rotations makes the new algorithm simple to manipulate and amenable to parallel implementation. We first transform the unbalanced orthogonal Procrustes problem into a balanced problem [8] with a new target matrix that is only partially specified. We consider the partially specified orthogonal Procrustes problem with the new target matrix B̂ ∈ R^{l×m} that has the same number of columns as the matrix A, where only the first n columns are specified; these are the n columns of the matrix B. Then we find an orthogonal matrix Q̂ ∈ R^{m×m} that best matches the first n columns of AQ̂ to those of B̂, i.e. we find Q̂ that minimizes

    f(S) = Σ_{i=1}^{l} Σ_{j=1}^{n} (s_ij − b̂_ij)^2 = Σ_{i=1}^{l} Σ_{j=1}^{n} (s_ij − b_ij)^2,   where S = AQ̂.    (3.2)

Then the first n columns of Q̂ are taken as the solution matrix Q for the unbalanced orthogonal Procrustes problem (3.1). Note that the original unbalanced problem (3.1) and the partially specified problem of finding the minimum of f(S) have the same solution. The idea of approximating Q̂ by a product of plane rotations for the partially specified Procrustes problem was first introduced by Browne [1,7]. Combining this idea with the algorithm of Section 2, we solve the problem (3.1). Let C ∈ R^{l×k} be a submatrix of the matrix A consisting of k columns of A, and let I_A(C) denote the ordered set of the indices of the columns of A that constitute C. We construct a sequence of subproblems so that every two columns of A are matched to the specified parts of the corresponding two columns of B̂, as follows. First, we form a matrix C ∈ R^{l×2} with two columns of the matrix A. Suppose that I_A(C) = {i, j} where 1 ≤ i < j ≤ m. We form another matrix D ∈ R^{l×2} with the ith and jth columns of the matrix B̂. If j ≤ n, then the ith and jth columns of B̂, which are the ith and jth columns of B, respectively, are both specified. Thus, we solve

    min ||CX − D||_F    (3.3)

for a 2 × 2 rotation matrix X. Since the matrices C and D have two columns each, this is a typical subproblem of the previous section. If 1 ≤ i ≤ n and n + 1 ≤ j ≤ m, then since only the ith column of B̂ is specified, which is the first column of the target matrix D, we solve

    min Σ_{i=1}^{l} (e_{i1} − d_{i1})^2    (3.4)


with E = CX for a 2 × 2 rotation matrix X. The problem (3.4) can be solved by finding a vector x = (c, s)^T, where c = cos(θ) and s = sin(θ) for some angle θ, that minimizes ||Cx − d||_2, where d is the ith column of the matrix B. For a solution to the problem (3.4), see [3,7]. When both i and j are larger than n, none of the corresponding columns of B̂ is specified, so we skip the computation. By sweeping through all possible subproblems repeatedly, we obtain the matrix Q̂ whose first n columns form the matrix Q. We summarize these steps in Algorithm 2. The subscript 1:n of a matrix denotes the first n columns of the matrix.
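Subproblem (3.4) is a two-variable least squares problem over the unit circle. The sketch below treats it as a one-dimensional search over the rotation angle; this is an illustrative substitute for the polynomial root-finding approach of [3,7,12], and the function name and grid size are assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def unit_circle_lsq(C, d, grid=3600):
    """Solve min ||C x - d||_2 over unit vectors x = (cos t, sin t)^T,
    where C has two columns (subproblem (3.4)).  Coarse grid search over
    the angle t followed by a bounded local refinement; a simple stand-in
    for the root-finding method of [3,7,12]."""
    def res(t):
        return np.linalg.norm(C @ np.array([np.cos(t), np.sin(t)]) - d)
    ts = np.linspace(0.0, 2.0 * np.pi, grid, endpoint=False)
    t0 = ts[np.argmin([res(t) for t in ts])]
    h = 2.0 * np.pi / grid
    t = minimize_scalar(res, bounds=(t0 - h, t0 + h), method='bounded').x
    return np.array([np.cos(t), np.sin(t)])
```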

Algorithm 2. Given A ∈ R^{l×m} and B ∈ R^{l×n} with m ≥ n, find Q ∈ R^{m×n} with Q^T Q = I that minimizes ||AQ − B||_F.
1. U := I_m; V := I_n
2. Repeat until convergence:
3. For i = 1 : n
   3.1. For j = i + 1 : n
        compute the SVD of (a_i  a_j)^T (b_i  b_j) = X1 Σ X2^T
        J1 := R(m, i, j, X1);  J2 := R(n, i, j, X2)
        A := A J1;  B := B J2
        U := U J1;  V := V J2
   3.2. For j = n + 1 : m
        solve min ||(a_i  a_j) x − b_i||_2 with x^T x = 1
        X1 := ( x(1)  −x(2) )
              ( x(2)   x(1) )
        J1 := R(m, i, j, X1)
        A := A J1
        U := U J1
4. Q := U_{1:n} V^T
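A compact, self-contained NumPy sketch of one possible realization of Algorithm 2 follows. The fixed sweep count and the crude grid search for subproblem (3.4) are illustrative assumptions; the paper solves (3.4) by the root-finding method of [3,7,12] and stops when a full sweep produces only null transformations.

```python
import numpy as np

def unbalanced_procrustes(A, B, sweeps=30):
    """Sketch of Algorithm 2: accumulate plane rotations so that the first n
    columns of the transformed A match B; A is l x m, B is l x n, m >= n.
    Returns Q = U_{1:n} V^T with orthonormal columns."""
    A, B = A.copy(), B.copy()
    m, n = A.shape[1], B.shape[1]
    U, V = np.eye(m), np.eye(n)
    for _ in range(sweeps):
        for i in range(n):
            for j in range(i + 1, n):          # step 3.1: both columns specified
                P2 = A[:, [i, j]].T @ B[:, [i, j]]
                X1, _, X2t = np.linalg.svd(P2)
                A[:, [i, j]] = A[:, [i, j]] @ X1
                B[:, [i, j]] = B[:, [i, j]] @ X2t.T
                U[:, [i, j]] = U[:, [i, j]] @ X1
                V[:, [i, j]] = V[:, [i, j]] @ X2t.T
            for j in range(n, m):              # step 3.2: only column i specified
                C, d = A[:, [i, j]], B[:, i]
                # crude angle grid; see the previous sketch or [3,7,12]
                # for a proper solver of min ||C x - d||_2, x^T x = 1
                ts = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
                k = np.argmin([np.linalg.norm(
                        C @ np.array([np.cos(t), np.sin(t)]) - d) for t in ts])
                c, s = np.cos(ts[k]), np.sin(ts[k])
                X1 = np.array([[c, -s], [s, c]])
                A[:, [i, j]] = A[:, [i, j]] @ X1
                U[:, [i, j]] = U[:, [i, j]] @ X1
    return U[:, :n] @ V.T                      # Q := U_{1:n} V^T
```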

We call one full iteration of Step 3 of Algorithm 2 a sweep. One sweep consists of n(2m − n − 1)/2 transformations. For example, the index pairs of columns of A and B generated by one sweep of Algorithm 2 when m = 5 and n = 3 are

    (12, 12) (13, 13) (14, 1) (15, 1) (23, 23) (24, 2) (25, 2) (34, 3) (35, 3),

where the first two integers in the parentheses denote the index pair of columns of A and the second two integers (or the single integer) denote the index pair of columns (or the single column) of B. One possible stopping criterion is to repeat the iteration until only null transformations are generated for one complete sweep. The new algorithm is similar to the Jacobi-type algorithms for the eigenproblem [2], and thus it has excellent inherent parallelism.
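For completeness, a few lines (illustrative, not from the paper) that generate this sweep ordering for general m and n:

```python
def sweep_pairs(m, n):
    """Index pairs (1-based) visited by one sweep of Algorithm 2."""
    pairs = []
    for i in range(1, n + 1):
        for j in range(i + 1, m + 1):
            pairs.append(((i, j), (i, j) if j <= n else (i,)))
    return pairs

print(sweep_pairs(5, 3))
# [((1,2),(1,2)), ((1,3),(1,3)), ((1,4),(1,)), ((1,5),(1,)), ((2,3),(2,3)), ...]
print(len(sweep_pairs(5, 3)))   # n(2m - n - 1)/2 = 9
```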

4. Convergence

If the two matrices A and B have the same number of columns, then Algorithm 2 clearly produces the orthogonal matrix Q that minimizes ||AQ − B||_F, since it computes UV^T when the SVD of A^T B is UΣV^T. To study the convergence properties of Algorithm 2, we define the residual r^(k) of the kth transformation as

    r^(k) = ||(A^(k))_{1:n} − B^(k)||_F = ||(A^(1) J1^(1) J1^(2) ··· J1^(k))_{1:n} − B^(1) J2^(1) J2^(2) ··· J2^(k)||_F.    (4.1)

We first show that the sequence of residuals {r^(k)} is non-increasing. To prove this, we consider the two residuals r^(k−1) and r^(k) of stages k − 1 and k, respectively. Assuming that I_A(C^(k)) = {i, j} and that

    r^(k−1) = ||(A^(k−1))_{1:n} − B^(k−1)||_F,    (4.2)


we divide the proof into two cases.

Case 1. When 1 ≤ i < j ≤ n. Then

    r^(k) = ||(A^(k−1) R(m, i, j, X1))_{1:n} − B^(k−1) R(n, i, j, X2)||_F,    (4.3)

and since X1 and X2 are chosen to minimize

    ||(a_i^(k−1)  a_j^(k−1)) X1 − (b_i^(k−1)  b_j^(k−1)) X2||_F,    (4.4)

and no column of A^(k−1) with index larger than n is changed in the kth step, clearly we have r^(k) ≤ r^(k−1).

Case 2. When 1 ≤ i ≤ n < j ≤ m. Then

    r^(k) = ||(A^(k−1) R(m, i, j, X1))_{1:n} − B^(k−1)||_F.    (4.5)

Since the first column x of X1 is chosen to minimize

    ||(a_i^(k−1)  a_j^(k−1)) x − b_i^(k−1)||_2,    (4.6)

and the jth (j > n) column of A^(k−1) R(m, i, j, X1) does not affect the residual r^(k), we have r^(k) ≤ r^(k−1).

Finally, we present an example for which our new algorithm, Algorithm 2, converges but Algorithm 1 fails to converge to the global minimum. Let

    A = ( 3  0  0 )        B = ( 3  0 )
        ( 0  2  0 )            ( 0  1 )
        ( 0  0  0 )            ( 0  0 ).

Algorithm 2 gives the global minimum ||AQ − B||_F = 0 with

    Q = ( 1    0   )
        ( 0   1/2  )
        ( 0  √3/2  ).

But according to Algorithm 1, we pad the matrix B with a zero vector, since Ae_3 = 0, where e_3 = (0 0 1)^T is the eigenvector associated with the eigenvalue 0 of A^T A. Thus, in this case, the conditional start due to Ten Berge and Knol is the same as padding the matrix with zero columns. Then Step 4 of Algorithm 1 produces the identity matrix

    Q = (Q1  Q2) = ( 1  0  0 )
                   ( 0  1  0 )
                   ( 0  0  1 )

and nothing changes in Step 5 of Algorithm 1 since AQ2 is a zero vector, yielding

    ||AQ − B||_F = 1   with   Q = ( 1  0 )
                                  ( 0  1 )
                                  ( 0  0 ).

In general, if rank(A) ≤ n, so that the m − n smallest singular values of A are all zero, then the conditional start of Ten Berge and Knol is the same as padding the matrix with zero columns.
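A quick numerical check of the two residual values quoted in this example (illustrative):

```python
import numpy as np

A = np.array([[3., 0., 0.],
              [0., 2., 0.],
              [0., 0., 0.]])
B = np.array([[3., 0.],
              [0., 1.],
              [0., 0.]])

Q_exact = np.array([[1., 0.],
                    [0., 0.5],
                    [0., np.sqrt(3) / 2]])
print(np.linalg.norm(A @ Q_exact - B))                   # 0.0 (global minimum)
print(np.linalg.norm(Q_exact.T @ Q_exact - np.eye(2)))   # orthonormal columns

Q_alg1 = np.eye(3)[:, :2]          # first two columns of the identity matrix
print(np.linalg.norm(A @ Q_alg1 - B))                    # 1.0
```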

5. Numerical experiments

In this section, we present the results of numerical experiments obtained with the two algorithms presented in this paper. The goal of the numerical experiments is to study the convergence behavior of the algorithms rather than to compare their efficiency or speed.


Table 1
Average (min, max) over 30 tests

  A        B        Algorithm 1: SVD's (min, max)    Algorithm 2: Sweeps (min, max)
  10 × 5   10 × 4   26.467 (15, 40)                  16.233 (4, 41)
  10 × 5   10 × 3   25.067 (18, 41)                  27.967 (9, 77)
  10 × 5   10 × 2   19.733 (11, 59)                  26.600 (9, 47)

Both algorithms were coded in MATLAB on a SUN 3/80 with machine precision 2.2204E−16. The test matrices were generated with random numbers in the interval [0, 1]

with uniform distribution. In each table that follows, the total number of SVD's needed to satisfy the required stopping criteria appears under Algorithm 1 and the total number of sweeps under Algorithm 2. With a preliminary QR decomposition, one sweep of Algorithm 2 consists of solving n(n − 1)/2 subproblems of type (3.3) and n(m − n) subproblems of type (3.4). Subproblems of type (3.4) are solved according to the algorithm in [3,7,12], which requires finding roots of polynomials. If we use the algorithm of Heath et al. [6] to compute a product SVD, then one sweep of Algorithm 1 consists of solving m(m − 1)/2 subproblems of type (3.3).

Table 2
Total sweeps needed for Algorithm 2 to achieve a residual at least as small as r^(s_3) and r^(s_10) of Algorithm 1

  A        B        Algorithm 1   Algorithm 2   Algorithm 1   Algorithm 2
                    SVD's         Sweeps        SVD's         Sweeps
  10 × 5   10 × 4   3             2             10             8
                    3             1             10             3
                    3             2             10             5
                    3             2             10             6
                    3             2             10             4
                    3             1             10             4
                    3             1             10            13
                    3             2             10             7
                    3             2             10             2
                    3             4             10             5
  10 × 5   10 × 3   3             1             10            20
                    3             1             10            10
                    3             1             10             4
                    3             1             10             4
                    3             1             10             2
                    3             2             10             5
                    3             1             10             8
                    3             1             10            13
                    3             1             10             7
                    3             1             10             3
  10 × 5   10 × 2   3             1             10            20
                    3             1             10             3
                    3             1             10             5
                    3             1             10             8
                    3             1             10            11
                    3             1             10            15
                    3             1             10             5
                    3             1             10            40
                    3             1             10            18
                    3             1             10             3


Entries of Table 1 are the averages of 30 tests on random matrix pairs (A, B), with the minimum and maximum in parentheses, for each pair of matrix sizes. First, Algorithm 1 is terminated when the difference between the residuals of two consecutive sweeps, |r^(s_k) − r^(s_{k−1})|, is less than 10^{-5}, where r^(s_i) denotes the residual at the end of sweep i. Then Algorithm 2 is stopped whenever the jth sweep yields r^(s_j) ≤ r^(s_k). For each test, Algorithm 2 successfully achieved the small residual that was produced as a result of the iterations in Algorithm 1.

For Table 2 we ran Algorithm 1 for a fixed number of iterations and checked how many sweeps Algorithm 2 requires to achieve residuals that are at least as small as those of Algorithm 1. We tested three and ten SVD updates for Algorithm 1 (Step 3), respectively, based on the claim of Sivanand and Kaveh [10] that, for their application, only two or three SVD iterations of Algorithm 1 give good enough results and more iterations do not necessarily improve the answer. We executed 10 tests for each pair of matrix sizes of A and B. We can see from Table 2 that, in most cases, Algorithm 2 rapidly reaches residuals that are less than those from Algorithm 1, requiring only very few sweeps. This illustrates that Algorithm 2 is a better alternative method for solving problems in signal processing applications, although our tests are conducted only on real matrices while some applications give rise to complex matrices. For real-time computation, the advantage of Algorithm 2 is even greater since it can be implemented on parallel architectures with high efficiency.

Table 3
Results of iterating until r^(s) < 10^{-5} when there is an exact fit

  A        B        Algorithm 1                              Algorithm 2
                    SVD's   ||AQ̂ − B||_F   ||Q − Q̂||_F      Sweeps   ||AQ̂ − B||_F   ||Q − Q̂||_F
  10 × 5   10 × 4    71     E-6            E-6               21       E-6            E-5
                     93     E-6            E-6               26       E-6            E-5
                     96     E-6            E-6               16       E-6            E-5
                    100     E-3            E-3               20       E-6            E-6
                    100     E-5            E-5               32       E-6            E-5
                    100     E-4            E-4                9       E-6            E-6
                    100     E-4            E-4                8       E-6            E-6
                    100     E-3            E-3               23       E-6            E-5
                    100     E-4            E-4               12       E-6            E-6
                    100     E-5            E-5                7       E-7            E-6
  10 × 5   10 × 3    82     E-6            E-6               69       E-6            E-5
                     83     E-6            E-6               44       E-6            E-5
                     93     E-6            E-6               17       E-6            E-6
                    100     E-4            E-4               13       E-6            E-5
                    100     E-3            E-3               24       E-6            E-5
                    100     E-3            E-3               17       E-6            E-5
                    100     E-3            E-3               35       E-6            E-5
                    100     E-3            E-3               16       E-6            E-6
                    100     E-4            E-4               10       E-6            E-6
                    100     E-4            E-4               14       E-6            E-6
  10 × 5   10 × 2    80     E-6            E-5               34       E-6            E-5
                    100     E-3            E-3               29       E-6            E-5
                    100     E-4            E-3                9       E-6            E-6
                    100     E-3            E-3               19       E-6            E-5
                    100     E-3            E-2               10       E-6            E-6
                    100     E-3            E-3               34       E-6            E-5
                    100     E-4            E-4               12       E-6            E-5
                    100     E-4            E-4               38       E-6            E-5
                    100     E-5            E-5               29       E-6            E-5
                    100     E-4            E-4               11       E-6            E-6


Fig. 1. Ring ordering scheme for n = 8.

Table 3 shows the results when solutions that give a residual of zero exist. A random matrix Q with orthonormal columns was generated from the QR decomposition of a random matrix; then B was constructed as AQ with a random matrix A. Both algorithms were executed until the residual was less than 10^{-5} or until the number of SVD's for Algorithm 1 or the number of sweeps for Algorithm 2 reached 100. Upon exit, the norm of the difference between the computed solution Q̂ and the actual solution Q, and the residual ||AQ̂ − B||_F, are computed. An entry of E−c indicates a value between 10^{−c} and 10^{−c+1}. We present 10 results for each pair of matrix sizes. We can see that Algorithm 2 is especially superior to Algorithm 1 in this case.

6. Parallel implementation

We discuss the parallel implementation of Algorithm 2 on a linear array of processors, which is similar to the implementation of the Jacobi algorithms for the SVD or the eigenvalue problems [2,9].

Fig. 2. One sweep on three processors for m = 12 and n = 6 (the column pairs processed by each processor at each stage).


For simplicity of discussion, we assume that the number of columns, m, of A is a multiple of the number of columns, n, of B, i.e. m = k1 n for some integer k1, which is always the case in the application to the direct filter design of spatio-temporal preprocessing [10]. Also, we assume that we have p processors and that n is an even multiple of p, n = 2 k2 p for some integer k2, and thus m = 2 k1 k2 p. We can use various orderings which have been developed for the parallel implementation of Jacobi algorithms [2]. We choose to illustrate the implementation with a ring ordering [2] on a ring-connected linear array of processors, where the two schemes illustrated in Fig. 1 are applied alternately.

After each processor is divided into two bins, we distribute the columns of A in wrap-around fashion: the columns are distributed over the first bins of all the processors and then over the second bins of all the processors [9]. Column i of B is stored in the same bin as column i of A, for 1 ≤ i ≤ n. Thereafter, columns of B always move around together with the corresponding columns of A. If we apply the ring ordering scheme of Fig. 1 to each bin as one unit of column block, then all the possible column pairs, (ij, ij) and (ij, i), for one sweep can be formed throughout 2p − 1 stages. If only those column pairs (ij, ij) and (ij, i) in which i is from one bin and j is from the other are processed in each processor at each stage, then there is no repetition within the 2p − 1 stages. To recover the missing pairs, which are formed with the columns in the same bin, we process every possible pair in each processor in the first stage of every sweep. Thus one sweep is completed in 2p − 1 stages with the same workload on each processor in each stage. The case of m = 12, n = 6 and p = 3 is illustrated in Fig. 2, which shows the column pairs to be processed by each processor in each stage. One can see that each processor has exactly the same amount of work in each stage, achieving good load balancing.
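A small sketch (an assumption about how the wrap-around distribution could be coded, not taken from the paper) of assigning the m columns to the two bins of each of the p processors:

```python
def distribute_columns(m, p):
    """Wrap-around distribution of columns 1..m over p processors, each with
    two bins: the first half of the columns fills bin 0 of every processor in
    round-robin order, the second half fills bin 1."""
    bins = [[[], []] for _ in range(p)]      # bins[proc][bin] -> column list
    half = m // 2
    for col in range(1, m + 1):
        proc = (col - 1) % p
        b = 0 if col <= half else 1
        bins[proc][b].append(col)
    return bins

print(distribute_columns(12, 3))
# the first processor holds columns [1, 4] and [7, 10]
```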

7. Remarks

We present examples which show that the existing algorithm for the unbalanced Procrustes problem can break down. The new algorithm presented here is shown to give good approximate solutions and is amenable to parallel computation. The convergence rate of the algorithm may be improved with preprocessing. For example, with a different initial rearrangement of the columns of the matrix B, Algorithm 2 requires a different number of sweeps. A theoretical characterization of the unbalanced orthogonal Procrustes problem as well as more extensive simulation on complex matrix pairs are needed.

Acknowledgement

The author would like to thank the referee, whose valuable suggestions improved the presentation of the paper, and Prof. F.T. Luk for valuable discussions.

References

[1] M.W. Browne, Orthogonal rotation to a partially specified target, British J. Math. Statist. Psychol. 25 (1972) 115-120.
[2] P.J. Eberlein and H. Park, Efficient implementation of Jacobi algorithms and Jacobi sets on distributed memory architectures, J. Parallel Distributed Comput., Special Issue on Hypercube Algorithms, 8 (1990) 358-366.
[3] G.E. Forsythe and G.H. Golub, On the stationary values of a second-degree polynomial on the unit sphere, J. Soc. Indust. Appl. Math. 13 (1965) 1050-1068.
[4] G. Golub and C. Van Loan, Matrix Computations (Johns Hopkins, Baltimore, MD, 1983).
[5] Green and Gower, A problem with congruence, presented at the Annual Meeting of the Psychometric Society, Monterey, California, 1979.
[6] M. Heath, A. Laub, C. Paige and R. Ward, Computing the singular value decomposition of a product of two matrices, SIAM J. Sci. Stat. Comput. 7 (1986) 1147-1159.
[7] F.T. Luk, Orthogonal rotation to a partially specified target, SIAM J. Sci. Stat. Comput. 4 (1983) 223-228.
[8] Mulaik, The Foundations of Factor Analysis (McGraw-Hill, New York, 1972).
[9] H. Park, Efficient diagonalization of oversized matrices on a distributed-memory multiprocessor, Ann. Oper. Res. 22 (1990) 253-269.
[10] S. Sivanand and M. Kaveh, Direct design of spatio-temporal preprocessing for direction estimation of wideband sources, presented at TENCON '89, Bombay, Nov. 1989.
[11] J.M.F. Ten Berge and D.L. Knol, Orthogonal rotations to maximal agreement for two or more matrices of different column orders, Psychometrika 49 (1984) 49-55.
[12] J.M.F. Ten Berge and K. Nevels, A general solution to Mosier's oblique Procrustes problem, Psychometrika 42 (1977) 593-600.