
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 14, 90-96 (1992)

On Multiple Error Detection in Matrix Triangularizations Using Checksum Methods

HAESUN PARK*

Computer Science Department, University of Minnesota, Minneapolis, Minnesota 55455

We introduce a unified checksum method for multiple transient error detection in three different matrix triangularizations: the LU decomposition, Gaussian elimination with pairwise pivoting, and the QR decomposition. We first develop the theoretical background for multiple error detection in matrix triangularizations, and summarize the results by proving that we can detect all the transient errors that occur in a maximum of t different columns by introducing t checksum vectors. A floating-point error analysis, to determine the effects of the rounding errors in using the checksum method for multiple error detection and correction, is also presented. © 1992 Academic Press, Inc.

1. INTRODUCTION

Algorithm-based fault tolerance [1, 6, 7] provides an efficient means to implement fault tolerant matrix computations on systolic array architectures for signal and image processing applications. By modifying the algorithm so that it can process the encoded input data, a transient error can be detected and corrected using the checksum scheme through simple computations on the results. The consequence is a method that incurs very low overhead and uses simple arithmetic operations. For the analyses of hardware and computational overhead for the algorithm-based fault tolerance, see [6, 7]. The idea of encoding the data is similar to that in the theory of error correcting codes [3] except that the encoding is done at the word level. The properties that are invariant under the computations are used to check the existence of the transient error. Many checksum schemes have been developed for achieving fault tolerant matrix operations such as matrix-matrix multiplication [1, 6, 7], LU decomposition and matrix inversion [1, 6, 7], and matrix equation solvers [8]. Luk and Park [10] extended an existing checksum approach to Gaussian elimination with pairwise pivoting [13] and the QR decomposition by Givens rotations and showed a way to compute the correct decomposition from the erroneous one by factorization updating schemes [5]. The same authors also analyzed the effects of rounding errors [14] on the checksum scheme and established tolerances for single transient error detection for matrix triangularizations [11]. Various systolic arrays have been proposed for matrix triangularizations [2, 4, 9] and the algorithm-based fault tolerance techniques we propose are applicable to all these architectures.

* This work was supported in part by National Science Foundation Grant CCR-8813493.

The study of fault models for a single transient error has been active. However, it is unrealistic to expect that only one transient error occurs during the entire process of computation. There is no doubt of the need for developing checksum schemes to deal with multiple transient errors. In this paper, we extend Luk and Park's unified checksum scheme for fault tolerant matrix triangularizations [10] (the LU decomposition, Gaussian elimination with pairwise pivoting, and the QR decomposition) for single transient error to a unified scheme for detecting multiple transient errors. We develop a linear algebraic approach for multiple error detection in matrix triangularizations using checksum schemes. The results are summarized and proved as a theorem showing that by introducing t checksum columns, we can detect all the transient errors that occur in a maximum of t different columns in the matrix triangularizations. We suggest a block checksum method as an alternative and discuss its advantages and limitations. A floating-point error analysis to determine the effects of the rounding errors in using checksum methods for multiple error detection is presented. Finally, we introduce examples that show that even one transient error can make the corrected results by factorization updates useless due to rounding errors.

2. CHECKSUM SCHEME

We will briefly review the checksum scheme of Luk and Park [10] for detecting, locating, and correcting a single transient error in matrix triangularizations. Three triangular decompositions are represented by a unified decomposition, $A = ZU$, where $U$ is an upper triangular matrix, and $Z$ is a lower triangular matrix $L$ for LU decomposition, a possibly full matrix $X$ for Gaussian elimination with pairwise pivoting, and an orthogonal matrix $Q$ for the QR decomposition. To detect and locate a transient error, the $n \times (n+2)$ checksum matrix, $A_w = (A \mid Af \;\; Ag)$, is used, where $f = (1, 1, \ldots, 1)^T$ and $g = (g_1, g_2, \ldots, g_n)^T$, with $g_i > 0$ and $g_i \ne g_j$ for $i \ne j$. The ZU decomposition of the matrix $A_w$ yields $A_w = Z(U \mid c \;\; v) = ZU_w$, where $c = Uf$ and $v = Ug$. In the presence of a single transient error in the $j$th column, regardless of the time when the transient error occurs, the initial matrix can be represented as an erroneous one by the error model $\bar{A} = A - qe_j^T$ for some vector $q$. Assuming that the error does not occur in checksum vectors, the ZU decomposition of the erroneous checksum matrix $\bar{A}_w$ yields $\bar{A}_w = (\bar{A} \mid Af \;\; Ag) = \bar{Z}(\bar{U} \mid \bar{c} \;\; \bar{v}) = \bar{Z}\bar{U}_w$, where $\bar{c} = \bar{U}f + \bar{Z}^{-1}q$ and $\bar{v} = \bar{U}g + g_j\bar{Z}^{-1}q$.

0743-7315/92 $3.00. Copyright © 1992 by Academic Press, Inc. All rights of reproduction in any form reserved.

We assume that there is a mechanism to check each elementary transformation for the validity of its operational properties; e.g., the multiplier in each elementary transformation of Gaussian elimination with pairwise pivoting has absolute value not exceeding one, and $\bar{Z}$ is orthogonal in the case of the QR decomposition. The presence of a transient error is detected by computing the checksum difference vector $r = \bar{c} - \bar{U}f \;(= \bar{Z}^{-1}q)$, and the column index $j$ is determined from the second checksum difference vector $s = \bar{v} - \bar{U}g \;(= g_j\bar{Z}^{-1}q)$ and the relation

$$s = g_j r. \tag{1}$$

Denoting the $j$th columns of $A$ and $\bar{U}$ by $a_{\cdot j}$ and $\bar{u}_{\cdot j}$, respectively, and since $A = \bar{Z}(\bar{U} + pe_j^T)$, where $p = \bar{Z}^{-1}a_{\cdot j} - \bar{u}_{\cdot j}$, a factorization update scheme [5] is used to get the correct decomposition from the erroneous one.
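The detection and location steps reviewed above can be sketched for the LU case as follows. This is a minimal illustration in exact rational arithmetic, not the systolic implementation of [10]; the function names, the `fault` injector, and the test matrix are assumptions made for the example.

```python
from fractions import Fraction

def triangularize(rows, fault=None):
    """LU-style elimination (no pivoting) applied to whole rows, checksum columns included.

    fault = (step, i, j, delta) is a hypothetical fault injector: just before
    elimination step `step`, delta is added to entry (i, j) of the working matrix.
    The fault is assumed to hit a position that has not been eliminated yet.
    """
    M = [[Fraction(x) for x in row] for row in rows]
    n = len(M)
    for k in range(n - 1):
        if fault is not None and fault[0] == k:
            _, i, j, delta = fault
            M[i][j] += delta
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]          # multiplier (an entry of L)
            M[i] = [x - m * y for x, y in zip(M[i], M[k])]
    return M

def encode(A, g):
    """Append the checksum columns Af and Ag, with f = (1, ..., 1)^T."""
    return [list(row) + [sum(row), sum(a * w for a, w in zip(row, g))] for row in A]

def check(U_w, g):
    """Return (r, s, j): checksum difference vectors and the located column index (0-based) or None."""
    n = len(U_w)
    r = [row[n] - sum(row[:n]) for row in U_w]
    s = [row[n + 1] - sum(u * w for u, w in zip(row[:n], g)) for row in U_w]
    if not any(r):
        return r, s, None                  # consistent checksums: no error detected
    i = next(idx for idx, x in enumerate(r) if x != 0)
    return r, s, g.index(s[i] / r[i])      # relation (1): s = g_j r

A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
g = [1, 2, 3]
clean = check(triangularize(encode(A, g)), g)
faulty = check(triangularize(encode(A, g), fault=(1, 2, 2, 5)), g)
```

With the fault placed in the third column, the ratio of the two checksum differences equals the weight g_3 = 3, so `faulty[2]` is the 0-based index 2. The same check also locates a fault in a multiplier position, since the ratio depends only on the column.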

3. ERROR PROPAGATION PATTERN

Even when there occurs at most one transient error in the entire computation procedure, many elements in the resulting matrices $\bar{Z}$ or $\bar{U}_w$ could be erroneous, since the transient error will affect the ensuing computation. In the ZU decomposition, the operations performed are combinations of the following elementary row operations: (1) multiply a row by a scalar, (2) interchange two rows, and (3) add two rows.

We assume that the resulting zeros in the lower triangular part of the matrix $U$ are automatically set to zeros in the process of computation. Thus, the erroneous resulting matrix $\bar{U}$ is always upper triangular but the zeros in the lower triangular part might be of dubious validity. Even if an error occurs in the matrix $Z$, i.e., while computing a multiplier or a rotation parameter, it can be considered as an error due to an erroneous element in the matrix being triangularized.

We discuss the patterns of error propagation in the ZU decomposition, which are illustrated using a 5 × 5 matrix. The original transient error is represented as E, the propagated error as e, the final element of the upper triangular matrix as u, and the rest are represented as x. First, we consider the LU decomposition. The transient error stays where it first occurs in the matrix for the rest of the computation and it is propagated down or to the right until it hits the diagonal and then to all subsequent elements down and to the right in $\bar{A}_w$. For example, if the transient error occurs in the process of computing an element that would belong to U, then the error propagates as in Fig. 1a. If the transient error occurs at an element that would belong to L, then the error propagates as in Fig. 1b. We can point out the position (i, j) where the transient error E occurred by finding the first inconsistent row (say i) and the ratio of checksum entries.

$$\text{(a)}\ \left(\begin{array}{ccccc|cc} u&u&u&u&u&u&u\\ 0&u&u&E&u&u&u\\ 0&0&u&e&u&u&u\\ 0&0&0&e&u&u&u\\ 0&0&0&e&e&e&e \end{array}\right) \qquad \text{(b)}\ \left(\begin{array}{ccccc|cc} u&u&u&u&u&u&u\\ 0&u&u&u&u&u&u\\ 0&0&u&u&u&u&u\\ 0&E&e&e&e&e&e\\ 0&0&0&e&e&e&e \end{array}\right)$$

FIG. 1. (a) Error in U and (b) error in L.

Unlike in the LU decomposition, we cannot pinpoint the initial location of the transient error in Gaussian elimination with pairwise pivoting or in the QR decomposition. This is because an error in a row of larger index may affect a row of smaller index when two rows are coupled in Gaussian elimination with pairwise pivoting and in the QR decomposition. We show how an error propagates in the QR decomposition in the following example. The error propagation pattern of Gaussian elimination with pairwise pivoting is similar.

(1) The transient error E occurs at the (4,2) element after the (4,l) element is annihilated.

(2) When E is annihilated, the rotation parameter is computed based on E, thus the error propagates on rows 2 and 4.

(3) The rest of the computation results in propagated errors.

As we have seen in the above example, the transient error E moves only along one column and this column is the sole cause for the inconsistent checksum, since only row operations are performed. Note that the checksum vectors are also affected by the transient error. When there are multiple transient errors, they follow the above patterns independent of each other.

4. MULTIPLE ERROR DETECTION IN MATRIX TRIANGULARIZATIONS

The checksum methods have been well developed for fault tolerant matrix triangularizations in the presence of a single transient error on systolic arrays [10, 11]. However, it is unrealistic to expect only a single transient error in the entire process of computation. For systolic array computations, periodic checking is not easily accomplished. Moreover, the checking frequency would have to be at least as high as the rate of error occurrence. We will briefly review the checksum scheme that Jou and Abraham [7] proposed for multiple error detection in a data matrix, where the idea is based on the theory for error correcting codes [3]. Then we will extend the idea for multiple error detection in matrix triangularizations and show the relations between the number of transient errors and the number of necessary checksum columns for guaranteed multiple error detection. We will call a matrix on which no operation is performed a data matrix or initial matrix.

For a given $n \times n$ matrix $A$, define an $n \times (n+t)$ generator matrix $G$ as

$$G = (I_n \;\; W^T), \tag{2}$$

and a $t \times (n+t)$ checksum-check matrix $H$ as

$$H = (W \;\; {-I_t}), \tag{3}$$

where $I_n \in R^{n \times n}$ is an identity matrix and $W \in R^{t \times n}$. Define an $n \times (n+t)$ coded matrix $A_{w(t)} = AG$, and define the code space of $H$ as $\mathrm{code}(H) = \{v \mid Hv = 0\}$. Then each row of the coded matrix $A_{w(t)}$ is in code(H). If we choose the matrix $W$ so that every set of $d$ ($\le t$) columns from the matrix $H$ is linearly independent, i.e., $H$ is of distance $d+1$, then $H$ can be used to detect a maximum of $d$ errors in a data matrix. Specifically, for the existence of up to $d$ errors, a row of $\bar{A}_{w(t)}$ is in code(H) if and only if it is error free [3, 7]. When there are errors in the matrix $A$ and it becomes an erroneous matrix, which we denote by $\bar{A}$, we define the erroneous coded matrix of $A$ and $\bar{A}$ as $\bar{A}_{w(t)} = (\bar{A} \mid AW^T)$. We can actually find a $t \times n$ matrix $W$ so that every set of $t$ columns from the matrix $H$ is linearly independent, thus we can detect $t$ errors. In fact, for an $n \times n$ data matrix $A$, up to $n$ errors can be detected using one checksum column if none of the errors occur on the same row. This is clear since for each row, we have one checksum element that can detect an error on the corresponding row (there is no error propagation since no computation is involved). We now develop linear algebraic models for multiple error detection in matrix triangularizations.

LEMMA 1. For an $n \times n$ data matrix $A$, a maximum of $n \times t$ transient errors can be detected using a generator matrix $G = (I_n \;\; W^T)$ with $W \in R^{t \times n}$ if not more than $t$ errors occur in any one row.

Proof. We can choose a $t \times n$ matrix $W$ so that every $t$ columns from the checksum-check matrix $H = (W \;\; {-I_t})$ are linearly independent. Then each row of the coded matrix $\bar{A}_{w(t)}$ is in code(H) if and only if it is error free, for the existence of a maximum of $t$ errors. Since there are $n$ such rows, a maximum of $n \times t$ errors can be detected using $t$ checksum columns unless more than $t$ errors occur on any one row. ∎

From Lemma 1, we can state that for an n x n data matrix A, all the errors on t erroneous columns can be detected using t checksum columns.
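The data-matrix check behind Lemma 1 can be sketched as follows. The Vandermonde-type choice of W, with distinct positive weights, is an assumption made for this illustration (the text only asserts that a suitable W exists), and the function names are hypothetical.

```python
from fractions import Fraction

def vandermonde_w(g, t):
    """t x n matrix W with rows (g_1^k, ..., g_n^k), k = 0..t-1 (an assumed choice of W)."""
    return [[Fraction(gi) ** k for gi in g] for k in range(t)]

def encode(A, W):
    """Rows of the coded matrix A G = (A | A W^T)."""
    return [list(row) + [sum(a * w for a, w in zip(row, wk)) for wk in W] for row in A]

def syndrome(row, W):
    """H row^T for H = (W  -I_t); it is zero exactly when the row lies in code(H)."""
    n = len(W[0])
    return [sum(w * a for w, a in zip(wk, row[:n])) - row[n + k] for k, wk in enumerate(W)]

def erroneous_rows(coded, W):
    """Indices of the rows that are not in code(H)."""
    return [i for i, row in enumerate(coded) if any(syndrome(row, W))]

A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
W = vandermonde_w([1, 2, 3], t=2)
coded = encode(A, W)
coded[1][0] += 1      # two errors in the same row whose plain sum cancels
coded[1][1] -= 1
flagged = erroneous_rows(coded, W)
```

Here the first checksum (plain row sum) is blind to the two cancelling errors, while the weighted second checksum exposes the row, which is why t checksum columns are needed for up to t errors per row.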

LEMMA 2. If two vectors $v$ and $w$ are in code(H), then $\alpha v + \beta w$ is also in code(H) for any scalars $\alpha$ and $\beta$.

Proof. This is trivial since $H(\alpha v + \beta w) = 0$. ∎

LEMMA 3. Suppose an $n \times n$ matrix $A$ has its triangularization $A = ZU$. Then the ZU decomposition of the coded matrix $A_{w(t)} = (A \mid AW^T)$ yields $Z(U \mid UW^T) = ZU_{w(t)}$.

Proof. $A_{w(t)} = (A \mid AW^T) = (ZU \mid AW^T) = Z(U \mid Z^{-1}AW^T) = Z(U \mid UW^T) = ZU_{w(t)}$. ∎
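Lemma 3 can be checked numerically on a small case. The sketch below uses elimination without pivoting in exact arithmetic; the matrix A and the three checksum vectors in W are arbitrary values assumed for the illustration.

```python
from fractions import Fraction

def eliminate(rows):
    """Row-reduce the first n columns to upper triangular form, carrying all columns along."""
    M = [[Fraction(x) for x in row] for row in rows]
    n = len(M)
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            M[i] = [x - m * y for x, y in zip(M[i], M[k])]
    return M

def times_wt(M, W):
    """Return M W^T for a list-of-rows matrix M and checksum rows W."""
    return [[sum(a * w for a, w in zip(row, wk)) for wk in W] for row in M]

A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
W = [[1, 1, 1], [1, 2, 3], [1, 4, 9]]            # t = 3 checksum vectors (assumed values)
coded = [row + chk for row, chk in zip(A, times_wt(A, W))]
U_w = eliminate(coded)
n = len(A)
U = [row[:n] for row in U_w]
checks = [row[n:] for row in U_w]
lemma3_holds = checks == times_wt(U, W)          # checksum columns equal U W^T
```

Because only row operations are applied, the checksum columns of the result are exactly U W^T, which is the invariant the detection scheme relies on.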

Detecting errors in matrix triangularizations is not as simple as detecting errors in the data matrix because of the computations involved. As we have seen in error propagation patterns in Section 3, even one transient error in the matrix $A$ may contaminate a large part of the resulting factor $U$ (consider the case when the error occurs in the (2,1) element of $A$). An important observation is that the checksum vectors are also affected by the transient errors and become propagated errors. Accordingly, the inconsistent checksum vectors are exclusively accounted for by the transient errors and the propagated errors in the same columns as the transient errors. We assume that $A^{(0)} = A$ and each step of computation is explained with the equation $A^{(i+1)} = Z^{(i)}A^{(i)}$, where $Z^{(i)}$ denotes the appropriate transformation at the $i$th time step.

LEMMA 4. Suppose that two transient errors occur in the $(i_l, j_l)$ positions of the matrix during triangularization, $1 \le l \le 2$. Then the erroneous result is the same as the result of the correct triangularization of the erroneous initial data matrix $\bar{A}^{(0)}$ with erroneous columns $j_l$, $1 \le l \le 2$, regardless of the time steps of the error occurrences.

Proof. Assume that two transient errors of size $\gamma_l$ occur at the $k_l$th step of the computation, in the $(i_l, j_l)$ position, $1 \le l \le 2$. Then we can represent the erroneous matrices of time steps $k_1$ and $k_2$ as

$$\bar{A}^{(k_1)} = A^{(k_1)} - \gamma_1 e_{i_1}e_{j_1}^T \quad\text{and}\quad \bar{A}^{(k_2)} = \tilde{A}^{(k_2)} - \gamma_2 e_{i_2}e_{j_2}^T,$$

respectively, assuming that $k_1 \le k_2$ without loss of generality. Since

$$\bar{A}^{(k_1)} = T_1A^{(0)} - \gamma_1 e_{i_1}e_{j_1}^T, \quad\text{where } T_1 = Z^{(k_1-1)}Z^{(k_1-2)} \cdots Z^{(0)},$$

and

$$\bar{A}^{(k_2)} = T_2\bar{A}^{(k_1)} - \gamma_2 e_{i_2}e_{j_2}^T, \quad\text{where } T_2 = \bar{Z}^{(k_2-1)} \cdots \bar{Z}^{(k_1)},$$

the initial erroneous matrix $\bar{A}^{(0)}$ can be represented as $\bar{A}^{(0)} = A^{(0)} - q_1e_{j_1}^T - q_2e_{j_2}^T$, where $q_1 = \gamma_1T_1^{-1}e_{i_1}$ and $q_2 = \gamma_2(T_2T_1)^{-1}e_{i_2}$. ∎

We can easily generalize Lemma 4 and conclude that if $t$ transient errors occur in the $(i_l, j_l)$, $1 \le l \le t$, positions of the matrix during triangularization, then the erroneous results can be considered as a consequence of the erroneous columns $j_l$, $1 \le l \le t$, of the initial matrix $\bar{A}^{(0)}$.

LEMMA 5. Assume that the initial matrix $\bar{A}^{(0)}$ has a maximum of $t$ erroneous columns, and that there occurs no transient error during triangularization of the erroneous initial matrix $\bar{A}^{(0)}$. Then the errors in the resulting matrix $\bar{U}$ can be detected using $t$ checksum columns.

Proof. We can find a $t \times n$ matrix $W$ so that every set of $t$ columns of the checksum-check matrix $H = (W \;\; {-I_t})$ is linearly independent. Then, each row of $\bar{A}^{(0)}_{w(t)} = (\bar{A}^{(0)} \mid A^{(0)}W^T)$ is in code(H) if and only if it is error free, for the existence of a maximum of $t$ errors. Suppose there is at least one error in the initial matrix, and we get $\bar{Z}(\bar{U} \mid \bar{Z}^{-1}A^{(0)}W^T) = \bar{Z}\bar{U}_{w(t)}$ after the triangularization of the erroneous coded matrix $\bar{A}^{(0)}_{w(t)} = (\bar{A}^{(0)} \mid A^{(0)}W^T)$. If all the rows of $\bar{U}_{w(t)}$ belong to code(H), then from Lemma 2, all the rows of $\bar{Z}\bar{U}_{w(t)}$ also belong to code(H) since $\bar{Z}$ is a product of elementary row operations. Then this implies that

$$(W \;\; {-I_t})(\bar{Z}\bar{U}_{w(t)})^T = (W \;\; {-I_t})(\bar{A}^{(0)}_{w(t)})^T = 0.$$

Thus, we can conclude that each row of the matrix $\bar{A}^{(0)}_{w(t)}$ is either error free or has more than $t$ errors from Lemmas 2 and 3, a contradiction. ∎

LEMMA 6. If the transient errors occur in one column of the matrix during triangularization, then they can be detected using one checksum column regardless of the number of errors.

Proof. Assume that $k$ errors occur in column $j$ during triangularization. From Lemma 4, the erroneous initial matrix $\bar{A}^{(0)}$ can be represented as $\bar{A}^{(0)} = A^{(0)} - \sum_{i=1}^{k} q_ie_j^T = A^{(0)} - qe_j^T$. Since one column in the initial data matrix is erroneous regardless of the number of transient errors, it can be detected using one checksum column by Lemmas 1, 2, and 5. If the errors on the same column cancel each other out, we cannot detect them since the vector $q$ would be zero. However, this is no problem since we have $\bar{A}^{(0)} = A^{(0)}$, thus errors correct themselves in this case. ∎

Finally, we summarize the above results in the following theorem.

THEOREM 7. For matrix triangularizations, using $t$ checksum vectors, we can detect all the transient errors occurring in a maximum of $t$ different columns.

Proof. From Lemmas 4 and 6, the erroneous result with the errors occurring in $t$ different columns during triangularization can be considered as a result of the correct triangularization of the erroneous initial matrix with $t$ erroneous columns. The errors in the result can be detected using $t$ checksum columns, from Lemma 5. ∎

As we have shown in Lemma 5, as a result of a triangularization on an erroneous coded matrix $\bar{A}^{(0)}_{w(t)} = (\bar{A}^{(0)} \mid A^{(0)}W^T)$, we get factors

$$\bar{Z}(\bar{U} \mid \bar{Z}^{-1}A^{(0)}W^T) = \bar{Z}\bar{U}_{w(t)}.$$

To detect the existence of transient errors, we check whether each row of $\bar{U}_{w(t)}$ is in code(H). As a way to perform this, we define the checksum difference matrix $D_{w(t)}$ as

$$D_{w(t)} = \bar{Z}^{-1}A^{(0)}W^T - \bar{U}W^T. \tag{4}$$

Since

$$(W \;\; {-I_t})(\bar{U}_{w(t)})^T = W(\bar{U}^T - A^{(0)T}\bar{Z}^{-T}) = 0$$

if and only if $(\bar{U} - \bar{Z}^{-1}A^{(0)})W^T = -D_{w(t)} = 0$, every row of $\bar{U}_{w(t)}$ is in code(H) if and only if the checksum difference matrix $D_{w(t)}$ is zero.

It is difficult to apply the checksum methods for location of multiple transient errors in the matrix triangularizations, let alone for correction of those errors. In the case of one transient error, it is possible to locate the error from the relation (1) between the checksum difference vectors. In the case of multiple transient errors, there is no simple relation like (1) among the checksum difference vectors that allows the location of the transient errors, as we see in the following example.

Consider the encoded matrix $A_{w(2)}$

If a transient error makes the multiplier l(2,1) become 3 and another transient error causes the element a(2,2) to be updated as -1, then we have

and the checksum check matrix

Even when two transient errors occur in the same positions as above, the columns in the matrix $D_{w(2)}$ may have completely different relations from the above: consider the case when the first transient error makes the multiplier l(2,1) become 2 and the other transient error causes the element a(2,2) to be updated as -2; then the checksum difference matrix

00 D w(2) =

iI 22 . 5.5

The situation would be the same even if we introduce more checksum vectors. Our first example also illustrates that as a result of transient errors that cancel each other out occurring on the same row, the first checksum differ- ence vector loses the ability to detect the errors.
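The role of the checksum difference matrix (4) with two transient errors can be sketched as follows. The test matrix, the fault positions and sizes, and the function names are assumptions made for this illustration; they are not the matrices of the example above. The two faults are injected at different time steps, in different columns of the same row, and are chosen so that their effects cancel in the unweighted checksum.

```python
from fractions import Fraction

def triangularize(rows, faults=()):
    """Elimination without pivoting; each fault (step, i, j, delta) adds delta to
    entry (i, j) just before elimination step `step` (a hypothetical injector)."""
    M = [[Fraction(x) for x in row] for row in rows]
    n = len(M)
    for k in range(n - 1):
        for step, i, j, delta in faults:
            if step == k:
                M[i][j] += delta
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            M[i] = [x - m * y for x, y in zip(M[i], M[k])]
    return M

def checksum_difference(U_w, W):
    """D_{w(t)} as in (4): computed checksum columns minus U W^T."""
    n = len(U_w)
    return [[row[n + k] - sum(u * w for u, w in zip(row[:n], wk))
             for k, wk in enumerate(W)] for row in U_w]

A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
W = [[1, 1, 1], [1, 2, 3]]                       # t = 2 checksum vectors
coded = [row + [sum(a * w for a, w in zip(row, wk)) for wk in W] for row in A]

faults = [(0, 2, 1, 2), (1, 2, 2, -2)]           # two errors, columns 2 and 3 (1-based)
D = checksum_difference(triangularize(coded, faults), W)
detected = any(any(row) for row in D)
first_column_blind = not any(row[0] for row in D)
```

In this run the first column of D is entirely zero while the second is not, so a single checksum vector would have accepted the erroneous result, in line with Theorem 7 requiring t checksum vectors for errors in t different columns.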

As an alternative to the method discussed so far for multiple error detection, we can use the block checksum method [6, 12]. In the block checksum method, the matrix is divided into submatrices of column blocks and by introducing two checksum vectors for each block we can detect, locate, and correct a single error in each block of the matrix. But, it has the limitation that only one error is allowed in each block, thus the transient errors are allowed only in a restricted pattern. Also correcting by factorization updates has the following drawbacks: it is difficult to implement in a parallel environment [5] and it could be numerically unreliable as we will show in the next section.

5. THE EFFECTS OF ROUNDING ERRORS

In the previous section, the checksum method is theoretically proved to guarantee multiple error detection with careful choice of checksum vectors in triangularizations. But in actual computation, because of the inevitable existence of rounding errors, detecting transient errors becomes a nontrivial matter: we cannot expect that the checksum difference matrix $D_{w(t)}$ is zero even when there is no transient error. Thus we need to establish a tolerance $\tau$ so that the result is accepted to be transient error free when the checksum difference matrix satisfies the relation $\|D_{w(t)}\| \le \tau$. Small transient errors are difficult to distinguish from rounding errors especially in the LU decomposition and Gaussian elimination with pairwise pivoting [11]. Using a large tolerance in error detection means that large transient errors can be bypassed and regarded simply as rounding errors. We first review the analyses in [11] and suggest tolerances for multiple transient error detection. Then we show that though large transient errors are detectable, correcting them by factorization updates could be rendered useless in the LU decomposition because of rounding errors.

The machine precision is denoted as $\varepsilon$, which is the smallest positive number such that $fl(1 + \varepsilon) \ne 1$, where $fl$ denotes the floating-point operation. We use the hat symbol "^" to represent the floating-point results. Assuming $fl(A) = A$, consider the floating-point coded matrix $\hat{A}_{w(t)} = (A \mid fl(AW^T))$. After triangularization of $\hat{A}_{w(t)}$, we get

$$\hat{A}_{w(t)} \Longrightarrow (\hat{U} \mid \hat{C})$$

for some matrix $\hat{C} \in R^{n \times t}$. Let $w$ and $\hat{c}$ be the $j$th columns of $W^T$ and $\hat{C}$, $1 \le j \le t$, respectively. Then from the results in [11], the $j$th column $\hat{d}$ of the floating-point checksum difference matrix $\hat{D}_{w(t)}$ is given by

$$\hat{d} = fl(\hat{U}w - \hat{c}) = I_\varepsilon(\hat{Z}^{-1}(Fw - f_A - f_Z) + f_U), \tag{6}$$

where $I_\varepsilon = \mathrm{diag}(1 + \varepsilon_1, \ldots, 1 + \varepsilon_n)$, with $|\varepsilon_i| \le \varepsilon$, $1 \le i \le n$, the vector $f_A$ satisfies

$$fl(Aw) = Aw + f_A, \qquad \|f_A\| \le 1.06\,\varepsilon n\,\|A\|\,\|w\|, \tag{7}$$

the vector $f_U$ satisfies

$$fl(\hat{U}w) = \hat{U}w + f_U, \qquad \|f_U\| \le 1.06\,\varepsilon n\,\|\hat{U}\|\,\|w\|, \tag{8}$$

the vector $f_Z$ satisfies the relation $\hat{Z}^{-1}(fl(Aw) + f_Z) = \hat{c}$, and $F$ is the matrix such that the exact triangularization of the matrix $A + F$ gives factors $\hat{Z}$ and $\hat{U}$. Then from (6), we get the inequality

$$\|\hat{d}\| \le \|I_\varepsilon\|\left(\|\hat{Z}^{-1}\|\left(\|Fw\| + \|f_A\| + \|f_Z\|\right) + \|f_U\|\right). \tag{9}$$


It was shown in [11] that for single error detection, the tolerance $\tau$ is necessarily large in the case of Gaussian elimination (with and without pivoting) since $\|\hat{Z}^{-1}\|$ can be large. The same results hold for multiple error detection. In the case of the QR decomposition, we have

$$\|F\|_F \le \eta s(1 + \eta)^{s-1}\|A\|_F \tag{10}$$

and

$$\|f_Z\|_2 \le \eta s(1 + \eta)^{s-1}\|A\|_F\|w\|_2(1 + 1.06\,\varepsilon n), \tag{11}$$

where we may take $\eta = 7.5\varepsilon$ and $s = 3n - 5$. Thus from (7)-(11),

$$\|\hat{d}\|_2 \le (1 + \varepsilon)\|A\|_F\|w\|_2\left(\eta s(1 + \eta)^{s-1}(2 + 1.06\,\varepsilon n) + 2(1.06\,\varepsilon n)\right) \tag{12}$$

and we may choose

$$\tau = 50\,n\varepsilon\,\|A\|_F\|w\|_2 \tag{13}$$

as a tolerance for the $j$th column. For multiple transient error detection, we can use $t$ tolerances $\tau_j$, $1 \le j \le t$, with $t$ checksum columns. Thus the detection procedure for $t$ checksum columns and $W^T = [w_1, \ldots, w_t]$ becomes: if $|\hat{d}_{ij}| \le \tau_j$ for all $i$ and $j$, where $\tau_j$ is the $\tau$ in (13) computed with $w = w_j$, then accept the results as transient error free.
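The acceptance test with per-column tolerances can be sketched for the QR decomposition by Givens rotations as follows. This is an illustration in IEEE double precision, with Python's `sys.float_info.epsilon` standing in for the machine precision; the constant in the tolerance is the one read from (13), and the matrix, the checksum vectors, the function names, and the way the error is injected (as a perturbation of the initial data, which Lemma 4 shows is equivalent to an error during the computation) are assumptions.

```python
import math
import sys

def givens_triangularize(rows):
    """Reduce the first n columns to upper triangular form with Givens rotations
    applied to whole rows, so the checksum columns are rotated along."""
    M = [list(map(float, row)) for row in rows]
    n = len(M)
    for j in range(n - 1):
        for i in range(n - 1, j, -1):
            a, b = M[i - 1][j], M[i][j]
            rho = math.hypot(a, b)
            if rho == 0.0:
                continue
            c, s = a / rho, b / rho
            upper = [c * x + s * y for x, y in zip(M[i - 1], M[i])]
            lower = [-s * x + c * y for x, y in zip(M[i - 1], M[i])]
            M[i - 1], M[i] = upper, lower
            M[i][j] = 0.0                   # annihilated entry is set to zero
    return M

def accept(A, W, corrupt=None):
    """Return True if every |d_ij| is within the tolerance tau_j of its checksum column."""
    n = len(A)
    eps = sys.float_info.epsilon
    norm_A = math.sqrt(sum(x * x for row in A for x in row))      # Frobenius norm
    coded = [list(row) + [sum(a * w for a, w in zip(row, wk)) for wk in W] for row in A]
    if corrupt is not None:                 # hypothetical error in the data part only
        i, j, delta = corrupt
        coded[i][j] += delta
    U_w = givens_triangularize(coded)
    for k, wk in enumerate(W):
        tau = 50 * n * eps * norm_A * math.sqrt(sum(w * w for w in wk))
        for row in U_w:
            d = sum(u * w for u, w in zip(row[:n], wk)) - row[n + k]
            if abs(d) > tau:
                return False
    return True

A = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]]
W = [[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]]
ok_clean = accept(A, W)
ok_faulty = accept(A, W, corrupt=(1, 0, 1e-3))
```

For this well-scaled matrix the rounding errors stay far below the tolerance, while an error of size 1e-3 is rejected; errors whose size is comparable to the tolerance would pass undetected, as the text notes.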

Finally, we show that though large transient errors are detectable, correcting them by factorization updates could be rendered useless in the LU decomposition because of rounding errors. We assume floating-point arithmetic on a decimal computer with a word length of five.

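EXAMPLE 1. Consider the system

$$Ax = \begin{pmatrix} -1.0 \times 10^{-4} & 1.0 \\ 1.0 & 1.0 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1.0 \\ 0.0 \end{pmatrix}$$

whose exact solution is $x_1 = -0.99990\ldots$ and $x_2 = 0.99990\ldots$. The coded matrix

$$A_{w(2)} = \left(\begin{array}{cc|cc} -1.0 \times 10^{-4} & 1.0 & 9.9990 \times 10^{-1} & 1.9999 \\ 1.0 & 1.0 & 2.0 & 3.0 \end{array}\right)$$

has the LU decomposition with the upper triangular factor

$$U_{w(2)} = \left(\begin{array}{cc|cc} -1.0 \times 10^{-4} & 1.0 & 9.9990 \times 10^{-1} & 1.9999 \\ 0.0 & 1.0001 \times 10^{4} & 1.0001 \times 10^{4} & 2.0002 \times 10^{4} \end{array}\right).$$

Suppose that the (2,1) element of $A$ is changed to $1.0 \times 10^{2}$. Then we have

$$\bar{U}_{w(2)} = \left(\begin{array}{cc|cc} -1.0 \times 10^{-4} & 1.0 & 9.9990 \times 10^{-1} & 1.9999 \\ 0.0 & 1.0 \times 10^{6} & 9.9990 \times 10^{5} & 1.9999 \times 10^{6} \end{array}\right).$$

From

$$r = \bar{c} - \bar{U}f = \begin{pmatrix} 0.0 \\ -1.0 \times 10^{2} \end{pmatrix} \quad\text{and}\quad s = \bar{v} - \bar{U}g = \begin{pmatrix} 0.0 \\ -1.0 \times 10^{2} \end{pmatrix},$$

we conclude that the (2,1) element is erroneous. For the error correction, we compute the vector

$$p = \bar{L}^{-1}a_{\cdot 1} - \bar{u}_{\cdot 1} = \begin{pmatrix} 1.0 & 0.0 \\ 1.0 \times 10^{6} & 1.0 \end{pmatrix}\begin{pmatrix} -1.0 \times 10^{-4} \\ 1.0 \end{pmatrix} - \begin{pmatrix} -1.0 \times 10^{-4} \\ 0.0 \end{pmatrix} = \begin{pmatrix} 0.0 \\ -9.9 \times 10^{1} \end{pmatrix}$$

and we get the corrected decomposition by the factorization update:

$$A = \bar{L}(\bar{U} + pe_1^T) = \bar{L}\begin{pmatrix} 1.0 & 0.0 \\ 9.9 \times 10^{5} & 1.0 \end{pmatrix}\begin{pmatrix} -1.0 \times 10^{-4} & 1.0 \\ 0.0 & 1.0 \times 10^{4} \end{pmatrix} = \begin{pmatrix} 1.0 & 0.0 \\ -1.0 \times 10^{4} & 1.0 \end{pmatrix}\begin{pmatrix} -1.0 \times 10^{-4} & 1.0 \\ 0.0 & 1.0 \times 10^{4} \end{pmatrix}.$$

Solving the system using the updated factors, we get $x_1 = 0.0$ and $x_2 = 1.0$. The value of $x_1$ has no digits of accuracy. The idea of using the factorization updating schemes for the error correction [10] was proposed as an improvement over computation rollback [1, 6, 7] since computation rollback is expensive to perform on systolic arrays. But, as we have seen, the factorization update schemes may produce useless answers in the case of the LU decomposition.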
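The arithmetic of Example 1 can be replayed with Python's `decimal` module set to five significant digits. This is a sketch under the assumption that round-to-nearest five-digit decimal arithmetic models the "word length of five"; the variable names are illustrative.

```python
from decimal import Decimal as D, getcontext

getcontext().prec = 5   # five significant decimal digits, as assumed in Example 1

a = [[D("-1e-4"), D(1)], [D(1), D(1)]]   # the correct matrix A
b = [D(1), D(0)]
g = [D(1), D(2)]

# Coded matrix (A | Af  Ag); then the transient error hits the (2,1) entry of A.
rows = [r + [r[0] + r[1], r[0] * g[0] + r[1] * g[1]] for r in a]
rows[1][0] = D("1e2")

# LU step on the erroneous coded matrix.
m = rows[1][0] / rows[0][0]                      # erroneous multiplier, -1.0e6
u1 = rows[0]
u2 = [x - m * y for x, y in zip(rows[1], rows[0])]

# Checksum differences of the second row (the first row is consistent).
r2 = u2[2] - (u2[0] + u2[1])
s2 = u2[3] - (u2[0] * g[0] + u2[1] * g[1])

# Second component of p = L^{-1} a_1 - u_1 (forward substitution), then the update.
p2 = (a[1][0] - m * a[0][0]) - u2[0]
m_upd = (u2[0] + p2) / u1[0]                     # re-eliminate the updated (2,1) entry
u22 = u2[1] - m_upd * u1[1]
l21 = m + m_upd

# Solve L y = b and U x = y with the updated factors.
y2 = b[1] - l21 * b[0]
x2 = y2 / u22
x1 = (b[0] - u1[1] * x2) / u1[0]
```

The run reproduces r = s in the second row (so the ratio is g_1 = 1 and the first column is located), the update vector component -99, and the updated factors of the example; the computed x1 is zero although the exact value is about -0.9999. The loss comes from u22 being recovered as 1.0e4 instead of 1.0001e4 after the large erroneous intermediate value 1.0e6 has absorbed the low-order digits.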

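EXAMPLE 2. The QR decomposition and Gaussian elimination with pairwise pivoting are reliable algorithms in practice. However, the following example shows that the error correction scheme for Gaussian elimination with pairwise pivoting may also give an answer having no digits of accuracy. We can easily come up with an analogous example for the QR decomposition. Consider the system

$$Ax = \begin{pmatrix} 1.0 \times 10^{1} & 1.0 \\ 1.0 & 1.0 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 9.0 \\ 0.0 \end{pmatrix}$$

whose exact solution is $x_1 = 1$ and $x_2 = -1$. Gaussian elimination with pairwise pivoting on

$$A_{w(2)} = \left(\begin{array}{cc|cc} 1.0 \times 10^{1} & 1.0 & 1.1 \times 10^{1} & 1.2 \times 10^{1} \\ 1.0 & 1.0 & 2.0 & 3.0 \end{array}\right)$$

gives

$$U_{w(2)} = \left(\begin{array}{cc|cc} 1.0 \times 10^{1} & 1.0 & 1.1 \times 10^{1} & 1.2 \times 10^{1} \\ 0.0 & 9.0 \times 10^{-1} & 9.0 \times 10^{-1} & 1.8 \end{array}\right).$$

Suppose that the (2,1) element of $A$ is changed to $1.0 \times 10^{5}$. Then we have

$$\bar{U}_{w(2)} = \left(\begin{array}{cc|cc} 1.0 \times 10^{5} & 1.0 & 2.0 & 3.0 \\ 0.0 & 9.999 \times 10^{-1} & 1.1 \times 10^{1} & 1.2 \times 10^{1} \end{array}\right).$$

From

$$r = \bar{c} - \bar{U}f = \begin{pmatrix} -1.0 \times 10^{5} \\ 1.0 \times 10^{1} \end{pmatrix} \quad\text{and}\quad s = \bar{v} - \bar{U}g = \begin{pmatrix} -1.0 \times 10^{5} \\ 1.0 \times 10^{1} \end{pmatrix},$$

we have the relation $s = 1 \times r$ and conclude that the first column is erroneous. For the error correction, we compute the vector $p = \bar{X}^{-1}a_{\cdot 1} - \bar{u}_{\cdot 1}$ and get the corrected decomposition by the factorization update:

$$A = \bar{X}(\bar{U} + pe_1^T) = \bar{X}\begin{pmatrix} 0.0 & 1.0 \\ 1.0 & 0.0 \end{pmatrix}\begin{pmatrix} 1.0 \times 10^{1} & 9.999 \times 10^{-1} \\ 0.0 & 1.0 \end{pmatrix} = \begin{pmatrix} 1.0 & 1.0 \times 10^{-4} \\ 0.0 & 1.0 \end{pmatrix}\begin{pmatrix} 1.0 \times 10^{1} & 9.999 \times 10^{-1} \\ 0.0 & 1.0 \end{pmatrix}.$$

Solving the system using the updated factors, we get $x_1 = 0.9$ and $x_2 = 0.0$, but $x_2$ has no digits of accuracy.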

For Gaussian elimination with pairwise pivoting and the QR decomposition, the situation can be greatly improved: replace the erroneous column, say column $j$, by $Z^{-1} a_{\cdot j}$, instead of computing the vector $p$ and $U + p e_j^T$ before the factorization update. Error correction with factorization updating has another barrier in a systolic array environment: it is difficult to implement, since the updating algorithm is intrinsically sequential [5].
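The column replacement suggested above can be sketched on the data of Example 2. This is a minimal illustration under stated assumptions, not the implementation of the scheme: the correct column $a_{\cdot j}$ is assumed to be still available, the right-hand side $b = (9, 0)^T$ is inferred from $A$ and the stated exact solution, $Z^{-1}$ is obtained by carrying an identity block through the elimination, and `np.linalg.solve` stands in for the factorization update that would retriangularize the modified matrix. The helper name is hypothetical.

```python
import numpy as np

def pairwise_pivot_triangularize(M):
    """Gaussian elimination with pairwise (neighbor) pivoting on the rows of M."""
    M = M.astype(float).copy()
    n = M.shape[0]
    for k in range(n - 1):
        for i in range(n - 1, k, -1):
            if abs(M[i, k]) > abs(M[i - 1, k]):
                M[[i - 1, i]] = M[[i, i - 1]]    # larger pivot goes on top
            if M[i - 1, k] != 0.0:
                M[i] -= (M[i, k] / M[i - 1, k]) * M[i - 1]
                M[i, k] = 0.0
    return M

A = np.array([[10.0, 1.0],
              [1.0, 1.0]])
b = np.array([9.0, 0.0])            # assumed: gives exact solution (1, -1)

A_err = A.copy()
A_err[1, 0] = 1.0e5                 # transient error in the (2,1) element
j = 0                               # erroneous column, assumed already located

# Triangularize [A_err | I]; the trailing block accumulates Z^{-1}.
T = pairwise_pivot_triangularize(np.hstack([A_err, np.eye(2)]))
U_bar, Z_inv = T[:, :2], T[:, 2:]

# Replace the erroneous column of U_bar by Z^{-1} a_j.
H = U_bar.copy()
H[:, j] = Z_inv @ A[:, j]

# H equals Z^{-1} A; solving H x = Z^{-1} b replaces the update step here.
x = np.linalg.solve(H, Z_inv @ b)
print("x =", x)
```

Here the modified matrix `H` is no longer upper triangular in column `j`, which is why a factorization update is still needed in the actual scheme. The difference is that the corrected column is formed directly from $a_{\cdot j}$ rather than as the sum $\bar{U} e_j + p$, whose terms can differ greatly in magnitude.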

6. REMARKS

We presented a unified checksum scheme to detect multiple transient errors in three different matrix triangularizations: the LU decomposition, Gaussian elimination with pairwise pivoting, and the QR decomposition. We established a linear algebraic method for multiple error detection in matrix triangularizations and gave rigorous proofs. Unfortunately, the problem of locating and correcting the errors seems to be intractable with this scheme. However, in many cases, error detection itself is extremely important. The block checksum scheme allows us to locate and correct multiple errors, but it has the limitation that it reduces the portion of the matrix in which multiple errors are allowed. We established tolerances for multiple error detection in the presence of rounding errors. We hope that future research will solve the problem of multiple error location and correction for matrix triangularizations.

REFERENCES

1. Abraham, J. A. Fault tolerance techniques for highly parallel signal processing architectures. In Bromley, K. (Ed.), Highly Parallel Signal Processing Architectures. Proc. SPIE, Vol. 614, pp. 49-65, 1986.
2. Ahmed, H. M., Delosme, J.-M., and Morf, M. Highly concurrent computing structures for matrix arithmetic and signal processing. Computer 15, 1 (Jan. 1982), 65-82.
3. Blahut, R. E. Theory and Practice of Error Control Codes. Addison-Wesley, New York, 1984.
4. Gentleman, W. M., and Kung, H. T. Matrix triangularization by systolic arrays. In Tao, T. F. (Ed.), Real Time Signal Processing IV. Proc. SPIE, Vol. 298, pp. 19-26, 1981.
5. Gill, P. E., Golub, G. H., Murray, W., and Saunders, M. A. Methods for modifying matrix factorizations. Math. Comput. 28 (1974), 505-535.
6. Huang, K.-H., and Abraham, J. A. Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput. C-33 (1984), 518-528.
7. Jou, J.-Y., and Abraham, J. A. Fault-tolerant matrix operations on multiple processor systems using weighted checksums. In Bromley, K. (Ed.), Real Time Signal Processing VII. Proc. SPIE, Vol. 495, pp. 94-101, 1984.
8. Luk, F. T. Algorithm-based fault tolerance for parallel matrix equation solvers. In Miceli, W. J., and Bromley, K. (Eds.), Real Time Signal Processing VIII. Proc. SPIE, Vol. 564, pp. 49-53, 1985.
9. Luk, F. T. A rotation method for computing the QR-decomposition. SIAM J. Sci. Statist. Comput. 7 (1986), 452-459.
10. Luk, F. T., and Park, H. Fault-tolerant matrix triangularization on systolic arrays. IEEE Trans. Comput. (1989), 1434-1438.

11. Luk, F. T., and Park, H. An analysis of algorithm-based fault-tolerance techniques. J. Parallel Distrib. Comput. 5 (1988), 172-184.
12. Park, H. Multiple error algorithm-based fault tolerance for matrix triangularizations. Proc. SPIE Conference on Advanced Algorithms and Architectures for Signal Processing, Vol. III, 1988.
13. Sorensen, D. C. Analysis of pairwise pivoting in Gaussian elimination. IEEE Trans. Comput. C-34 (1985), 274-278.
14. Wilkinson, J. H. The Algebraic Eigenvalue Problem. Oxford Univ. Press, London, 1965.

HAESUN PARK was born in Seoul, Korea. She received the B.S. degree in mathematics (summa cum laude, the top graduate in the College of Natural Sciences) from Seoul National University, Seoul, Korea, in 1981 and the M.S. and Ph.D. degrees in computer science from Cornell University, Ithaca, New York, in 1985 and 1987, respectively. She has been an assistant professor in the Computer Science Department, University of Minnesota, Minneapolis, since 1987. Her current research interests include numerical analysis and parallel processing for scientific computing. She is a member of IEEE, IEEE Computer Society, and SIAM.

Received June 28, 1989; accepted March 4, 1991