Applied Mathematics and Computation 168 (2005) 496–518
www.elsevier.com/locate/amc
Optimal systolic array algorithms for tensor product
P.K. Mishra
Department of Applied Mathematics, Birla Institute of Technology, Mesra, Ranchi 835 215, India
Abstract
In this paper we examine the computational complexity of optimal systolic array algorithms for tensor product. We provide a time-minimal schedule that meets the computed processor lower and upper bounds, including the one for tensor product. Our principal contribution is an algorithm, and its implementation, for finding lower and upper bounds via a generating function, and for finding a processor-time-minimal schedule.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Computational complexity; Systolic algorithms; Tensor product; Lower and upper
bounds
1. Introduction
Systolic arrays are important for application-specific systems; they perform well because they are highly concurrent [54,60]. Their design cost is low because, although they comprise a large number of processors, the processors are of only one type (or a small fixed number of types). Interprocessor communication is between nearest neighbors, and hence has short latency. Nearest-neighbor connections also imply that the VLSI circuit is devoted primarily to the processors, with very little devoted to their interconnection. As a consequence, systolic arrays are ideally suited to VLSI technology [22]. Minimizing the amount of time and the number of processors needed to perform an application reduces the application's fabrication and operation costs. This can be important in applications where minimizing the size and energy of the hardware is critical, such as space applications and consumer electronics [16].

0096-3003/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.amc.2004.09.045
E-mail addresses: [email protected], [email protected]
We consider computations suitable for systolic arrays, often called regular
array computations or systems of uniform recurrence equations [27,31,33]. Paral-
lel execution of uniform recurrence equations has been studied extensively,
from at least as far back as 1966 (e.g., [26,32,30,8,38,10,11,39,40,14]). In such
computations, the tasks to be computed are viewed as the nodes of a directed
acyclic graph (dag), where the data dependencies are represented as arcs. Given
a dag G = (N, A), a multiprocessor schedule assigns node t for processing during step s(t) on processor p(t). A valid multiprocessor schedule is subject to two constraints.

Causality: A node can be computed only when its children have been computed at previous steps:

(u, t) ∈ A ⟹ s(u) < s(t).    (1)

Non-conflict: A processor cannot compute two different nodes during the same time step:

s(t) = s(u) ⟹ p(t) ≠ p(u).    (2)

In what follows, we refer to a valid schedule simply as a schedule. A time-minimal schedule for an algorithm completes in S steps, where S is the number of nodes on a longest path in the dag. A time-minimal schedule exposes an algorithm's maximum parallelism: it bounds the number of time steps the algorithm needs when infinitely many processors may be used. A processor-time-minimal schedule is a time-minimal schedule that uses as few processors as any time-minimal schedule for the algorithm. Although it is only one of many performance measures, processor-time-minimality is useful because it measures the minimum number of processors needed to extract the maximum
parallelism from a dag. Being machine-independent, this measure is a more
fundamental measure than those that depend on a particular machine or archi-
tecture. This view prompted several researchers to investigate processor-time-
minimal schedules for families of dags. Processor-time-minimal systolic arrays are easy to devise in an ad hoc manner for 2D systolic algorithms. This apparently is not the case for 3D systolic algorithms. There have been several publications regarding processor-time-minimal systolic arrays for fundamental 3D algorithms. Processor-time-minimal schedules for various fundamental problems have been proposed in the literature: Scheiman and Cappello [9,5,48,45] examine the dag family for matrix product; Louka and Tchuente [34] examine the dag family for Gauss–Jordan elimination; Scheiman and Cappello [46,47] examine the dag family for transitive closure; Benaini, Robert, Trystram, and Rote [2–4,41,42] examine the dag families for the algebraic path problem and Gaussian elimination. Each of the algorithms listed in Table 1 has the property that, in its dag representation, every node is on a longest path. Therefore, for each algorithm listed, its free schedule is its only time-minimal schedule. A processor lower bound for achieving a time-minimal schedule is thus a processor lower bound for achieving maximum parallelism with the algorithm [6,7].
Clauss et al. [12] developed a set of mathematical tools to help find a processor-time-minimal multiprocessor array for a given dag. Another approach to a general solution has been reported by Wong and Delosme [58,59], and by Shang and Fortes [49]. They present methods for obtaining optimal linear schedules. That is, their processor arrays may be sub-optimal, but they get the best linear schedule possible. Darte et al. [14] show that such schedules are close to optimal, even when the constraint of linearity is relaxed. Geometric/combinatorial formulations of a dag's task domain have been used in various contexts in parallel algorithm design as well (e.g., [26,27,32,38,39,19,18,40,12,49,55,59]; see Fortes et al. [17] for a survey of systolic/array algorithm formulations).

We present an algorithm for finding a lower bound on the number of processors needed to achieve a given schedule of an algorithm. The application of this technique is illustrated with a tensor product computation. Then, we apply
Table 1
Some 3D algorithms for which processor-time-minimal systolic arrays are known

Algorithm                  Citation      Time     Number of processors
Algebraic path problem     [3]           5n − 2   n²/3 + O(n)
Gauss–Jordan elimination   [34]          4n       5n²/18 + O(n)
Gaussian elimination       [3]           3n − 1   n²/4 + O(n)
Matrix product             [9,5]         3n − 2   ⌈3n²/4⌉
Transitive closure         [46]          5n − 4   ⌈n²/3⌉
Tensor product             this article  4n − 3   (2n³ + n)/3
the technique to the free schedule of algorithms for matrix product. For this, we provide a processor schedule that meets these processor lower bounds (i.e., a processor-time-minimal schedule), including the one for tensor product. We finish with some general conclusions and mention some open problems.
One strength of our approach is that it centers around the whole algorithm: we are finding processor lower bounds not just for a particular dag, but for a linearly parameterized family of dags, representing the infinitely many problem sizes for which the algorithm works. Thus, the processor lower bound is not a number but a piecewise polynomial function [1,61,62] of the parameter used to express different problem sizes. The formula can then be used to optimize the implementation of the algorithm, not just a particular execution of it.
The processor lower bounds are produced by:

• Formulating the problem of finding, for a particular time step, the number of processors needed by that time step as the problem of counting the integer points in a convex polyhedron.
• Representing this set of points as the set of solutions to a linearly parameterized set of linear Diophantine equations.
• Computing a generating function for the number of such solutions.
• Deriving a formula from the generating function.
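The first two steps above can be illustrated on a toy problem. The sketch below (plain Python; the example polyhedron and the helper name are ours, not from the paper's software) counts, by brute force, the integer points of a small linearly parameterized polyhedron and checks that the count is a polynomial in the size parameter n:

```python
def lattice_points(n):
    # Integer points of the polyhedron 0 <= i, j, k <= n-1, i + j + k = n - 1:
    # the nodes of a toy 3D computation "frozen" at one time step.
    return sum(1 for i in range(n) for j in range(n) for k in range(n)
               if i + j + k == n - 1)

# The count is not a single number but a polynomial in n: n(n + 1)/2.
for n in range(1, 10):
    assert lattice_points(n) == n * (n + 1) // 2
```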
The ability to compute formulae for the number of solutions to a linearly parameterized set of linear Diophantine equations has other applications for nested loops [13], such as finding the number of instructions executed, the number of memory locations touched, and the number of I/O operations.
The strength of our algorithm, namely its ability to produce formulae, comes, of course, at a price: the computational complexity of the algorithm is exponential in the size of the input (the number of bits in the coefficients of the system of Diophantine equations). The algorithm's complexity, however, is quite reasonable given the complexity of the problem: determining whether there are any integer solutions to the system of Diophantine equations is NP-complete [20]. We produce a formula for the number of such solutions.
Clauss and Loechner [13] independently developed an algorithm for the
problem based on Ehrhart polynomials. In [13], they sketch an algorithm for
the problem, using the "polyhedral library" of Wilde [57].
2. Processor lower bounds
We present a general and uniform technique for deriving lower bounds: given a parameterized dag family and a correspondingly parameterized linear schedule, we compute a formula for a lower bound on the number of processors required by the schedule. This is much more general than the analysis of an optimal schedule for a given specific dag. The lower bounds obtained are good; we know of no dag treatable by this method for which the lower bounds are not also upper bounds. We believe this to be the first reported algorithm, and implementation, for automatically generating such formulae.
The nodes of the dag typically can be viewed as lattice points in a convex polyhedron. Adding to these constraints the linear constraint imposed by the schedule itself results in a linear Diophantine system of the form

Az = nb + c,    (3)

where the matrix A and the vectors b and c are integral, but not necessarily non-negative. The number d_n of solutions in non-negative integers z = [z_1, z_2, ..., z_s]^t to this linear system is a lower bound for the number of processors required when the dag corresponds to parameter n. Our algorithm produces (symbolically) the generating function for the sequence d_n and, from the generating function, a formula for the numbers d_n. We do not make use of any special properties of the system that reflect the fact that it comes from a dag. Thus in (3), A can be taken to be an arbitrary r × s integral matrix, and b and c arbitrary r-dimensional integral vectors. As such, we actually solve a more general combinatorial problem: constructing the generating function Σ_{n≥0} d_n tⁿ, and a formula for d_n, given a matrix A and vectors b and c; the lower bound computation is a special case. There is a large body of literature concerning lattice points in convex polytopes, with numerous interesting results: see, for example, Stanley [51] for Ehrhart polynomials (Clauss and Loechner [13] use these), and Sturmfels [52,53] for vector partitions and other mathematical treatments. Our results are based mainly on MacMahon [35,37] and Stanley [50].
3. Tensor product
We examine the dag family for the 4D mesh: the n × n × n × n directed mesh.
This family is fundamental, representing a communication-localized version of
the standard algorithm for tensor product (also known as Kronecker product).
The tensor product is used in many mathematical computations, including multivariable spline blending and image processing [21], multivariable approximation algorithms (used in graphics, optics, and topography) [29], as well as many recursive algorithms [28].

The tensor product of an n × m matrix A and a matrix B is

          [ a_11 B   a_12 B   ...   a_1m B ]
A ⊗ B =   [   ...      ...    ...     ...  ]    (4)
          [ a_n1 B   a_n2 B   ...   a_nm B ]
The 4D mesh also represents other algorithms, such as the longest common subsequence problem [24,23] for 4 strings, and matrix comparison (an extension of tuple comparison [25]).
3.1. The dag
The 4D mesh can be defined as follows: Gn = (Nn, An), where

Nn = {(i, j, k, l) | 0 ≤ i, j, k, l ≤ n − 1},    (5)

An = {[(i, j, k, l), (i', j', k', l')] | (i, j, k, l) ∈ Nn, (i', j', k', l') ∈ Nn,    (6)

and exactly one of the following conditions holds}:

1. i' = i + 1, j' = j, k' = k, l' = l;    (7)
2. j' = j + 1, i' = i, k' = k, l' = l;    (8)
3. k' = k + 1, i' = i, j' = j, l' = l;    (9)
4. l' = l + 1, i' = i, j' = j, k' = k.    (10)
3.2. The parametric linear Diophantine system of equations
The computational nodes are defined by non-negative integral 4-tuples (i, j, k, l) satisfying

i ≤ n − 1,    (11)
j ≤ n − 1,    (12)
k ≤ n − 1,    (13)
l ≤ n − 1.    (14)

Introducing non-negative integral slack variables s1, s2, s3, s4 ≥ 0, we obtain the equivalent linear Diophantine system describing the computational nodes as

i + s1 = n − 1,
j + s2 = n − 1,
k + s3 = n − 1,
l + s4 = n − 1.    (15)
A linear schedule for the corresponding dag is given by

s(i, j, k, l) = i + j + k + l + 1.    (16)

For this problem s ranges from 1 to 4n − 3. The computational nodes at about the halfway point of the schedule satisfy the additional constraint

i + j + k + l = 2n − 2.    (17)

Adding this constraint, we obtain the augmented Diophantine system

i + j + k + l = 2n − 2,
i + s1 = n − 1,
j + s2 = n − 1,
k + s3 = n − 1,
l + s4 = n − 1.    (18)

Therefore, a lower bound for the number of processors needed for the Tensor Product problem is the number of solutions to (18). The corresponding Diophantine system is

az = nb + c,    (19)

where

    [ 1 1 1 1 0 0 0 0 ]        [ 2 ]        [ −2 ]
    [ 1 0 0 0 1 0 0 0 ]        [ 1 ]        [ −1 ]
a = [ 0 1 0 0 0 1 0 0 ],  b =  [ 1 ],  c =  [ −1 ].    (20)
    [ 0 0 1 0 0 0 1 0 ]        [ 1 ]        [ −1 ]
    [ 0 0 0 1 0 0 0 1 ]        [ 1 ]        [ −1 ]
3.3. The Mathematica program input and output
The DiophantineGF program, run on this data, gives:

In[1]:= << DiophantineGF.m
Loaded. Type "help" for instructions, "example" for examples.

In[2]:= a = {{1,1,1,1,0,0,0,0}, {1,0,0,0,1,0,0,0}, {0,1,0,0,0,1,0,0},
             {0,0,1,0,0,0,1,0}, {0,0,0,1,0,0,0,1}};

In[3]:= b = {2,1,1,1,1}; c = {-2,-1,-1,-1,-1};

In[4]:= DiophantineGF[a, b, c]

Out[4] = t(1 + t)² / (1 − t)⁴

In[5]:= formula

Binomial Formula: C[n, 3] + 2 C[1 + n, 3] + C[2 + n, 3]
Power Formula: n(2n² + 1)/3

Therefore, (2n³ + n)/3 is a processor lower bound for the Tensor Product problem.
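For small n, this lower bound can be confirmed by exhaustive enumeration; the sketch below (plain Python, ours) counts the solutions of system (18) directly, i.e., the mesh nodes on the time slice i + j + k + l = 2n − 2:

```python
def central_layer_count(n):
    """Number of nodes (i, j, k, l) of the n x n x n x n mesh with
    i + j + k + l = 2n - 2, i.e., the number of solutions of (18)."""
    return sum(1 for i in range(n) for j in range(n)
                 for k in range(n) for l in range(n)
               if i + j + k + l == 2 * n - 2)

# Matches the formula (2n^3 + n)/3 produced by the generating function.
for n in range(1, 8):
    assert central_layer_count(n) == (2 * n**3 + n) // 3
```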
4. General formulation
We now generalize this example and consider the problem of computing a lower bound for the number of processors needed to satisfy a given linear schedule. That is, we show how to automatically construct a formula for the number of lattice points inside a linearly parameterized family of convex polyhedra, by automatically constructing a formula for the number of solutions to the corresponding linearly parameterized system of linear Diophantine equations. The algorithm for doing this, and its implementation, is, we believe, a significant contribution.

The use of linear Diophantine equations is well motivated: the computations of an inner loop are typically defined over a set of indices that can be described as the lattice points in a convex polyhedron [56]. Indeed, in two languages, SDEF [15] and ALPHA [55], one expressly defines domains of computation as the integer points contained in some programmer-specified convex polyhedron.
The general setting exemplified by the Tensor Product problem is as follows. Suppose a (also denoted by A) is an r × s integral matrix, and b and c are r-dimensional integral vectors. Suppose further that, for every n ≥ 0, the linear Diophantine system

az = nb + c,    (21)

i.e.,

a_11 z_1 + a_12 z_2 + ... + a_1s z_s = b_1 n + c_1,
a_21 z_1 + a_22 z_2 + ... + a_2s z_s = b_2 n + c_2,
...
a_r1 z_1 + a_r2 z_2 + ... + a_rs z_s = b_r n + c_r,    (22)

in the non-negative integral variables z_1, z_2, ..., z_s, has a finite number of solutions. Let d_n denote the number of solutions for n. The generating function of the sequence d_n is

f(t) = Σ_{n≥0} d_n tⁿ.    (23)

For a linear Diophantine system of the form (22), f(t) is always a rational function, and we provide an algorithm to compute f(t) symbolically. The Mathematica program implementing the algorithm also constructs a formula for the numbers d_n from this generating function.
Given a nested for loop, the procedure is informally as follows:

1. Write down the node space as a system of linear inequalities. The loop bounds must be affine functions of the loop indices. The domain of computation is represented by the set of lattice points inside the convex polyhedron described by this system of linear inequalities.
2. Eliminate unnecessary constraints by translating the loop indices (so that 0 ≤ i ≤ n − 1 as opposed to 1 ≤ i ≤ n, for example). The reason for this is that the inequality 0 ≤ i is implicit in our formulation, whereas 1 ≤ i introduces an additional constraint.
3. Transform the system of inequalities to a system of equalities by introducing non-negative slack variables, one for each inequality.
4. Augment the system with a linear schedule for the associated dag, "frozen" at some intermediate time value s = s(n).
5. Run the program DiophantineGF.m on the resulting data. The program calculates the rational generating function

f(t) = Σ d_n tⁿ,    (24)

where d_n is the number of solutions to the resulting linear system of Diophantine equations, and produces a formula for d_n.
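Steps 1-4 can be mechanized. The sketch below (plain Python, ours; the function name is hypothetical) builds the augmented system (a, b, c) for a depth-d loop nest with bounds 0..n−1, with the schedule frozen at i_1 + ... + i_d = 2n − 2, and checks that for depth 4 it reproduces system (20):

```python
def build_system(depth):
    """Slack-variable Diophantine system a.z = n*b + c for a depth-nested
    loop with bounds 0..n-1, augmented with the frozen linear schedule
    i_1 + ... + i_d = 2n - 2 (the tensor-product choice)."""
    # Schedule row: coefficient 1 on each loop index, 0 on each slack variable.
    rows = [[1] * depth + [0] * depth]
    b, c = [2], [-2]
    # One row i_r + s_r = n - 1 per loop index (step 3).
    for r in range(depth):
        row = [0] * (2 * depth)
        row[r] = 1           # loop index i_r
        row[depth + r] = 1   # slack variable s_r
        rows.append(row)
        b.append(1)
        c.append(-1)
    return rows, b, c

a, b, c = build_system(4)
assert a == [[1,1,1,1,0,0,0,0],
             [1,0,0,0,1,0,0,0],
             [0,1,0,0,0,1,0,0],
             [0,0,1,0,0,0,1,0],
             [0,0,0,1,0,0,0,1]]
assert b == [2,1,1,1,1] and c == [-2,-1,-1,-1,-1]
```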
4.1. The algorithm
We demonstrate the algorithm on a specific instance, and sketch its proof.
Consider the linear Diophantine system

z1 − 2 z2 = n,
z1 + z2 = 2n,    (25)

in which z1 and z2 are non-negative integers. Let d_n denote the number of solutions to (25). Associate indeterminates λ1 and λ2 to the first and second equations, respectively, and indeterminates t1 and t2 to the first and second columns of the system. Consider the product of geometric series
R = 1/((1 − λ1 λ2 t1)(1 − λ1⁻² λ2 t2)) = ( Σ_{a1≥0} (λ1 λ2 t1)^{a1} )( Σ_{a2≥0} (λ1⁻² λ2 t2)^{a2} ),    (26)

where the exponents of λ1 and λ2 in the first factor are the coefficients in the first column, and the exponents of λ1 and λ2 in the second factor are the coefficients in the second column. Individual terms arising from this product are of the form

λ1^{a1 − 2a2} λ2^{a1 + a2} t1^{a1} t2^{a2},    (27)
where a1 and a2 are non-negative integers. Following MacMahon [36], we make use of the operator Ω, which picks out those terms (27) in the power series expansion whose exponents of λ1 and λ2 are both equal to zero (this is the λ-free part of the expansion). Thus, the contribution of the term in (27) to Ω(R) is non-zero if and only if the exponents of λ1 and λ2 are equal to zero. If this is the case, the contribution is t1^{a1} t2^{a2}, and z1 = a1, z2 = a2 is then a solution to the homogeneous system

z1 − 2 z2 = 0,
z1 + z2 = 0.    (28)

(Note that there is only a single solution to (28) in this case, but this does not affect the general nature of the demonstration of the algorithm on this example.) This means, in particular, that what MacMahon calls the "crude" generating function of the solutions to the homogeneous system (28) is

1/((1 − λ1 λ2 t1)(1 − λ1⁻² λ2 t2))    (29)

and

Ω( 1/((1 − λ1 λ2 t1)(1 − λ1⁻² λ2 t2)) ) = Σ t1^{a1} t2^{a2},    (30)

where the summation is over all solutions z1 = a1 and z2 = a2 of (28). Let

Rn = λ1⁻ⁿ λ2⁻²ⁿ R,    (31)

where the exponents of λ1 and λ2 are the negatives of the right-hand sides of the first and second equations of (25), respectively. Then

Ω(Rn) = Σ t1^{a1} t2^{a2},    (32)

where now the summation is over all non-negative integral solutions z1 = a1 and z2 = a2 of (25), since generic terms arising from the expansion of Rn are now of the form

λ1^{a1 − 2a2 − n} λ2^{a1 + a2 − 2n} t1^{a1} t2^{a2}.    (33)
If we let t1 = t2 = 1, then Ω(Rn) specializes to the number of solutions d_n of (25). Let L denote the substitution operator that sets each ti equal to 1. Then

d_n = L Ω(Rn)    (34)

and the operator Ω commutes both with the L operation and with addition of series. Thus,

f(t) = Σ_{n≥0} L Ω(Rn) tⁿ    (35)
     = L Ω( Σ_{n≥0} Rn tⁿ )    (36)
     = Ω( [1/((1 − λ1 λ2)(1 − λ1⁻² λ2))] Σ_{n≥0} λ1⁻ⁿ λ2⁻²ⁿ tⁿ ).    (37)

Since

Σ_{n≥0} λ1⁻ⁿ λ2⁻²ⁿ tⁿ = 1/(1 − λ1⁻¹ λ2⁻² t),    (38)

the generating function f(t) can be obtained by applying the operator Ω to the crude generating function

F = 1/((1 − λ1 λ2)(1 − λ1⁻² λ2)(1 − λ1⁻¹ λ2⁻² t)).    (39)

Now we make use of an identity that appears in Stanley [50] for the computation of the homogeneous case above, namely

1/((1 − A)(1 − B)) = 1/((1 − AB)(1 − A)) + 1/((1 − AB)(1 − B)) − 1/(1 − AB).    (40)

We demonstrate the use of this identity on the example at hand. Taking the first two factors of (39) as (1 − A)⁻¹ and (1 − B)⁻¹ (i.e., A = λ1 λ2, B = λ1⁻² λ2) and using (40), we obtain

F = 1/((1 − λ1⁻¹ λ2²)(1 − λ1 λ2)(1 − λ1⁻¹ λ2⁻² t))
  + 1/((1 − λ1⁻¹ λ2²)(1 − λ1⁻² λ2)(1 − λ1⁻¹ λ2⁻² t))
  − 1/((1 − λ1⁻¹ λ2²)(1 − λ1⁻¹ λ2⁻² t)),    (41)

which we can write as

F = F1 + F2 − F3,    (42)
where F1, F2, and F3 denote the three summands above. By additivity,

f(t) = Ω(F) = Ω(F1) + Ω(F2) − Ω(F3).    (43)

Continuing this way by using the identity (40), this time on F3 with (1 − A)⁻¹ and (1 − B)⁻¹ as its two factors, we obtain the expansion

F3 = 1/((1 − λ1⁻² t)(1 − λ1⁻¹ λ2²)) + 1/((1 − λ1⁻² t)(1 − λ1⁻¹ λ2⁻² t)) − 1/(1 − λ1⁻² t)    (44)
   = F31 + F32 − F33.    (45)

Call a product of the form

1/((1 − A)(1 − B) ... (1 − Z))    (46)

that may arise during this process uniformly signed if the exponents of λ1 that appear in A, B, ..., Z are either all non-negative or all non-positive; the exponents of λ2 that appear in A, B, ..., Z are either all non-negative or all non-positive; and so on. Clearly, if U is such a uniformly signed product, then Ω(U) is obtained from U by discarding the factors that are not purely functions of t, as there can be no "cross cancellation" of any of the terms coming from the different expansions into geometric series of the factors (1 − A)⁻¹, (1 − B)⁻¹, ..., (1 − Z)⁻¹ of U.

The idea, then, is to use identity (40) repeatedly, on pairs of appropriate factors, in such a way that the resulting products of the form (46) are all uniformly signed. The contribution of a uniformly signed product to f(t) is simply the product of the terms in it that are functions of t only; all other factors can be ignored. Each of the summands of F3 given in (45) above, for example, is uniformly signed. Since none of them contains a factor that is a pure function of t, the contribution of each is zero.
The problem is to pick the (1 − A)⁻¹, (1 − B)⁻¹ pairs at each step appropriately, to make sure that the process eventually ends with uniformly signed products only. This cannot be done arbitrarily, however. For example, in the application of the identity (40) to

1/((1 − λ1⁻¹ λ2)(1 − λ1² λ2)(1 − λ1 λ2⁻¹)),    (47)

with

1 − A = 1 − λ1⁻¹ λ2    (48)

and

1 − B = 1 − λ1² λ2,    (49)

(in which the λ1 exponents have opposite signs), one of the three terms produced by the identity, to be further processed, is

1/((1 − λ1⁻¹ λ2)(1 − λ1 λ2²)(1 − λ1 λ2⁻¹)),    (50)

which is of the same mixed-sign form as (47). In particular, the weight argument in Stanley [50] does not yield an algorithm unless the λi are processed to completion in a fixed ordering of the indices i (e.g., first all exponents of λ1 are made of the same sign, then those of λ2, etc.).

Accordingly, we use the following criterion: given a term of the form (46), pick the λi with the smallest i for which both a negative and a positive exponent appear among A, B, ..., Z. Take the two extremes (i.e., the maximum positive and minimum negative exponents) of such opposite-signed factors (1 − A)⁻¹ and (1 − B)⁻¹ of the current term in (46), and apply identity (40) with this choice of A and B. This computational process results in a ternary tree whose leaves, after the application of the operator Ω, are functions of t only. The generating function f(t) can then be read off as the (signed) sum of the functions that appear at the leaf nodes, after the functions of t at the leaf nodes of the resulting ternary tree are summed up and the necessary algebraic simplifications are carried out. For this example, the generating function is

f(t) = 1/((1 − t)(1 + t + t²)).    (51)
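The generating function (51) equals 1/(1 − t³), so d_n is 1 when 3 divides n and 0 otherwise. A brute-force count of the solutions of (25) (plain Python, ours) confirms this:

```python
def d(n):
    # Count non-negative integer solutions of z1 - 2*z2 = n, z1 + z2 = 2n.
    # z1 + z2 = 2n bounds both variables by 2n, so the search is finite.
    return sum(1 for z1 in range(2 * n + 1) for z2 in range(2 * n + 1)
               if z1 - 2 * z2 == n and z1 + z2 == 2 * n)

# Coefficients of 1/(1 - t^3): 1, 0, 0, 1, 0, 0, 1, ...
for n in range(13):
    assert d(n) == (1 if n % 3 == 0 else 0)
```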
In the case above, c = 0. Now we consider the more general case with c ≠ 0. For these instances the description and the proof of the algorithm are not much harder, but the extra computational effort required justifies the use of a symbolic algebra package.

As an example, consider the Diophantine system

z1 − 2 z2 = n − 2,
z1 + z2 = 2n + 3.    (52)
As before, let d_n be the number of solutions to (52) in non-negative integers z1, z2, and let f(t) be the generating function of the d_n. As in the derivation of the identity (37) for f(t), this time we obtain

f(t) = Ω( [1/((1 − λ1 λ2)(1 − λ1⁻² λ2))] Σ_{n≥0} λ1⁻ⁿ⁺² λ2⁻²ⁿ⁻³ tⁿ ).    (53)

Since

Σ_{n≥0} λ1⁻ⁿ⁺² λ2⁻²ⁿ⁻³ tⁿ = λ1² λ2⁻³ / (1 − λ1⁻¹ λ2⁻² t),    (54)

the generating function f(t) is obtained by applying the operator Ω to the crude generating function

F = λ1² λ2⁻³ / ((1 − λ1 λ2)(1 − λ1⁻² λ2)(1 − λ1⁻¹ λ2⁻² t)).    (55)
Now we proceed as before using the identity (40), ignoring the numerator for the time being. It is no longer true that there can be no "cross cancellation" among the terms coming from the different expansions into geometric series of the factors (1 − A)⁻¹, (1 − B)⁻¹, ..., (1 − Z)⁻¹ in a product U of the form (46), even if the product is uniformly signed. It could be that the exponents of λ1 that appear in U are all negative, and the exponents of λ2 that appear in U are all positive, and yet there can be λ-free terms arising from the expansions of the products that involve λ's, since the numerator λ1² λ2⁻³ can cancel terms of the form λ1⁻² λ2³ tᵏ that may be produced when we expand the factors into geometric series and multiply. The application of Ω would then contribute tᵏ from this term to the final result coming from U, for example. The important observation is that the geometric series expansion of the terms that involve λ in U need not be carried out past powers of λ1 larger than 2, or past powers of λ2 smaller than −3. This means that we need to keep track of only a polynomial in λ1, λ2, and t before the application of Ω to find the λ-free part contributed by this leaf node. In this case, this contribution may involve a polynomial in t as well. Therefore, when c ≠ 0, we need to calculate with truncated Taylor expansions at the leaf nodes of the computation tree. It is this aspect of the algorithm that is handled most efficiently (in terms of coding effort) by a symbolic algebra package such as Mathematica.
5. Complexity and remarks
The number of leaves in the generated ternary tree is exponential in

n = Σ_{a_i} |a_i|,    (56)

where {a_i} is the set of coefficients describing the set of Diophantine equations. The depth of recursion can be reduced somewhat when the columns to be used are picked carefully. It is also possible to prune the tree when the input vector c determines that there can be no λ-free terms resulting from the current matrix (e.g., some row is all strictly positive or all strictly negative with c = 0, or the row elements are weakly negative but the corresponding c_i is positive, etc.). Furthermore, the set of coefficients describing the Diophantine system coming from an array computation is not unique. Translating the polyhedron, and omitting superfluous constraints (i.e., those not in its transitive reduction), reduces the algorithm's work. Additional preprocessing may be possible (e.g., via some unitary transform).
The fact that the algorithm has worst-case exponential running time is not surprising, however; the simpler computation, "Are any processors scheduled for a particular time step?", which is equivalent to "Is a particular coefficient of the series expansion of the generating function non-zero?", is already known to be an NP-complete problem [43,20]. This computational complexity is further ameliorated by the observation that, since a formula can be automatically produced from the generating function, it needs to be constructed only once for a given algorithm. In practice, array algorithms typically have a description that is sufficiently succinct to make this automated formula production feasible.
5.1. Matrix product
Program fragment: The standard program fragment for computing the matrix product C = A·B, where A and B are given n × n matrices, is given below:

for i = 0 to n−1 do
  for j = 0 to n−1 do
    C[i, j] ← 0;
    for k = 0 to n−1 do
      C[i, j] ← C[i, j] + A[i, k]·B[k, j]
    endfor;
  endfor;
endfor;
Dag: The cube-shaped 3D mesh (i.e., the n × n × n mesh) can be defined as follows:

G_{n×n×n} = (N, A),    (57)

where

N = {(i, j, k) | 1 ≤ i, j, k ≤ n},    (58)
A = {[(i, j, k), (i', j', k')]},    (59)

where exactly one of the following conditions holds:

1. i' = i + 1;    (60)
2. j' = j + 1;    (61)
3. k' = k + 1;    (62)

for 1 ≤ i, j, k ≤ n.
5.2. Lower bound
Parameterized linear Diophantine system of equations: The computational nodes are defined by non-negative integral triplets (i, j, k) satisfying

i ≤ n − 1,    (63)
j ≤ n − 1,    (64)
k ≤ n − 1.    (65)

Mathematica input/output: The DiophantineGF run for even n gives:
In[1]:= << DiophantineGF.m
Loaded. Type "help" for instructions, "example" for examples.

In[2]:= a = {{2,2,2,0,0,0}, {1,0,0,1,0,0}, {0,1,0,0,1,0}, {0,0,1,0,0,1}};

In[3]:= b = {3,1,1,1}; c = {-2,-1,-1,-1};

In[4]:= DiophantineGF[a, b, c]

Out[4] = −3t²(1 + t)² / ((−1 + t)³(1 + t)³)
In[5]:= formula

The formula returned is a linear combination, with integer coefficients, of binomial coefficients of the forms C[2 + (n − k)/2, 2] and C[n − k, 2], divided by 16.
Recall that C[x, k] denotes the binomial coefficient

C[x, k] = x! / (k!(x − k)!),    (66)

when x is a non-negative integer, and zero otherwise. This means that in the above formula C[2 + (n − 9)/2, 2], for example, vanishes for even n. In this way the formula simplifies to

(3/4)n²    (n even).
For odd n we obtain the generating function

Out[4] = −6t³ / ((−1 + t)³(1 + t)³)

and a similar formula, which simplifies to

(3/4)n² − 3/4    (n odd).

These cases can be combined to obtain the processor lower bound ⌊3n²/4⌋.
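Both cases can be checked by direct enumeration of the frozen time slice. The sketch below (plain Python, ours; the choice of the near-middle slice i + j + k = ⌊(3n − 1)/2⌋ is our reconstruction of the slice used above, which for even n is exactly 2(i + j + k) = 3n − 2) counts the mesh nodes on that slice:

```python
def slice_count(n):
    """Nodes (i, j, k) of the n x n x n mesh on the near-middle time
    slice i + j + k = floor((3n - 1)/2) of the schedule t = i + j + k."""
    s = (3 * n - 1) // 2
    return sum(1 for i in range(n) for j in range(n) for k in range(n)
               if i + j + k == s)

# (3/4)n^2 for even n and (3/4)n^2 - 3/4 for odd n, i.e., floor(3n^2/4).
for n in range(2, 12):
    assert slice_count(n) == (3 * n * n) // 4
```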
5.3. Upper bound
Schedule: The map m : N → Z³ (i.e., from nodes to processor-time) can be defined formally as

(time, space1, space2) = (t(i, j, k), s1(i, j, k), s2(i, j, k)),    (67)

where

t(i, j, k) = i + j + k − 2,    (68)

s1(i, j) = (i + j − [n/2] − 1) mod n,    (69)

s2(i, j) = i − j,      if n is even or [n/2] + 1 ≤ i + j ≤ [3n/2];
         = i − j + 1,  if n is odd and i + j < [n/2] + 1;
         = i − j − 1,  if n is odd and i + j > [3n/2].    (70)
The directed 4D mesh clearly has 4n − 3 nodes on a longest path, so any time-minimal schedule for this computation completes in 4n − 3 steps. Moreover, any time-minimal schedule of the 4D mesh requires at least (2/3)n³ + n/3 processors. We now show that there is a systolic array that achieves this lower bound, i.e., a processor-time-minimal multiprocessor schedule.
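The 4n − 3 step count can be verified by dynamic programming over the 4D mesh for small n (plain Python, ours):

```python
from itertools import product

def longest_path_nodes(n):
    """Number of nodes on a longest path of the n^4 directed mesh.
    Sorting nodes by coordinate sum gives a topological order."""
    best = {}
    for node in sorted(product(range(n), repeat=4), key=sum):
        preds = []
        for axis in range(4):
            if node[axis] > 0:
                p = list(node)
                p[axis] -= 1
                preds.append(best[tuple(p)])
        best[node] = 1 + max(preds, default=0)
    return max(best.values())

# Longest path: 4(n - 1) arcs, hence 4n - 3 nodes.
for n in range(1, 6):
    assert longest_path_nodes(n) == 4 * n - 3
```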
We map the 4D mesh onto a 3D mesh of processors with the map m : Nn → N⁴. Given a mesh node (i, j, k, l) ∈ Nn, m(i, j, k, l) produces a time step (its first component) and a processor location (its last 3 components). The map m is defined as follows:
  m(i,j,k,l) = (time, p1, p2, p3) = (s(i,j,k,l), p1(i,j,k), p2(i,j,k), p3(i,j,k)),  (71)

where

  s(i,j,k,l) = i + j + k + l,                                        (72)

  p1(i,j,k) = (i + j + k - 1) mod n,                                 (73)

  p2(i,j,k) = 2j + k - n + 2,   if i + j + k < n - 1;
              i - j,            if n - 1 <= i + j + k <= 2n - 2;     (74)
              2j + k - 2n + 1,  if 2n - 2 < i + j + k;

  p3(i,j,k) = i + j,            if i + j + k < n - 1;
              k,                if n - 1 <= i + j + k <= 2n - 2;     (75)
              i + j - n + 1,    if 2n - 2 < i + j + k.
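Taking the mesh indices to be 0-based (so i + j + k ranges over 0..3n-3, an assumption consistent with the region boundaries n - 1 and 2n - 2), the mapping can be checked mechanically: it should assign distinct (time, processor) tuples to all n^4 nodes, use exactly 4n - 3 time steps, and occupy exactly (2/3)n^3 + n/3 processors. A hedged Python sketch:

```python
from itertools import product

def m(i, j, k, l, n):
    """Eqs. (71)-(75): map 4D mesh node (0-based) to (time, p1, p2, p3)."""
    s = i + j + k + l                      # Eq. (72): time step
    p1 = (i + j + k - 1) % n               # Eq. (73): processor layer
    c = i + j + k
    if c < n - 1:                          # lower tetrahedron
        p2, p3 = 2 * j + k - n + 2, i + j
    elif c <= 2 * n - 2:                   # middle region (processor space)
        p2, p3 = i - j, k
    else:                                  # upper tetrahedron
        p2, p3 = 2 * j + k - 2 * n + 1, i + j - n + 1
    return s, p1, p2, p3

for n in range(2, 6):
    nodes = list(product(range(n), repeat=4))
    images = [m(*node, n) for node in nodes]
    assert len(set(images)) == len(nodes)                 # conflict-free
    assert len({img[0] for img in images}) == 4 * n - 3   # time-minimal
    assert len({img[1:] for img in images}) == (2 * n**3 + n) // 3  # bound met
```

The three assertions together verify that the schedule is processor-time-minimal for these small mesh sizes.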
We now discuss how this mapping [44] was derived and also give a geomet-
rical interpretation.
One way to find a processor-time-minimal schedule for a 4-dimensional
cubical mesh is as follows:
• First, we assume that the n points (i,j,k,0) to (i,j,k,n - 1) (for every fixed
i,j,k) are mapped to the same processor. This gives us n^3 columns of nodes to be mapped.
• As shown previously, there are at least (2/3)n^3 + n/3 nodes which must
be computed at the same time step, in a time-minimal schedule. Therefore,
there are (2/3)n^3 + n/3 columns containing these nodes (since the
time steps corresponding to each node in a column are unique). We choose
these (2/3)n^3 + n/3 columns as our processor space. This accomplishes 2
things:
(1) It assigns each of these columns of nodes to the processor that computes it, and
(2) It determines the shape of the 3-dimensional processor array (before any
topological changes).
• We then map all of the remaining n^3 - ((2/3)n^3 + n/3) = (n^3 - n)/3 columns to the processors, without violating the scheduling constraints.
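The column counts in these steps are easy to confirm: with 0-based indices, exactly (2/3)n^3 + n/3 of the n^3 columns lie in the middle region n - 1 <= i + j + k <= 2n - 2, and the remainder split evenly between the two tetrahedra. A short Python check (mine, not the paper's):

```python
from itertools import product

def count_columns(n):
    """Classify the n^3 columns (i,j,k), 0-based, into the three regions."""
    lower = middle = upper = 0
    for i, j, k in product(range(n), repeat=3):
        c = i + j + k
        if c < n - 1:
            lower += 1       # below the hyperplane i+j+k = n-1
        elif c <= 2 * n - 2:
            middle += 1      # the processor space
        else:
            upper += 1       # above the hyperplane i+j+k = 2n-2
    return lower, middle, upper

for n in range(2, 8):
    lower, middle, upper = count_columns(n)
    assert middle == (2 * n**3 + n) // 3      # columns kept as processors
    assert lower == upper == (n**3 - n) // 6  # one tetrahedron each
    assert lower + upper == (n**3 - n) // 3   # columns left to remap
```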
The mapping m was derived by following the 3 steps above. The last step has
many valid solutions. We have chosen one that is convenient. The mapping m
has the following geometrical interpretation:
As previously mentioned, we collapse the 4-dimensional space to 3 dimen-
sions, by only concerning ourselves with (i, j,k) columns, where each of the n
points in these columns is mapped to the same processor. For that reason, l
is not a factor in the space mapping.
We divide this 3-dimensional cube of columns into 3 regions: the tetrahedron formed by [the convex hull of] the columns below the hyperplane i + j + k = n - 1, the tetrahedron formed by the columns above the hyperplane i + j + k = 2n - 2, and the remaining middle region. The middle region is the processor space.
The middle region has 2 triangular faces, along the planes i + j + k = n � 1
and i + j + k = 2n � 2. These 2 faces become the bottom and top layers of pro-
cessors. The planes between these, defined by i + j + k = c, n � 1 < c < 2n � 2,
make up the remaining layers of processors, for a total of n layers. Columns are
mapped to the correct layer by the mapping function p1(i,j,k).
We map the lower tetrahedron into the processor space by first translating it
along the vector (1,1,1) until its upper face, defined by the plane i + j +
k = n � 2, lies in the same plane as the middle region�s upper face, defined
by the plane i + j + k = 2n - 2. We then rotate the tetrahedron 180 degrees
about the line i - j = k, and translate it again so that the integer points of
the tetrahedron lie on integer points of the middle region. This second transla-
tion is actually done on a plane-by-plane basis for the planes i + j + k = c, to
assure a convenient mapping. These transformations are done by the mapping functions p2(i,j,k) and p3(i,j,k).
The upper tetrahedron is mapped to the processor space similarly, except
that it is translated down to the lower part of the middle region.
6. Conclusion
Given a nested loop program whose underlying computation dag has nodes representable as lattice points in a convex polyhedron, and a multiprocessor
schedule for these nodes that is linear in the loop indices, we produce a formula
for the number of lattice points in the convex polyhedron that are scheduled
for a particular time step (which is a lower bound on the number of processors
needed to satisfy the schedule). This is done by constructing a system of para-
metric linear Diophantine equations whose solutions represent the lattice
points of interest. Our principal contribution to lower bounds is the algorithm
and its implementation for constructing the generating function from which a formula for the number of these solutions is produced.
Several examples illustrate the relationship between nested loop programs
and Diophantine equations, and are annotated with the output of a Mathe-
matica program that implements the algorithm. The algorithmic relationship
between the Diophantine equations and the generating function is illustrated
with a simple example. A proof of the algorithm's correctness is sketched while
illustrating its steps. The algorithm's time complexity is exponential. However,
this computational complexity should be seen in light of two facts:

• Deciding if a time step has any nodes associated with it is NP-complete; we
construct a formula for the number of such nodes.
• This formula is a processor lower bound, not just for one instance of a sche-
dule computation but for a parameterized family of such computations.
In bounding the number of processors needed to satisfy a linear multi-pro-
cessor schedule for a nested loop program, we actually derived a solution to a
more general linear Diophantine problem. This leads to some interesting com-
binatorial questions of rationality and algorithm design based on more general
systems of Diophantine equations.

Another direction of research concerns optimizing processor-time-minimal
schedules: finding a processor-time-minimal schedule with the highest through-
put, that is, a period-processor-time-minimal schedule. While such a schedule has been
found and proven optimal in the case of square matrix product, this area is
otherwise open. Another area concerns k-dimensional meshes. We have gener-
alized the square mesh lower bounds, yielding a bound for a square mesh of
any fixed dimension k.
References
[1] A.V. Aho, J.E. Hopcroft, J.D. Ullman, The Design and Analysis of Computer Algorithms,
Addison-Wesley Publishing Co, Reading, Mass, 1974.
[2] A. Benaini, Y. Robert, Space-time-minimal systolic arrays for Gaussian elimination and the
algebraic path problem, Parallel Comput. 15 (1990) 211–225.
[3] A. Benaini, Y. Robert, Spacetime-minimal systolic arrays for gaussian elimination and the
algebraic path problem, in: Proc. Int. Conf. On Application Specific Array Processors, IEEE
Computer Society, Princeton, 1990, pp. 746–757.
[4] A. Benaini, Y. Robert, B. Tourancheau, A new systolic architecture for the algebraic path
problem, in: J.V. McCanny, J. McWhirter, E.E. Swartzlander Jr. (Eds.), Systolic Array
Processors, Prentice-Hall, Killarney, Ireland, 1989, pp. 73–82.
[5] P. Cappello, A processor-time-minimal systolic array for cubical mesh algorithms, IEEE Trans.
Parallel Distributed Syst. 3 (1) (1992) 4–13 (Erratum: 3 (3) (1992) 384).
[6] P. Cappello, Omer Egecioglu, Processor lower bound formulas for array computations and
parametric Diophantine systems, Int. J. Found. Comput. Sci. 9 (4) (1998) 351–375.
[7] P. Cappello, Omer Egecioglu, C. Scheiman, Processor-time-optimal systolic arrays, J. Parallel
Algorithms Appl.
[8] P.R. Cappello, VLSI architectures for digital signal processing. Ph.D. thesis, Princeton
University, Princeton, NJ, October 1982.
[9] P.R. Cappello, A spacetime-minimal systolic array for matrix product, in: J.V. McCanny, J.
McWhirter, E.E. Swartzlander Jr. (Eds.), Systolic Array Processors, Prentice-Hall, Killarney,
Ireland, 1989, pp. 347–356.
[10] P.R. Cappello, K. Steiglitz, Unifying VLSI array design with linear transformations, in: H.J.
Siegel, L. Siegel, (Eds.), Proc. Int. Conf. on Parallel Processing, Bellaire, MI, August 1983, pp.
448–457.
[11] P.R. Cappello, K. Steiglitz, Unifying VLSI array design with linear transformations of space-
time, in: F.P. Preparata (Ed.), Advances in Computing Research, VLSI Theory, vol. 2, JAI
Press, Inc., Greenwich, CT, 1984, pp. 23–65.
[12] Ph. Clauss, C. Mongenet, G.R. Perrin, Calculus of space-optimal mappings of systolic
algorithms on processor arrays, in: Proc. Int. Conf. on Application Specific Array Processors,
IEEE Computer Society, Princeton, 1990, pp. 4–18.
[13] P. Clauss, V. Loechner, Parametric analysis of polyhedral iteration spaces, J. VLSI Signal
Process. 19 (1998) 179–194.
[14] A. Darte, L. Khachiyan, Y. Robert, Linear scheduling is close to optimal, in: J. Fortes, E.
Lee, T. Meng (Eds.), Application Specific Array Processors, IEEE Computer Society Press,
1992, pp. 37–46.
[15] B.R. Engstrom, P.R. Cappello, The SDEF programming system, J. Parallel Distributed
Comput. 7 (1989) 201–231.
[16] R.W. Floyd, Algorithm 97: shortest path, Commun. ACM 5 (1962).
[17] J.A.B. Fortes, K.-S. Fu, B.W. Wah, Systematic design approaches for algorithmically
specified systolic arrays, in: V.M. Milutinovic (Ed.), Computer Architecture: Concepts and
Systems, North-Holland, Elsevier Science Publishing Co., New York, NY 10017, 1988, pp.
454–494 (Chapter 11).
[18] J.A.B. Fortes, D.I. Moldovan, Parallelism detection and algorithm transformation techniques
useful for VLSI architecture design, J. Parallel Distributed Comput. 2 (1985) 277–301.
[19] J.A.B. Fortes, F. Parisi-Presicce, Optimal linear schedules for the parallel execution of
algorithms, Int. Conf. Parallel Processing (1984) 322–328.
[20] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-
Completeness, W.H. Freeman, San Francisco, CA, 1979.
[21] J. Granata, M. Conner, R. Tolimieri, Recursive fast algorithm and the role of the tensor
product, IEEE Trans. Signal Process. 40 (12) (1992) 2921–2930.
[22] L.J. Guibas, H.-T. Kung, C.D. Thompson, Direct VLSI implementation of combinatorial
algorithms, Proc. Caltech Conf. on VLSI (1979) 509–525.
[23] D.S. Hirschberg, Recent results on the complexity of common-subsequence problems, in: D.
Sankoff, J.B. Kruskal (Eds.), Time Warps, String Edits, and Macromolecules: The Theory and
Practice of Sequence Comparison, Addison-Wesley, Reading, Mass, 1983.
[24] O.H. Ibarra, M. Palis, VLSI algorithms for solving recurrence equations and applications,
IEEE Trans. Acoust, Speech, Signal Process. ASSP-35 (7) (1987) 1046–1064.
[25] G.-J. Li, B.W. Wah, The design of optimal systolic algorithms, IEEE Trans. Comput. C-34 (1)
(1985) 66–77.
[26] R.M. Karp, R.E. Miller, S. Winograd, Properties of a model for parallel computations:
determinacy, termination, queueing, SIAM J. Appl. Math. 14 (1966) 1390–1411.
[27] R.M. Karp, R.E. Miller, S. Winograd, The organization of computations for uniform
recurrence equations, J. ACM 14 (1967) 563–590.
[28] E.V. Krishnamurthy, M. Kunde, M. Schimmler, H. Schroder, Systolic algorithm for tensor
products of matrices: implementation and applications, Parallel Comput. 13 (3) (1990) 301–
308.
[29] E.V. Krishnamurthy, H. Schroder, Systolic algorithm for multivariable approximation using
tensor products of basis functions, Parallel Comput. 17 (4–5) (1991) 483–492.
[30] H.-T. Kung, C.E. Leiserson, Algorithms for VLSI processor arrays, in: Introduction to VLSI
Systems, Addison-Wesley Publishing Co, Menlo Park, CA, 1980, pp. 271–292.
[31] S.-Y. Kung, S.-C. Lo, P.S. Lewis, Optimal systolic design for the transitive closure
and the shortest path problem, IEEE Trans. Comput. 36 (5) (1987) 603–614.
[32] L. Lamport, The parallel execution of Do-Loops, Commun. ACM 17 (2) (1974) 83–93.
[33] W.-M. Lin, V.K. Prasanna Kumar, A note on the linear transformation method for systolic
array design, IEEE Trans. Comput. 39 (3) (1990) 393–399.
[34] B. Louka, M. Tchuente, An optimal solution for Gauss–Jordan elimination on 2D systolic
arrays, in: J.V. McCanny, J. McWhirter, E.E. Swartzlander Jr. (Eds.), Systolic Array
Processors, Prentice-Hall, Killarney, Ireland, 1989, pp. 264–274.
[35] P.A. MacMahon, The Diophantine inequality λx ≥ μy, in: George E. Andrews (Ed.),
Collected Papers, Combinatorics, vol. I, The MIT Press, Cambridge, Mass, 1979, pp. 1212–
1232.
[36] P.A. MacMahon, Memoir on the theory of the partitions of numbers, Part II, in: George E.
Andrews (Ed.), Collected Papers, Combinatorics, vol. I, The MIT Press, Cambridge, Mass,
1979, pp. 1138–1188.
[37] P.A. MacMahon, Note on the Diophantine inequality λx ≥ μy, in: G.E. Andrews (Ed.),
Collected Papers, Combinatorics, vol. I, The MIT Press, Cambridge, Mass, 1979, pp. 1233–
1246.
[38] D.I. Moldovan, On the design of algorithms for VLSI systolic arrays, Proc. IEEE 71 (1) (1983)
113–120.
[39] P. Quinton, Automatic synthesis of systolic arrays from uniform recurrent equations, in: Proc.
11th Symp. on Computer Architecture, 1984, pp. 208–214.
[40] S.V. Rajopadhye, S. Purushothaman, R.M. Fujimoto, On synthesizing systolic arrays from
recurrence equations with linear dependencies, in: K.V. Nori (Ed.), Lecture Notes in
Computer Science, Foundations of Software Technology and Theoretical Computer Science,
vol. 241, Springer-Verlag, 1986, pp. 488–503.
[41] Y. Robert, D. Trystram, An orthogonal systolic array for the algebraic path problem, in: Int.
Workshop Systolic Arrays, 1986.
[42] G. Rote, A systolic array algorithm for the algebraic path problem (shortest paths; matrix
inversion), Computing 34 (3) (1985) 191–219.
[43] S. Sahni, Computational related problems, SIAM J. Comput. 3 (1974) 262–279.
[44] C. Scheiman, Mapping fundamental algorithms onto multiprocessor architectures. Ph.D.
thesis, UC Santa Barbara, Department of Computer Science, [email protected], December
1993.
[45] C. Scheiman, P. Cappello, A processor-time minimal systolic array for the 3D rectilinear mesh,
in: Proc. Int. Conf. On Application Specific Array Processors, Strasbourg, France, July 1995,
pp. 26–33.
[46] C. Scheiman, P.R. Cappello, A processor-time minimal systolic array for transitive closure, in:
Proc. Int Conf. on Application Specific Array Processors, IEEE Computer Society, Princeton,
1990, pp. 19–30.
[47] C. Scheiman, P.R. Cappello, A processor-time minimal systolic array for transitive closure,
IEEE Trans. Parallel Distributed Syst. 3 (3) (1992) 257–269.
[48] C.J. Scheiman, P. Cappello, A period-processor-time-minimal schedule for cubical mesh
algorithms, IEEE Trans. Parallel Distributed Syst. 5 (3) (1994) 274–280.
[49] W. Shang, J.A.B. Fortes, Time optimal linear schedule for algorithms with uniform
dependencies, IEEE Trans. Comput. 40 (6) (1991) 723–742.
[50] R.P. Stanley, Linear homogeneous diophantine equations and magic labelings of graphs, Duke
Math. J. 40 (1973) 607–632.
[51] R.P. Stanley, Enumerative Combinatorics, vol. I, Wadsworth & Brooks/Cole, Monterey, CA,
1986.
[52] B. Sturmfels, Gröbner Bases and Convex Polytopes, AMS University Lecture Series,
Providence, RI, 1995.
[53] B. Sturmfels, On vector partition functions, J. Combinatorial Theory, Ser. A 72 (1995) 302–
309.
[54] J.D. Ullman, Computational Aspects of VLSI, Computer Science Press, Inc, Rockville, MD
20850, 1984.
[55] H. Le Verge, C. Mauras, P. Quinton, The ALPHA language and its use for the design of
systolic arrays, J. VLSI Signal Process. 3 (1991) 173–182.
[56] S. Warshall, A theorem on Boolean matrices, J. ACM 9 (Jan) (1962).
[57] D.K. Wilde, A library for doing polyhedral operations, Master's thesis, Corvallis, Oregon,
December 1993. Also published as IRISA technical report PI 785, Rennes, France, December
1993.
[58] Y. Wong, J.-M. Delosme, Optimization of processor count for systolic arrays, Department of
Computer Science, RR-697, Yale University, May 1989.
[59] Y. Wong, J.-M. Delosme, Space-optimal linear processor allocation for systolic array
synthesis, in: V.K. Prasanna, L.H. Canter (Eds.), Proc. 6th Int. Parallel Processing Symp.,
IEEE Computer Society Press, Beverly Hills, 1992, pp. 275–282.
[60] Kai Hwang, Faye A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill
Series in Computer Organization and Architecture, 1985.
[61] A.P. Petitet, J.J. Dongarra, Algorithmic redistribution methods for block cyclic decompositions,
Technical Report ut-cs-98-382, February 1998, http://www.cs.utk.edu/~library/1998.html.
[62] K. Dowd, High Performance Computing, O'Reilly & Associates, Inc., California, 1993.