Applied Mathematics and Computation 168 (2005) 496–518
www.elsevier.com/locate/amc
Optimal systolic array algorithms for tensor product
P.K. Mishra
Department of Applied Mathematics, Birla Institute of Technology, Mesra, Ranchi 835 215, India
Abstract
In this paper we examine the computational complexity of optimal systolic array algorithms for tensor product. We provide a time-minimal schedule that meets the computed processor lower and upper bounds, including the one for tensor product. Our principal contribution is an algorithm, and its implementation, for finding lower and upper bounds via a generating function, and for finding a processor-time-minimal schedule.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Computational complexity; Systolic algorithms; Tensor product; Lower and upper
bounds
1. Introduction
Systolic arrays are important for application-specific systems; they perform well because they are highly concurrent [54,60]. Their design cost is low because, although they comprise a large number of processors, the processors are of only one type (or a small fixed number of types). Interprocessor communication is between nearest neighbors, and hence has short latency. Nearest-neighbor connections also imply that the VLSI circuit is devoted primarily to the processors, with very little devoted to their interconnection. As a consequence, systolic arrays are ideally suited to VLSI technology [22]. Minimizing the amount of time and the number of processors needed to perform an application reduces the application's fabrication and operation costs. This can be important in applications where minimizing the size and energy of the hardware is critical, such as space applications and consumer electronics [16].

0096-3003/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.amc.2004.09.045
E-mail addresses: [email protected], [email protected]
We consider computations suitable for systolic arrays, often called regular
array computations or systems of uniform recurrence equations [27,31,33]. Paral-
lel execution of uniform recurrence equations has been studied extensively,
from at least as far back as 1966 (e.g., [26,32,30,8,38,10,11,39,40,14]). In such
computations, the tasks to be computed are viewed as the nodes of a directed
acyclic graph (dag), where the data dependencies are represented as arcs. Given
a dag G = (N, A), a multiprocessor schedule assigns node t for processing during step s(t) on processor p(t). A valid multiprocessor schedule is subject to two constraints.

Causality: A node can be computed only when its children have been computed at previous steps:

(u, t) ∈ A ⟹ s(u) < s(t).    (1)

Non-conflict: A processor cannot compute two different nodes during the same time step:

s(t) = s(u) ⟹ p(t) ≠ p(u).    (2)

In what follows, we refer to a valid schedule simply as a schedule. A time-minimal schedule for an algorithm completes in S steps, where S is the number of nodes on a longest path in the dag. A time-minimal schedule exposes an algorithm's maximum parallelism: it bounds the number of time steps the algorithm needs when infinitely many processors may be used. A processor-time-minimal schedule is a time-minimal schedule that uses as few processors as any time-minimal schedule for the algorithm. Although it is only one of many performance measures, processor-time-minimality is useful because it measures the minimum number of processors needed to extract the maximum
parallelism from a dag. Being machine-independent, this measure is a more
fundamental measure than those that depend on a particular machine or archi-
tecture. This view prompted several researchers to investigate processor-time-
minimal schedules for families of dags. Processor-time-minimal systolic arrays are easy to devise in an ad hoc manner for 2D systolic algorithms. This apparently is not the case for 3D systolic algorithms. There have been several publications regarding processor-time-minimal systolic arrays for fundamental 3D algorithms. Processor-time-minimal schedules for various fundamental problems have been proposed in the literature: Scheiman and Cappello [9,5,48,45] examine the dag family for matrix product; Louka and Tchuente [34] examine the dag family for Gauss–Jordan elimination; Scheiman and Cappello [46,47] examine the dag family for transitive closure; Benaini, Robert, Trystram, and Rote [2–4,41,42] examine the dag families for the algebraic path problem and Gaussian elimination. Each of the algorithms listed in Table 1 has the property that, in its dag representation, every node is on a longest path. Therefore, for each algorithm listed, its free schedule is its only time-minimal schedule. A processor lower bound for achieving a time-minimal schedule is thus a processor lower bound for achieving maximum parallelism with the algorithm [6,7].
Clauss et al. [12] developed a set of mathematical tools to help find a processor-time-minimal multiprocessor array for a given dag. Another approach to a general solution has been reported by Wong and Delosme [58,59], and by Shang and Fortes [49]. They present methods for obtaining optimal linear schedules. That is, their processor arrays may be sub-optimal, but they get the best linear schedule possible. Darte et al. [14] show that such schedules are close to optimal, even when the constraint of linearity is relaxed. Geometric/combinatorial formulations of a dag's task domain have been used in various contexts in parallel algorithm design as well (e.g., [26,27,32,38,39,19,18,40,12,49,55,59]; see Fortes et al. [17] for a survey of systolic/array algorithm formulations).

We present an algorithm for finding a lower bound on the number of processors needed to achieve a given schedule of an algorithm. The application of this technique is illustrated with a tensor product computation. Then, we apply
Table 1
Some 3D algorithms for which processor-time-minimal systolic arrays are known

Algorithm                  Citation      Time     Number of processors
Algebraic path problem     [3]           5n − 2   n²/3 + O(n)
Gauss–Jordan elimination   [34]          4n       5n²/18 + O(n)
Gaussian elimination       [3]           3n − 1   n²/4 + O(n)
Matrix product             [9,5]         3n − 2   ⌈3n²/4⌉
Transitive closure         [46]          5n − 4   ⌈n²/3⌉
Tensor product             this article  4n − 3   (2n³ + n)/3
the technique to the free schedule of algorithms for matrix product. For this, we provide a processor schedule that meets these processor lower bounds (i.e., a processor-time-minimal schedule), including the one for tensor product. We finish with some general conclusions and mention some open problems.
One strength of our approach is that it centers around the whole algorithm: we are finding processor lower bounds not just for a particular dag, but for a linearly parameterized family of dags, representing the infinitely many problem sizes for which the algorithm works. Thus, the processor lower bound is not a number but a piecewise polynomial function [1,61,62] of the parameter used to express different problem sizes. The formula can then be used to optimize the implementation of the algorithm, not just a particular execution of it.
The processor lower bounds are produced by:

• Formulating the problem of finding, for a particular time step, the number of processors needed by that time step as the problem of counting the integer points in a convex polyhedron.
• Representing this set of points as the set of solutions to a linearly parameterized set of linear Diophantine equations.
• Computing a generating function for the number of such solutions.
• Deriving a formula from the generating function.
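The first two steps above can be illustrated on a toy problem. The sketch below (plain Python; the example polyhedron and the helper name are ours, not from the paper's software) counts, by brute force, the integer points of a small linearly parameterized polyhedron and checks that the count is a polynomial in the size parameter n:

```python
def lattice_points(n):
    # Integer points of the polyhedron 0 <= i, j, k <= n-1, i + j + k = n - 1:
    # the nodes of a toy 3D computation "frozen" at one time step.
    return sum(1 for i in range(n) for j in range(n) for k in range(n)
               if i + j + k == n - 1)

# The count is not a single number but a polynomial in n: n(n + 1)/2.
for n in range(1, 10):
    assert lattice_points(n) == n * (n + 1) // 2
```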
The ability to compute formulae for the number of solutions to a linearly parameterized set of linear Diophantine equations has other applications for nested loops [13], such as finding the number of instructions executed, the number of memory locations touched, and the number of I/O operations.
The strength of our algorithm, namely its ability to produce formulae, comes, of course, at a price: the computational complexity of the algorithm is exponential in the size of the input (the number of bits in the coefficients of the system of Diophantine equations). The algorithm's complexity, however, is quite reasonable given the complexity of the problem: determining whether there are any integer solutions to the system of Diophantine equations is NP-complete [20]. We produce a formula for the number of such solutions.
Clauss and Loechner [13] independently developed an algorithm for the
problem based on Ehrhart polynomials. In [13], they sketch an algorithm for
the problem, using the "polyhedral library" of Wilde [57].
2. Processor lower bounds
We present a general and uniform technique for deriving lower bounds: given a parameterized dag family and a correspondingly parameterized linear schedule, we compute a formula for a lower bound on the number of processors required by the schedule. This is much more general than the analysis of an optimal schedule for a given specific dag. The lower bounds obtained are good; we know of no dag treatable by this method for which the lower bounds are not also upper bounds. We believe this to be the first reported algorithm, and implementation, for automatically generating such formulae.
The nodes of the dag typically can be viewed as lattice points in a convex polyhedron. Adding to these constraints the linear constraint imposed by the schedule itself results in a linear Diophantine system of the form

Az = nb + c,    (3)

where the matrix A and the vectors b and c are integral, but not necessarily non-negative. The number d_n of solutions in non-negative integers z = [z_1, z_2, ..., z_s]^t to this linear system is a lower bound for the number of processors required when the dag corresponds to parameter n. Our algorithm produces (symbolically) the generating function for the sequence d_n and, from the generating function, a formula for the numbers d_n. We do not make use of any special properties of the system that reflect the fact that it comes from a dag. Thus in (3), A can be taken to be an arbitrary r × s integral matrix, and b and c arbitrary r-dimensional integral vectors. As such, we actually solve a more general combinatorial problem: constructing the generating function Σ_{n≥0} d_n tⁿ, and a formula for d_n, given a matrix A and vectors b and c; the lower bound computation is a special case. There is a large body of literature concerning lattice points in convex polytopes, with numerous interesting results: see, for example, Stanley [51] for Ehrhart polynomials (Clauss and Loechner [13] use these), and Sturmfels [52,53] for vector partitions and other mathematical treatments. Our results are based mainly on MacMahon [35,37] and Stanley [50].
3. Tensor product
We examine the dag family for the 4D mesh: the n × n × n × n directed mesh.
This family is fundamental, representing a communication-localized version of
the standard algorithm for tensor product (also known as Kronecker product).
The tensor product is used in many mathematical computations, including multivariable spline blending and image processing [21], multivariable approximation algorithms (used in graphics, optics, and topography) [29], as well as many recursive algorithms [28].

The tensor product of an n × m matrix A and a matrix B is

          [ a_11 B   a_12 B   ...   a_1m B ]
A ⊗ B =   [   ...      ...    ...     ...  ]    (4)
          [ a_n1 B   a_n2 B   ...   a_nm B ]
The 4D mesh also represents other algorithms, such as the longest common subsequence problem [24,23] for 4 strings, and matrix comparison (an extension of tuple comparison [25]).
3.1. The dag
The 4D mesh can be defined as follows: Gn = (Nn, An), where

Nn = {(i, j, k, l) | 0 ≤ i, j, k, l ≤ n − 1},    (5)

An = {[(i, j, k, l), (i', j', k', l')] | (i, j, k, l) ∈ Nn, (i', j', k', l') ∈ Nn,    (6)

and exactly one of the following conditions holds}:

1. i' = i + 1, j' = j, k' = k, l' = l;    (7)
2. j' = j + 1, i' = i, k' = k, l' = l;    (8)
3. k' = k + 1, i' = i, j' = j, l' = l;    (9)
4. l' = l + 1, i' = i, j' = j, k' = k.    (10)
3.2. The parametric linear Diophantine system of equations
The computational nodes are defined by non-negative integral 4-tuples (i, j, k, l) satisfying

i ≤ n − 1,    (11)
j ≤ n − 1,    (12)
k ≤ n − 1,    (13)
l ≤ n − 1.    (14)

Introducing non-negative integral slack variables s1, s2, s3, s4 ≥ 0, we obtain the equivalent linear Diophantine system describing the computational nodes as

i + s1 = n − 1,
j + s2 = n − 1,
k + s3 = n − 1,
l + s4 = n − 1.    (15)
A linear schedule for the corresponding dag is given by

s(i, j, k, l) = i + j + k + l + 1.    (16)

For this problem s ranges from 1 to 4n − 3. The computational nodes at about the halfway point of the schedule satisfy the additional constraint

i + j + k + l = 2n − 2.    (17)

Adding this constraint, we obtain the augmented Diophantine system

i + j + k + l = 2n − 2,
i + s1 = n − 1,
j + s2 = n − 1,
k + s3 = n − 1,
l + s4 = n − 1.    (18)

Therefore, a lower bound for the number of processors needed for the Tensor Product problem is the number of solutions to (18). The corresponding Diophantine system is

az = nb + c,    (19)

where

    [ 1 1 1 1 0 0 0 0 ]        [ 2 ]        [ −2 ]
    [ 1 0 0 0 1 0 0 0 ]        [ 1 ]        [ −1 ]
a = [ 0 1 0 0 0 1 0 0 ],  b =  [ 1 ],  c =  [ −1 ].    (20)
    [ 0 0 1 0 0 0 1 0 ]        [ 1 ]        [ −1 ]
    [ 0 0 0 1 0 0 0 1 ]        [ 1 ]        [ −1 ]
3.3. The Mathematica program input and output
The DiophantineGF program, run on this data, gives:

In[1]:= << DiophantineGF.m
Loaded. Type "help" for instructions, "example" for examples.

In[2]:= a = {{1,1,1,1,0,0,0,0}, {1,0,0,0,1,0,0,0}, {0,1,0,0,0,1,0,0},
             {0,0,1,0,0,0,1,0}, {0,0,0,1,0,0,0,1}};

In[3]:= b = {2,1,1,1,1}; c = {-2,-1,-1,-1,-1};

In[4]:= DiophantineGF[a, b, c]

Out[4] = t(1 + t)² / (1 − t)⁴

In[5]:= formula

Binomial Formula: C[n, 3] + 2 C[1 + n, 3] + C[2 + n, 3]
Power Formula: n(2n² + 1)/3

Therefore, (2n³ + n)/3 is a processor lower bound for the Tensor Product problem.
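For small n, this lower bound can be confirmed by exhaustive enumeration; the sketch below (plain Python, ours) counts the solutions of system (18) directly, i.e., the mesh nodes on the time slice i + j + k + l = 2n − 2:

```python
def central_layer_count(n):
    """Number of nodes (i, j, k, l) of the n x n x n x n mesh with
    i + j + k + l = 2n - 2, i.e., the number of solutions of (18)."""
    return sum(1 for i in range(n) for j in range(n)
                 for k in range(n) for l in range(n)
               if i + j + k + l == 2 * n - 2)

# Matches the formula (2n^3 + n)/3 produced by the generating function.
for n in range(1, 8):
    assert central_layer_count(n) == (2 * n**3 + n) // 3
```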
4. General formulation
We now generalize this example and consider the problem of computing a lower bound for the number of processors needed to satisfy a given linear schedule. That is, we show how to automatically construct a formula for the number of lattice points inside a linearly parameterized family of convex polyhedra, by automatically constructing a formula for the number of solutions to the corresponding linearly parameterized system of linear Diophantine equations. The algorithm for doing this, and its implementation, is, we believe, a significant contribution.

The use of linear Diophantine equations is well motivated: the computations of an inner loop are typically defined over a set of indices that can be described as the lattice points in a convex polyhedron [56]. Indeed, in two languages, SDEF [15] and ALPHA [55], one expressly defines domains of computation as the integer points contained in some programmer-specified convex polyhedron.
The general setting exemplified by the Tensor Product problem is as follows. Suppose a (also denoted by A) is an r × s integral matrix, and b and c are r-dimensional integral vectors. Suppose further that, for every n ≥ 0, the linear Diophantine system

az = nb + c,    (21)

i.e.,

a_11 z_1 + a_12 z_2 + ... + a_1s z_s = b_1 n + c_1,
a_21 z_1 + a_22 z_2 + ... + a_2s z_s = b_2 n + c_2,
...
a_r1 z_1 + a_r2 z_2 + ... + a_rs z_s = b_r n + c_r,    (22)

in the non-negative integral variables z_1, z_2, ..., z_s, has a finite number of solutions. Let d_n denote the number of solutions for n. The generating function of the sequence d_n is

f(t) = Σ_{n≥0} d_n tⁿ.    (23)

For a linear Diophantine system of the form (22), f(t) is always a rational function, and we provide an algorithm to compute f(t) symbolically. The Mathematica program implementing the algorithm also constructs a formula for the numbers d_n from this generating function.
Given a nested for loop, the procedure is informally as follows:

1. Write down the node space as a system of linear inequalities. The loop bounds must be affine functions of the loop indices. The domain of computation is represented by the set of lattice points inside the convex polyhedron described by this system of linear inequalities.
2. Eliminate unnecessary constraints by translating the loop indices (so that 0 ≤ i ≤ n − 1 as opposed to 1 ≤ i ≤ n, for example). The reason for this is that the inequality 0 ≤ i is implicit in our formulation, whereas 1 ≤ i introduces an additional constraint.
3. Transform the system of inequalities to a system of equalities by introducing non-negative slack variables, one for each inequality.
4. Augment the system with a linear schedule for the associated dag, "frozen" at some intermediate time value s = s(n).
5. Run the program DiophantineGF.m on the resulting data. The program calculates the rational generating function

f(t) = Σ d_n tⁿ,    (24)

where d_n is the number of solutions to the resulting linear system of Diophantine equations, and produces a formula for d_n.
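Steps 1-4 can be mechanized. The sketch below (plain Python, ours; the function name is hypothetical) builds the augmented system (a, b, c) for a depth-d loop nest with bounds 0..n−1, with the schedule frozen at i_1 + ... + i_d = 2n − 2, and checks that for depth 4 it reproduces system (20):

```python
def build_system(depth):
    """Slack-variable Diophantine system a.z = n*b + c for a depth-nested
    loop with bounds 0..n-1, augmented with the frozen linear schedule
    i_1 + ... + i_d = 2n - 2 (the tensor-product choice)."""
    # Schedule row: coefficient 1 on each loop index, 0 on each slack variable.
    rows = [[1] * depth + [0] * depth]
    b, c = [2], [-2]
    # One row i_r + s_r = n - 1 per loop index (step 3).
    for r in range(depth):
        row = [0] * (2 * depth)
        row[r] = 1           # loop index i_r
        row[depth + r] = 1   # slack variable s_r
        rows.append(row)
        b.append(1)
        c.append(-1)
    return rows, b, c

a, b, c = build_system(4)
assert a == [[1,1,1,1,0,0,0,0],
             [1,0,0,0,1,0,0,0],
             [0,1,0,0,0,1,0,0],
             [0,0,1,0,0,0,1,0],
             [0,0,0,1,0,0,0,1]]
assert b == [2,1,1,1,1] and c == [-2,-1,-1,-1,-1]
```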
4.1. The algorithm
We demonstrate the algorithm on a specific instance, and sketch its proof.
Consider the linear Diophantine system

z1 − 2 z2 = n,
z1 + z2 = 2n,    (25)

in which z1 and z2 are non-negative integers. Let d_n denote the number of solutions to (25). Associate indeterminates λ1 and λ2 to the first and second equations, respectively, and indeterminates t1 and t2 to the first and second columns of the system. Consider the product of geometric series
R = 1/((1 − λ1 λ2 t1)(1 − λ1⁻² λ2 t2)) = ( Σ_{a1≥0} (λ1 λ2 t1)^{a1} )( Σ_{a2≥0} (λ1⁻² λ2 t2)^{a2} ),    (26)

where the exponents of λ1 and λ2 in the first factor are the coefficients in the first column, and the exponents of λ1 and λ2 in the second factor are the coefficients in the second column. Individual terms arising from this product are of the form

λ1^{a1 − 2a2} λ2^{a1 + a2} t1^{a1} t2^{a2},    (27)
where a1 and a2 are non-negative integers. Following MacMahon [36], we make use of the operator Ω, which picks out those terms (27) in the power series expansion whose exponents of λ1 and λ2 are both equal to zero (this is the λ-free part of the expansion). Thus, the contribution of the term in (27) to Ω(R) is non-zero if and only if the exponents of λ1 and λ2 are equal to zero. If this is the case, the contribution is t1^{a1} t2^{a2}, and z1 = a1, z2 = a2 is then a solution to the homogeneous system

z1 − 2 z2 = 0,
z1 + z2 = 0.    (28)

(Note that there is only a single solution to (28) in this case, but this does not affect the general nature of the demonstration of the algorithm on this example.) This means, in particular, that what MacMahon calls the "crude" generating function of the solutions to the homogeneous system (28) is

1/((1 − λ1 λ2 t1)(1 − λ1⁻² λ2 t2))    (29)

and

Ω( 1/((1 − λ1 λ2 t1)(1 − λ1⁻² λ2 t2)) ) = Σ t1^{a1} t2^{a2},    (30)

where the summation is over all solutions z1 = a1 and z2 = a2 of (28). Let

Rn = λ1⁻ⁿ λ2⁻²ⁿ R,    (31)

where the exponents of λ1 and λ2 are the negatives of the right-hand sides of the first and second equations of (25), respectively. Then

Ω(Rn) = Σ t1^{a1} t2^{a2},    (32)

where now the summation is over all non-negative integral solutions z1 = a1 and z2 = a2 of (25), since generic terms arising from the expansion of Rn are now of the form

λ1^{a1 − 2a2 − n} λ2^{a1 + a2 − 2n} t1^{a1} t2^{a2}.    (33)
If we let t1 = t2 = 1, then Ω(Rn) specializes to the number of solutions d_n of (25). Let L denote the substitution operator that sets each ti equal to 1. Then

d_n = L Ω(Rn)    (34)

and the operator Ω commutes both with the L operation and with addition of series. Thus,

f(t) = Σ_{n≥0} L Ω(Rn) tⁿ    (35)
     = L Ω( Σ_{n≥0} Rn tⁿ )    (36)
     = Ω( [1/((1 − λ1 λ2)(1 − λ1⁻² λ2))] Σ_{n≥0} λ1⁻ⁿ λ2⁻²ⁿ tⁿ ).    (37)

Since

Σ_{n≥0} λ1⁻ⁿ λ2⁻²ⁿ tⁿ = 1/(1 − λ1⁻¹ λ2⁻² t),    (38)

the generating function f(t) can be obtained by applying the operator Ω to the crude generating function

F = 1/((1 − λ1 λ2)(1 − λ1⁻² λ2)(1 − λ1⁻¹ λ2⁻² t)).    (39)

Now we make use of an identity that appears in Stanley [50] for the computation of the homogeneous case above, namely

1/((1 − A)(1 − B)) = 1/((1 − AB)(1 − A)) + 1/((1 − AB)(1 − B)) − 1/(1 − AB).    (40)

We demonstrate the use of this identity on the example at hand. Taking the first two factors of (39) as (1 − A)⁻¹ and (1 − B)⁻¹ (i.e., A = λ1 λ2, B = λ1⁻² λ2) and using (40), we obtain

F = 1/((1 − λ1⁻¹ λ2²)(1 − λ1 λ2)(1 − λ1⁻¹ λ2⁻² t))
  + 1/((1 − λ1⁻¹ λ2²)(1 − λ1⁻² λ2)(1 − λ1⁻¹ λ2⁻² t))
  − 1/((1 − λ1⁻¹ λ2²)(1 − λ1⁻¹ λ2⁻² t)),    (41)

which we can write as

F = F1 + F2 − F3,    (42)
where F1, F2, and F3 denote the three summands above. By additivity,

f(t) = Ω(F) = Ω(F1) + Ω(F2) − Ω(F3).    (43)

Continuing this way by using the identity (40), this time on F3 with (1 − A)⁻¹ and (1 − B)⁻¹ as its two factors, we obtain the expansion

F3 = 1/((1 − λ1⁻² t)(1 − λ1⁻¹ λ2²)) + 1/((1 − λ1⁻² t)(1 − λ1⁻¹ λ2⁻² t)) − 1/(1 − λ1⁻² t)    (44)
   = F31 + F32 − F33.    (45)

Call a product of the form

1/((1 − A)(1 − B) ... (1 − Z))    (46)

that may arise during this process uniformly signed if the exponents of λ1 that appear in A, B, ..., Z are either all non-negative or all non-positive; the exponents of λ2 that appear in A, B, ..., Z are either all non-negative or all non-positive; and so on. Clearly, if U is such a uniformly signed product, then Ω(U) is obtained from U by discarding the factors that are not purely functions of t, as there can be no "cross cancellation" of any of the terms coming from the different expansions into geometric series of the factors (1 − A)⁻¹, (1 − B)⁻¹, ..., (1 − Z)⁻¹ of U.

The idea, then, is to use identity (40) repeatedly, on pairs of appropriate factors, in such a way that the resulting products of the form (46) are all uniformly signed. The contribution of a uniformly signed product to f(t) is simply the product of the terms in it that are functions of t only; all other factors can be ignored. Each of the summands of F3 given in (45) above, for example, is uniformly signed. Since none of them contains a factor that is a pure function of t, the contribution of each is zero.
The problem is to pick the (1 − A)⁻¹, (1 − B)⁻¹ pairs at each step appropriately, to make sure that the process eventually ends with uniformly signed products only. This cannot be done arbitrarily, however. For example, in the application of the identity (40) to

1/((1 − λ1⁻¹ λ2)(1 − λ1² λ2)(1 − λ1 λ2⁻¹)),    (47)

with

1 − A = 1 − λ1⁻¹ λ2    (48)

and

1 − B = 1 − λ1² λ2,    (49)

(in which the λ1 exponents have opposite signs), one of the three terms produced by the identity, to be further processed, is

1/((1 − λ1⁻¹ λ2)(1 − λ1 λ2²)(1 − λ1 λ2⁻¹)),    (50)

which is of the same mixed-sign form as (47). In particular, the weight argument in Stanley [50] does not yield an algorithm unless the λi are processed to completion in a fixed ordering of the indices i (e.g., first all exponents of λ1 are made of the same sign, then those of λ2, etc.).

Accordingly, we use the following criterion: given a term of the form (46), pick the λi with the smallest i for which both a negative and a positive exponent appear among A, B, ..., Z. Take the two extremes (i.e., the maximum positive and minimum negative exponents) of such opposite-signed factors (1 − A)⁻¹ and (1 − B)⁻¹ of the current term in (46), and apply identity (40) with this choice of A and B. This computational process results in a ternary tree whose leaves, after the application of the operator Ω, are functions of t only. The generating function f(t) can then be read off as the (signed) sum of the functions that appear at the leaf nodes, after the functions of t at the leaf nodes of the resulting ternary tree are summed up and the necessary algebraic simplifications are carried out. For this example, the generating function is

f(t) = 1/((1 − t)(1 + t + t²)).    (51)
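The generating function (51) equals 1/(1 − t³), so d_n is 1 when 3 divides n and 0 otherwise. A brute-force count of the solutions of (25) (plain Python, ours) confirms this:

```python
def d(n):
    # Count non-negative integer solutions of z1 - 2*z2 = n, z1 + z2 = 2n.
    # z1 + z2 = 2n bounds both variables by 2n, so the search is finite.
    return sum(1 for z1 in range(2 * n + 1) for z2 in range(2 * n + 1)
               if z1 - 2 * z2 == n and z1 + z2 == 2 * n)

# Coefficients of 1/(1 - t^3): 1, 0, 0, 1, 0, 0, 1, ...
for n in range(13):
    assert d(n) == (1 if n % 3 == 0 else 0)
```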
In the case above, c = 0. Now we consider the more general case with c ≠ 0. For these instances the description and the proof of the algorithm are not much harder, but the extra computational effort required justifies the use of a symbolic algebra package.

As an example, consider the Diophantine system

z1 − 2 z2 = n − 2,
z1 + z2 = 2n + 3.    (52)
As before, let d_n be the number of solutions to (52) in non-negative integers z1, z2, and let f(t) be the generating function of the d_n. As in the derivation of the identity (37) for f(t), this time we obtain

f(t) = Ω( [1/((1 − λ1 λ2)(1 − λ1⁻² λ2))] Σ_{n≥0} λ1⁻ⁿ⁺² λ2⁻²ⁿ⁻³ tⁿ ).    (53)

Since

Σ_{n≥0} λ1⁻ⁿ⁺² λ2⁻²ⁿ⁻³ tⁿ = λ1² λ2⁻³ / (1 − λ1⁻¹ λ2⁻² t),    (54)

the generating function f(t) is obtained by applying the operator Ω to the crude generating function

F = λ1² λ2⁻³ / ((1 − λ1 λ2)(1 − λ1⁻² λ2)(1 − λ1⁻¹ λ2⁻² t)).    (55)
Now we proceed as before using the identity (40), ignoring the numerator for the time being. It is no longer true that there can be no "cross cancellation" among the terms coming from the different expansions into geometric series of the factors (1 − A)⁻¹, (1 − B)⁻¹, ..., (1 − Z)⁻¹ in a product U of the form (46), even if the product is uniformly signed. It could be that the exponents of λ1 that appear in U are all negative, and the exponents of λ2 that appear in U are all positive, and yet there can be λ-free terms arising from the expansions of the products that involve λ's, since the numerator λ1² λ2⁻³ can cancel terms of the form λ1⁻² λ2³ tᵏ that may be produced when we expand the factors into geometric series and multiply. The application of Ω would then contribute tᵏ from this term to the final result coming from U, for example. The important observation is that the geometric series expansion of the terms that involve λ in U need not be carried out past powers of λ1 larger than 2, or past powers of λ2 smaller than −3. This means that we need to keep track of only a polynomial in λ1, λ2, and t before the application of Ω to find the λ-free part contributed by this leaf node. In this case, this contribution may involve a polynomial in t as well. Therefore, when c ≠ 0, we need to calculate with truncated Taylor expansions at the leaf nodes of the computation tree. It is this aspect of the algorithm that is handled most efficiently (in terms of coding effort) by a symbolic algebra package such as Mathematica.
5. Complexity and remarks
The number of leaves in the generated ternary tree is exponential in

n = Σ_{a_i} |a_i|,    (56)

where {a_i} is the set of coefficients describing the set of Diophantine equations. The depth of recursion can be reduced somewhat when the columns to be used are picked carefully. It is also possible to prune the tree when the input vector c determines that there can be no λ-free terms resulting from the current matrix (e.g., some row is all strictly positive or all strictly negative with c = 0, or the row elements are weakly negative but the corresponding c_i is positive, etc.). Furthermore, the set of coefficients describing the Diophantine system coming from an array computation is not unique. Translating the polyhedron, and omitting superfluous constraints (i.e., those not in its transitive reduction), reduces the algorithm's work. Additional preprocessing may be possible (e.g., via some unitary transform).
The fact that the algorithm has worst-case exponential running time is not surprising, however; the simpler computation, "Are any processors scheduled for a particular time step?", which is equivalent to "Is a particular coefficient of the series expansion of the generating function non-zero?", is already known to be an NP-complete problem [43,20]. This computational complexity is further ameliorated by the observation that, since a formula can be automatically produced from the generating function, it needs to be constructed only once for a given algorithm. In practice, array algorithms typically have a description that is sufficiently succinct to make this automated formula production feasible.
5.1. Matrix product
Program fragment: The standard program fragment for computing the matrix product C = A·B, where A and B are given n × n matrices, is given below:

for i = 0 to n−1 do
  for j = 0 to n−1 do
    C[i, j] ← 0;
    for k = 0 to n−1 do
      C[i, j] ← C[i, j] + A[i, k]·B[k, j]
    endfor;
  endfor;
endfor;
Dag: The cube-shaped 3D mesh (i.e., the n × n × n mesh) can be defined as follows:

G_{n×n×n} = (N, A),    (57)

where

N = {(i, j, k) | 1 ≤ i, j, k ≤ n},    (58)
A = {[(i, j, k), (i', j', k')]},    (59)

where exactly one of the following conditions holds:

1. i' = i + 1;    (60)
2. j' = j + 1;    (61)
3. k' = k + 1;    (62)

for 1 ≤ i, j, k ≤ n.
5.2. Lower bound
Parameterized linear Diophantine system of equations: The computational nodes are defined by non-negative integral triplets (i, j, k) satisfying

i ≤ n − 1,    (63)
j ≤ n − 1,    (64)
k ≤ n − 1.    (65)

Mathematica input/output: The DiophantineGF run for even n gives:
In[1]:= << DiophantineGF.m
Loaded. Type "help" for instructions, "example" for examples.

In[2]:= a = {{2,2,2,0,0,0}, {1,0,0,1,0,0}, {0,1,0,0,1,0}, {0,0,1,0,0,1}};

In[3]:= b = {3,1,1,1}; c = {-2,-1,-1,-1};

In[4]:= DiophantineGF[a, b, c]

Out[4] = −3t²(1 + t)² / ((−1 + t)³(1 + t)³)
In[5]:= formula

The formula returned is a linear combination, with integer coefficients, of binomial coefficients of the forms C[2 + (n − k)/2, 2] and C[n − k, 2], divided by 16.
Recall that C[x, k] denotes the binomial coefficient

C[x, k] = x! / (k!(x − k)!),    (66)

when x is a non-negative integer, and zero otherwise. This means that in the above formula C[2 + (n − 9)/2, 2], for example, vanishes for even n. In this way the formula simplifies to

(3/4)n²    (n even).
For odd n we obtain the generating function

Out[4] = −6t³ / ((−1 + t)³(1 + t)³)

and a similar formula, which simplifies to

(3/4)n² − 3/4    (n odd).

These cases can be combined to obtain the processor lower bound ⌊3n²/4⌋.
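Both cases can be checked by direct enumeration of the frozen time slice. The sketch below (plain Python, ours; the choice of the near-middle slice i + j + k = ⌊(3n − 1)/2⌋ is our reconstruction of the slice used above, which for even n is exactly 2(i + j + k) = 3n − 2) counts the mesh nodes on that slice:

```python
def slice_count(n):
    """Nodes (i, j, k) of the n x n x n mesh on the near-middle time
    slice i + j + k = floor((3n - 1)/2) of the schedule t = i + j + k."""
    s = (3 * n - 1) // 2
    return sum(1 for i in range(n) for j in range(n) for k in range(n)
               if i + j + k == s)

# (3/4)n^2 for even n and (3/4)n^2 - 3/4 for odd n, i.e., floor(3n^2/4).
for n in range(2, 12):
    assert slice_count(n) == (3 * n * n) // 4
```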
5.3. Upper bound
Schedule: The map m : N → Z³ (i.e., from nodes to processor-time) can be defined formally as

(time, space1, space2) = (t(i, j, k), s1(i, j, k), s2(i, j, k)),    (67)

where

t(i, j, k) = i + j + k − 2,    (68)

s1(i, j) = (i + j − [n/2] − 1) mod n,    (69)

s2(i, j) = i − j,      if n is even or [n/2] + 1 ≤ i + j ≤ [3n/2];
         = i − j + 1,  if n is odd and i + j < [n/2] + 1;
         = i − j − 1,  if n is odd and i + j > [3n/2].    (70)
The directed 4D mesh clearly has 4n − 3 nodes on a longest path, so any time-minimal schedule for this computation completes in 4n − 3 steps. Moreover, any time-minimal schedule of the 4D mesh requires at least (2/3)n³ + n/3 processors. We now show that there is a systolic array that achieves this lower bound, i.e., a processor-time-minimal multiprocessor schedule.
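The 4n − 3 step count can be verified by dynamic programming over the 4D mesh for small n (plain Python, ours):

```python
from itertools import product

def longest_path_nodes(n):
    """Number of nodes on a longest path of the n^4 directed mesh.
    Sorting nodes by coordinate sum gives a topological order."""
    best = {}
    for node in sorted(product(range(n), repeat=4), key=sum):
        preds = []
        for axis in range(4):
            if node[axis] > 0:
                p = list(node)
                p[axis] -= 1
                preds.append(best[tuple(p)])
        best[node] = 1 + max(preds, default=0)
    return max(best.values())

# Longest path: 4(n - 1) arcs, hence 4n - 3 nodes.
for n in range(1, 6):
    assert longest_path_nodes(n) == 4 * n - 3
```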
We map the 4D mesh onto a 3D mesh of processors with the map m : Nn → N⁴. Given a mesh node (i, j, k, l) ∈ Nn, m(i, j, k, l) produces a time step (its first component) and a processor location (its last 3 components). The map m is defined as follows:
  m(i,j,k,l) = (time, p1, p2, p3) = (s(i,j,k,l), p1(i,j,k), p2(i,j,k), p3(i,j,k)),  (71)

where

  s(i,j,k,l) = i + j + k + l,                                        (72)

  p1(i,j,k) = (i + j + k - 1) mod n,                                 (73)

  p2(i,j,k) = 2j + k - n + 2,   if i + j + k < n - 1;
              i - j,            if n - 1 <= i + j + k <= 2n - 2;     (74)
              2j + k - 2n + 1,  if 2n - 2 < i + j + k;

  p3(i,j,k) = i + j,            if i + j + k < n - 1;
              k,                if n - 1 <= i + j + k <= 2n - 2;     (75)
              i + j - n + 1,    if 2n - 2 < i + j + k.
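Taking the mesh indices to be 0-based (so i + j + k ranges over 0..3n-3, an assumption consistent with the region boundaries n - 1 and 2n - 2), the mapping can be checked mechanically: it should assign distinct (time, processor) tuples to all n^4 nodes, use exactly 4n - 3 time steps, and occupy exactly (2/3)n^3 + n/3 processors. A hedged Python sketch:

```python
from itertools import product

def m(i, j, k, l, n):
    """Eqs. (71)-(75): map 4D mesh node (0-based) to (time, p1, p2, p3)."""
    s = i + j + k + l                      # Eq. (72): time step
    p1 = (i + j + k - 1) % n               # Eq. (73): processor layer
    c = i + j + k
    if c < n - 1:                          # lower tetrahedron
        p2, p3 = 2 * j + k - n + 2, i + j
    elif c <= 2 * n - 2:                   # middle region (processor space)
        p2, p3 = i - j, k
    else:                                  # upper tetrahedron
        p2, p3 = 2 * j + k - 2 * n + 1, i + j - n + 1
    return s, p1, p2, p3

for n in range(2, 6):
    nodes = list(product(range(n), repeat=4))
    images = [m(*node, n) for node in nodes]
    assert len(set(images)) == len(nodes)                 # conflict-free
    assert len({img[0] for img in images}) == 4 * n - 3   # time-minimal
    assert len({img[1:] for img in images}) == (2 * n**3 + n) // 3  # bound met
```

The three assertions together verify that the schedule is processor-time-minimal for these small mesh sizes.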
We now discuss how this mapping [44] was derived and also give a geomet-
rical interpretation.
One way to find a processor-time-minimal schedule for a 4-dimensional
cubical mesh is as follows:
• First, we assume that the n points (i,j,k,0) to (i,j,k,n - 1) (for every fixed
i,j,k) are mapped to the same processor. This gives us n^3 columns of nodes to be mapped.
• As shown previously, there are at least (2/3)n^3 + n/3 nodes which must
be computed at the same time step, in a time-minimal schedule. Therefore,
there are (2/3)n^3 + n/3 columns containing these nodes (since the
time steps corresponding to each node in a column are unique). We choose
these (2/3)n^3 + n/3 columns as our processor space. This accomplishes 2
things:
(1) It assigns each of these columns of nodes to the processor that computes it, and
(2) It determines the shape of the 3-dimensional processor array (before any
topological changes).
• We then map all of the remaining n^3 - ((2/3)n^3 + n/3) = (n^3 - n)/3 columns to the processors, without violating the scheduling constraints.
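The column counts in these steps are easy to confirm: with 0-based indices, exactly (2/3)n^3 + n/3 of the n^3 columns lie in the middle region n - 1 <= i + j + k <= 2n - 2, and the remainder split evenly between the two tetrahedra. A short Python check (mine, not the paper's):

```python
from itertools import product

def count_columns(n):
    """Classify the n^3 columns (i,j,k), 0-based, into the three regions."""
    lower = middle = upper = 0
    for i, j, k in product(range(n), repeat=3):
        c = i + j + k
        if c < n - 1:
            lower += 1       # below the hyperplane i+j+k = n-1
        elif c <= 2 * n - 2:
            middle += 1      # the processor space
        else:
            upper += 1       # above the hyperplane i+j+k = 2n-2
    return lower, middle, upper

for n in range(2, 8):
    lower, middle, upper = count_columns(n)
    assert middle == (2 * n**3 + n) // 3      # columns kept as processors
    assert lower == upper == (n**3 - n) // 6  # one tetrahedron each
    assert lower + upper == (n**3 - n) // 3   # columns left to remap
```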
The mapping m was derived by following the 3 steps above. The last step has
many valid solutions. We have chosen one that is convenient. The mapping m
has the following geometrical interpretation:
As previously mentioned, we collapse the 4-dimensional space to 3 dimen-
sions, by only concerning ourselves with (i, j,k) columns, where each of the n
points in these columns is mapped to the same processor. For that reason, l
is not a factor in the space mapping.
We divide this 3-dimensional cube of columns into 3 regions: the tetrahedron formed by [the convex hull of] the columns below the hyperplane i + j + k = n - 1, the tetrahedron formed by the columns above the hyperplane i + j + k = 2n - 2, and the remaining middle region. The middle region is the processor space.
The middle region has 2 triangular faces, along the planes i + j + k = n � 1
and i + j + k = 2n � 2. These 2 faces become the bottom and top layers of pro-
cessors. The planes between these, defined by i + j + k = c, n � 1 < c < 2n � 2,
make up the remaining layers of processors, for a total of n layers. Columns are
mapped to the correct layer by the mapping function p1(i,j,k).
We map the lower tetrahedron into the processor space by first translating it
along the vector (1,1,1) until its upper face, defined by the plane i + j +
k = n � 2, lies in the same plane as the middle region�s upper face, defined
by the plane i + j + k = 2n - 2. We then rotate the tetrahedron 180 degrees
about the line i - j = k, and translate it again so that the integer points of
the tetrahedron lie on integer points of the middle region. This second transla-
tion is actually done on a plane-by-plane basis for the planes i + j + k = c, to
assure a convenient mapping. These transformations are done by the mapping functions p2(i,j,k) and p3(i,j,k).
The upper tetrahedron is mapped to the processor space similarly, except
that it is translated down to the lower part of the middle region.
6. Conclusion
Given a nested loop program whose underlying computation dag has nodes representable as lattice points in a convex polyhedron, and a multiprocessor
schedule for these nodes that is linear in the loop indices, we produce a formula
for the number of lattice points in the convex polyhedron that are scheduled
for a particular time step (which is a lower bound on the number of processors
needed to satisfy the schedule). This is done by constructing a system of para-
metric linear Diophantine equations whose solutions represent the lattice
points of interest. Our principal contribution to lower bounds is the algorithm
and its implementation for constructing the generating function from which a formula for the number of these solutions is produced.
Several examples illustrate the relationship between nested loop programs
and Diophantine equations, and are annotated with the output of a Mathe-
matica program that implements the algorithm. The algorithmic relationship
between the Diophantine equations and the generating function is illustrated
with a simple example. A proof of the algorithm's correctness is sketched while
illustrating its steps. The algorithm's time complexity is exponential. However,
this computational complexity should be seen in light of two facts:

• Deciding if a time step has any nodes associated with it is NP-complete; we
construct a formula for the number of such nodes.
• This formula is a processor lower bound, not just for one instance of a sche-
dule computation but for a parameterized family of such computations.
In bounding the number of processors needed to satisfy a linear multi-pro-
cessor schedule for a nested loop program, we actually derived a solution to a
more general linear Diophantine problem. This leads to some interesting com-
binatorial questions of rationality and algorithm design based on more general
systems of Diophantine equations.

Another direction of research concerns optimizing processor-time-minimal
schedules: finding a processor-time-minimal schedule with the highest through-
put, that is, a period-processor-time-minimal schedule. While such a schedule has been
found and proven optimal in the case of square matrix product, this area is
otherwise open. Another area concerns k-dimensional meshes. We have gener-
alized the square mesh lower bounds, yielding a bound for a square mesh of
any fixed dimension k.
References
[1] A.V. Aho, J.E. Hopcroft, J.D. Ullman, The Design and Analysis of Computer Algorithms,
Addison-Wesley Publishing Co, Reading, Mass, 1974.
[2] A. Benaini, Y. Robert, Space-time-minimal systolic arrays for Gaussian elimination and the
algebraic path problem, Parallel Comput. 15 (1990) 211–225.
[3] A. Benaini, Y. Robert, Spacetime-minimal systolic arrays for gaussian elimination and the
algebraic path problem, in: Proc. Int. Conf. On Application Specific Array Processors, IEEE
Computer Society, Princeton, 1990, pp. 746–757.
[4] A. Benaini, Y. Robert, B. Tourancheau, A new systolic architecture for the algebraic path
problem, in: J.V. McCanny, J. McWhirter, E.E. Swartzlander Jr. (Eds.), Systolic Array
Processors, Prentice-Hall, Killarney, Ireland, 1989, pp. 73–82.
[5] P. Cappello, A processor-time-minimal systolic array for cubical mesh algorithms, IEEE Trans.
Parallel Distributed Syst. 3 (1) (1992) 4–13 (Erratum: 3 (3) (1992) 384).
[6] P. Cappello, Omer Egecioglu, Processor lower bound formulas for array computations and
parametric Diophantine systems, Int. J. Found. Comput. Sci. 9 (4) (1998) 351–375.
[7] P. Cappello, Omer Egecioglu, C. Scheiman, Processor-time-optimal systolic arrays, J. Parallel
Algorithms Appl.
[8] P.R. Cappello, VLSI architectures for digital signal processing. Ph.D. thesis, Princeton
University, Princeton, NJ, October 1982.
[9] P.R. Cappello, A spacetime-minimal systolic array for matrix product, in: J.V. McCanny, J.
McWhirter, E.E. Swartzlander Jr. (Eds.), Systolic Array Processors, Prentice-Hall, Killarney,
Ireland, 1989, pp. 347–356.
[10] P.R. Cappello, K. Steiglitz, Unifying VLSI array design with linear transformations, in: H.J.
Siegel, L. Siegel, (Eds.), Proc. Int. Conf. on Parallel Processing, Bellaire, MI, August 1983, pp.
448–457.
[11] P.R. Cappello, K. Steiglitz, Unifying VLSI array design with linear transformations of space-
time, in: F.P. Preparata (Ed.), Advances in Computing Research, VLSI Theory, vol. 2, JAI
Press, Inc., Greenwich, CT, 1984, pp. 23–65.
[12] Ph. Clauss, C. Mongenet, G.R. Perrin, Calculus of space-optimal mappings of systolic
algorithms on processor arrays, in: Proc. Int. Conf. on Application Specific Array Processors,
IEEE Computer Society, Princeton, 1990, pp. 4–18.
[13] P. Clauss, V. Loechner, Parametric analysis of polyhedral iteration spaces, J. VLSI Signal
Process. 19 (1998) 179–194.
[14] A. Darte, L. Khachiyan, Y. Robert, Linear scheduling is close to optimal, in: J. Fortes, E.
Lee, T. Meng (Eds.), Application Specific Array Processors, IEEE Computer Society Press,
1992, pp. 37–46.
[15] B.R. Engstrom, P.R. Cappello, The SDEF programming system, J. Parallel Distributed
Comput. 7 (1989) 201–231.
[16] R.W. Floyd, Algorithm 97: shortest path, Commun. ACM 5 (1962).
[17] J.A.B. Fortes, K.-S. Fu, B.W. Wah, Systematic design approaches for algorithmically
specified systolic arrays, in: V.M. Milutinovic (Ed.), Computer Architecture: Concepts and
Systems, North-Holland, Elsevier Science Publishing Co., New York, NY 10017, 1988, pp.
454–494 (Chapter 11).
[18] J.A.B. Fortes, D.I. Moldovan, Parallelism detection and algorithm transformation techniques
useful for VLSI architecture design, J. Parallel Distributed Comput. 2 (1985) 277–301.
[19] J.A.B. Fortes, F. Parisi-Presicce, Optimal linear schedules for the parallel execution of
algorithms, Int. Conf. Parallel Processing (1984) 322–328.
[20] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-
Completeness, W.H. Freeman, San Francisco, CA, 1979.
[21] J. Granata, M. Conner, R. Tolimieri, Recursive fast algorithm and the role of the tensor
product, IEEE Trans. Signal Process. 40 (12) (1992) 2921–2930.
[22] L.J. Guibas, H.-T. Kung, C.D. Thompson, Direct VLSI implementation of combinatorial
algorithms, Proc. Caltech Conf. on VLSI (1979) 509–525.
[23] D.S. Hirschberg, Recent results on the complexity of common-subsequence problems, in: D.
Sankoff, J.B. Kruskal (Eds.), Time Warps, String Edits, and Macromolecules: The Theory and
Practice of Sequence Comparison, Addison-Wesley, Reading, Mass, 1983.
[24] O.H. Ibarra, M. Palis, VLSI algorithms for solving recurrence equations and applications,
IEEE Trans. Acoust, Speech, Signal Process. ASSP-35 (7) (1987) 1046–1064.
[25] G.-J. Li, B.W. Wah, The design of optimal systolic algorithms, IEEE Trans. Comput. C-34 (1)
(1985) 66–77.
[26] R.M. Karp, R.E. Miller, S. Winograd, Properties of a model for parallel computations:
determinacy, termination, queueing, SIAM J. Appl. Math. 14 (1966) 1390–1411.
[27] R.M. Karp, R.E. Miller, S. Winograd, The organization of computations for uniform
recurrence equations, J. ACM 14 (1967) 563–590.
[28] E.V. Krishnamurthy, M. Kunde, M. Schimmler, H. Schroder, Systolic algorithm for tensor
products of matrices: implementation and applications, Parallel Comput. 13 (3) (1990) 301–
308.
[29] E.V. Krishnamurthy, H. Schroder, Systolic algorithm for multivariable approximation using
tensor products of basis functions, Parallel Comput. 17 (4–5) (1991) 483–492.
[30] H.-T. Kung, C.E. Leiserson, Algorithms for VLSI processor arrays, in: Introduction to VLSI
Systems, Addison-Wesley Publishing Co, Menlo Park, CA, 1980, pp. 271–292.
[31] S.-Y. Kung, S.-C. Lo, P.S. Lewis, Optimal systolic design for the transitive closure
and the shortest path problem, IEEE Trans. Comput. 36 (5) (1987) 603–614.
[32] L. Lamport, The parallel execution of Do-Loops, Commun. ACM 17 (2) (1974) 83–93.
[33] W.-M. Lin, V.K. Prasanna Kumar, A note on the linear transformation method for systolic
array design, IEEE Trans. Comput. 39 (3) (1990) 393–399.
[34] B. Louka, M. Tchuente, An optimal solution for Gauss–Jordan elimination on 2D systolic
arrays, in: J.V. McCanny, J. McWhirter, E.E. Swartzlander Jr. (Eds.), Systolic Array
Processors, Prentice-Hall, Killarney, Ireland, 1989, pp. 264–274.
[35] P.A. MacMahon, The Diophantine inequality λx ≥ μy, in: George E. Andrews (Ed.),
Collected Papers, Combinatorics, vol. I, The MIT Press, Cambridge, Mass, 1979, pp. 1212–
1232.
[36] P.A. MacMahon, Memoir on the theory of the partitions of numbers, Part II, in: George E.
Andrews (Ed.), Collected Papers, Combinatorics, vol. I, The MIT Press, Cambridge, Mass,
1979, pp. 1138–1188.
[37] P.A. MacMahon, Note on the Diophantine inequality λx ≥ μy, in: G.E. Andrews (Ed.),
Collected Papers, Combinatorics, vol. I, The MIT Press, Cambridge, Mass, 1979, pp. 1233–
1246.
[38] D.I. Moldovan, On the design of algorithms for VLSI systolic arrays, Proc. IEEE 71 (1) (1983)
113–120.
[39] P. Quinton, Automatic synthesis of systolic arrays from uniform recurrent equations, in: Proc.
11th Symp. on Computer Architecture, 1984, pp. 208–214.
[40] S.V. Rajopadhye, S. Purushothaman, R.M. Fujimoto, On synthesizing systolic arrays from
recurrence equations with linear dependencies, in: K.V. Nori (Ed.), Lecture Notes in
Computer Science, Foundations of Software Technology and Theoretical Computer Science,
vol. 241, Springer-Verlag, 1986, pp. 488–503.
[41] Y. Robert, D. Trystram, An orthogonal systolic array for the algebraic path problem, in: Int.
Workshop Systolic Arrays, 1986.
[42] G. Rote, A systolic array algorithm for the algebraic path problem (shortest paths; matrix
inversion), Computing 34 (3) (1985) 191–219.
[43] S. Sahni, Computational related problems, SIAM J. Comput. 3 (1974) 262–279.
[44] C. Scheiman, Mapping fundamental algorithms onto multiprocessor architectures. Ph.D.
thesis, UC Santa Barbara, Department of Computer Science, [email protected], December
1993.
[45] C. Scheiman, P. Cappello, A processor-time minimal systolic array for the 3D rectilinear mesh,
in: Proc. Int. Conf. On Application Specific Array Processors, Strasbourg, France, July 1995,
pp. 26–33.
[46] C. Scheiman, P.R. Cappello, A processor-time minimal systolic array for transitive closure, in:
Proc. Int Conf. on Application Specific Array Processors, IEEE Computer Society, Princeton,
1990, pp. 19–30.
[47] C. Scheiman, P.R. Cappello, A processor-time minimal systolic array for transitive closure,
IEEE Trans. Parallel Distributed Syst. 3 (3) (1992) 257–269.
[48] C.J. Scheiman, P. Cappello, A period-processor-time-minimal schedule for cubical mesh
algorithms, IEEE Trans. Parallel Distributed Syst. 5 (3) (1994) 274–280.
[49] W. Shang, J.A.B. Fortes, Time optimal linear schedule for algorithms with uniform
dependencies, IEEE Trans. Comput. 40 (6) (1991) 723–742.
[50] R.P. Stanley, Linear homogeneous diophantine equations and magic labelings of graphs, Duke
Math. J. 40 (1973) 607–632.
[51] R.P. Stanley, Enumerative Combinatorics, vol. I, Wadsworth & Brooks/Cole, Monterey, CA,
1986.
[52] B. Sturmfels, Gröbner Bases and Convex Polytopes, AMS University Lecture Series,
Providence, RI, 1995.
[53] B. Sturmfels, On vector partition functions, J. Combinatorial Theory, Ser. A 72 (1995) 302–
309.
[54] J.D. Ullman, Computational Aspects of VLSI, Computer Science Press, Inc, Rockville, MD
20850, 1984.
[55] H. Le Verge, C. Mauras, P. Quinton, The ALPHA language and its use for the design of
systolic arrays, J. VLSI Signal Process. 3 (1991) 173–182.
[56] S. Warshall, A theorem on Boolean matrices, J. ACM 9 (Jan) (1962).
[57] D.K. Wilde, A library for doing polyhedral operations, Master's thesis, Corvallis, Oregon,
December 1993. Also published as IRISA technical report PI 785, Rennes, France, December
1993.
[58] Y. Wong, J.-M. Delosme, Optimization of processor count for systolic arrays, Department of
Computer Science, RR-697, Yale University, May 1989.
[59] Y. Wong, J.-M. Delosme, Space-optimal linear processor allocation for systolic array
synthesis, in: V.K. Prasanna, L.H. Canter (Eds.), Proc. 6th Int. Parallel Processing Symp.,
IEEE Computer Society Press, Beverly Hills, 1992, pp. 275–282.
[60] Kai Hwang, Faye A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill
Series in Computer Organization and Architecture, 1985.
[61] A.P. Petitet, J.J. Dongarra, Algorithmic redistribution methods for block cyclic decompositions,
Technical Report ut-cs-98-382, February 1998, http://www.cs.utk.edu/~library/1998.html.
[62] K. Dowd, High Performance Computing, O'Reilly & Associates, Inc., California, 1993.