

Efficient Scheduling of Fine Grain Parallelism in Loops *

M. Rajagopalan and V. H. Allan Department of Computer Science, Utah State University

Logan, Utah 84322-4205 [email protected] Phone:(801) 750-2022 Fax:(801) 750-3265

Abstract

This paper presents a new technique for software pipelining using Petri nets. Our technique, called the Petri Net Pacemaker (PNP), can create near optimal pipelines with less algorithmic effort than other techniques. The pacemaker is a novel idea which exploits the behavior of Petri nets to model the problem of scheduling the operations of a loop body for software pipelining.

Keywords: Software Pipelining, Loop Optimization, Fine-Grain Parallelism, Petri Nets, Scheduling

1 Introduction

Software pipelining takes advantage of parallelism between iterations of a loop. Rather than optimize the body of the loop separately, as would be done in local methods, the entire loop is considered. Some techniques have completely unrolled¹ the loop so that all iterations are visible. Operations are then scheduled as early as dependencies allow. Dependencies which span several iterations are termed loop carried. One can create a dependency graph in which the nodes represent operations and the arcs represent a must follow relationship. When all iterations are seen in an unrolled fashion, the arcs must be annotated with a min time which is the time which must elapse between the time the first operation is executed and the time the second operation is executed. It is common to let one node represent an operation from all iterations. Since operations of a loop behave similarly in all iterations, this is a reasonable notation. However, a dependence from node a from the first iteration to b from the third iteration must be distinguished from a dependence between a and b of the same iteration. Thus, in addition

* This work was partially supported by the National Science Foundation under grants CDA-9100788 and CDA-9200371.

¹ Complete unrolling means to replace a loop which executes N times with N copies of the loop body.

to being annotated with min time, each dependency is annotated with the dif which is the difference in the iterations from which the operations come.

Consider the example of Figure 1. Figure 1(a) shows the original loop. Figure 1(b) shows the first few iterations of the unrolled loop with arcs representing data dependencies. Figure 1(c) shows the data dependency graph. Most dependencies are between operations of the same iteration, which is indicated by a zero as the first component of (dif, min). Notice that operation 1 is dependent on operation 1 from the previous iteration.

The decision of which operations to execute together at each point in time is termed a schedule. Schedules such as Figure 1(e) show time progressing vertically and iterations progressing horizontally. Operations from various iterations which can be scheduled at the same time are termed a parallel instruction and appear in a row of the schedule. The parallel architecture on which the instructions execute will have some limitations based on resources. The resources can represent physical devices such as the number of adders or can represent fields in a VLIW instruction. When independent operations may not execute together, it is termed a resource conflict.

Though a time efficient schedule can be created by completely unrolling a loop and scheduling each operation, the complexity of the task is great because of the large number of operations involved in the schedule and the large amount of space required to store the resulting schedule. Early attempts at software pipelining look for a pattern in the emerging schedule. Such a pattern is considered to be a new loop body and is scheduled using branches to the repeated code. The instructions of a repeating pattern form the steady state of the pipeline. In this paper, we will indicate a steady state by enclosing the operations in a box as shown in Figure 1(e). The steady state, L, is the loop body of the new loop. Because the work of one iteration is divided into chunks and executed in parallel with the work for other iterations, it is termed


the software pipeline.

Figure 1(a), the original loop:

    for (i=1; i<=n; i++)
      O1: a[i+1] = a[i] + 1
      O2: b[i] = a[i+1] / 2
      O3: c[i] = b[i] + 3
      O4: d[i] = c[i]

In Figure 1, a schedule is achieved in which an iteration of the new loop is started in every instruction. The delay between the initiation of iterations of the new loop is called the initiation interval and is the length of L, denoted ℓ. This delay is also the slope of the schedule, which is defined to be min/dif of the cycle of the dependence graph. We extend the concept of (dif, min) to a path. Let π represent a cyclic path from a node to itself. Let min_π be the sum of the min times on the arcs which constitute the cycle. Let dif_π be the sum of the dif times on the constituent arcs. The maximum of min_π/dif_π for all cycles π is a lower bound on the length of the new loop body.

Figure 1(d), the prelude and the new loop body:

    (prelude)
    a[2] = a[1] + 1
    b[1] = a[2] / 2
    c[1] = b[1] + 3
    a[3] = a[2] + 1
    b[2] = a[3] / 2
    a[4] = a[3] + 1

    for (i=1; i<=n-3; i++)
      d[i] = c[i]
      c[i+1] = b[i+1] + 3
      b[i+2] = a[i+3] / 2
      a[i+4] = a[i+3] + 1

    (postlude)

Figure 1: (a) Loop Body Code (b) First Three Iterations of Unrolled Loop (c) Data Dependency Graph (d) High-level Code for Prelude and New Loop Body (Postlude Is Omitted) (e) Execution Schedule of Iterations (Time (min) Is Vertical Displacement, Iteration (dif) Is Horizontal Displacement). In this example min = 1 and dif = 1. The slope (min/dif) of the schedule is then 1.

When several iterations of the old loop exist in the new loop, the effective initiation interval is ℓ divided by the number of copies of each operation in the loop L.

There are basically four different types of software pipelining algorithms. One type is the one already discussed. The technique is to unroll the loop and look for a pattern to form, using force if necessary [AN88, SDWX87, GWN91, ARL93]. The second type estimates the final length of the new loop body, and then places each operation in a schedule of that length so that all cyclic dependencies are satisfied [Lam88, Zak89, RST92]. The third type performs transformations on the loop to facilitate the overlapping of iterations [EN90]. The fourth type uses an exhaustive search to find a schedule [Veg92].

For simple examples, unrolling and looking for a pattern to form works well. However, several complications arise. Often, all operations in the loop do not naturally execute at the same rate, e.g. some are constrained to execute every third instruction while others execute every instruction. Thus, a pattern does not form without coercion. Conditional code within the loop complicates the scheduling as it is not known at scheduling time which branch will execute.² Sometimes the natural pattern of the loop does not repeat in an integral number of iterations. For example, a pattern may form which contains five instructions, but those instructions may contain two copies of each operation. For techniques which do not allow multiple copies of an operation to be scheduled, the results are not ideal.

² This assumes that scheduling happens after compilation and before execution.


2 Petri Nets

Using Petri nets to facilitate software pipelining provides an elegant and efficient solution to these complications. Petri nets have been used to represent the current state of a system of interacting components. The Petri net model allows us to represent operations as transitions, data dependencies as arcs, and the schedule as the firing order. A Petri net consists of a set of places P and a set of transitions T together with arcs A between transitions and places. Figure 2 shows a dependence graph and the corresponding Petri net. The transitions are represented by horizontal bars while places are represented by circles. An initial mapping M associates with each place p a number of tokens M(p), such that M(p) ≥ 0. A place p is said to be marked if M(p) > 0. Associated with each transition t is a set of input places Si(t) and a set of output places So(t). The set Si(t) consists of all places p such that there is an arc from p to t in the Petri net. Similarly, So(t) consists of all places p such that there is an arc from t to p in the Petri net. The marking at any instant determines the state of the Petri net. Formally, the state of a Petri net is the number of tokens at each place at a given point in time. The Petri net changes state by firing transitions. A transition t is ready to fire if for all p belonging to Si(t), M(p) ≥ w_p, where w_p is the weight of the arc between p and t. For simplicity of presentation, all arcs in this paper are assumed to have a weight of one.

When a transition fires, the number of tokens in each input place is decremented while the number of tokens in each output place is incremented. All transitions fire according to the earliest firing rule; that is, they fire as soon as all their input places are marked. A steady state is reached when a series of firings of the Petri net takes it to a state through which it has already passed. In Figure 2(a), node 3 must follow node 0, and nodes 1 and 2 must follow node 3. In Figure 2(b), the Petri net which forces this order is shown. Transition T3 cannot fire before transition T0 has fired because the token will be passed to the input place p1 of transition T3 only after T0 has fired. Therefore, T3 is dependent on T0. The firing of a transition can be thought of as passing the result of an operation performed at a node to other nodes which are waiting for this result. Since there are no arcs between transitions T1 and T2, these transitions are independent of each other and hence can be fired concurrently. Figures 2(b)-(d) show successive states in the Petri net. The resulting schedule is shown in Figure 2(e). The first column represents the marking of places at each point in time. The second column indicates which

Figure 3: Conflict: Transitions T2 and T3 cannot fire simultaneously, as the input place p1 can pass the token to either T2 or T3 but not both.

transitions fire (i.e., are scheduled) at each point in time.

A Petri net can also be used to indicate that only one of two choices can fire. In Figure 3, place p1 contains only one token, which can be used to fire either transition T2 or transition T3 but not both of them simultaneously. This conflict can be resolved using a suitable algorithm.
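The firing semantics just described are small enough to sketch directly. The following Python fragment (our own encoding, not from the paper) assumes unit arc weights and, for simplicity, resolves conflicts by a fixed priority order rather than the longest-waiting rule used later in this paper.

    from collections import Counter

    class PetriNet:
        """Unit-weight Petri net: transitions maps a name to its
        (input places, output places); marking maps a place to its tokens."""
        def __init__(self, transitions, marking):
            self.transitions = dict(transitions)
            self.marking = Counter(marking)

        def enabled(self, t):
            ins, _ = self.transitions[t]
            return all(self.marking[p] >= 1 for p in ins)

        def fire(self, t):
            ins, outs = self.transitions[t]
            for p in ins:
                self.marking[p] -= 1
            for p in outs:
                self.marking[p] += 1

        def step(self):
            # Earliest firing rule: everything enabled by the marking at the
            # start of the step fires together.  The re-check lets the first
            # transition in a fixed order win a conflict over a shared token.
            ready = [t for t in self.transitions if self.enabled(t)]
            fired = []
            for t in ready:
                if self.enabled(t):
                    self.fire(t)
                    fired.append(t)
            return fired

    def run_to_steady_state(net, limit=1000):
        """Fire until a marking repeats; the firings between the two visits
        of that marking are the steady state (the new loop body)."""
        seen, schedule = {}, []
        for _ in range(limit):
            state = tuple(sorted((p, n) for p, n in net.marking.items() if n > 0))
            if state in seen:
                k = seen[state]
                return schedule[:k], schedule[k:]   # (prelude, steady state)
            seen[state] = len(schedule)
            schedule.append(net.step())
        raise RuntimeError("no steady state within limit")

    # Two transitions in a cycle, one token: they alternate, one per time step.
    net = PetriNet({"T0": (["p1"], ["p0"]), "T1": (["p0"], ["p1"])}, {"p1": 1})
    print(run_to_steady_state(net))   # ([], [['T0'], ['T1']])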

3 Software Pipelining

Using Petri nets, we are able to create a schedule based solely on data dependence. When the data dependencies are cyclic, the sequence of nodes which are scheduled repeats. However, if the whole data dependence graph is adjusted so that it is a single strongly connected component,³ the rate of each firing is controlled. No node is allowed to be scheduled any more often than the slowest cycle dictates. Since all operations execute at the same rate, when a state is re-entered it is guaranteed that a repeating pattern has been located. The model called the Petri Net Pacemaker (PNP) [ARL93, Raj93] introduces a pacemaker to control the pace at which operations are scheduled. A pace of n means that each operation is scheduled every n instructions. The pacemaker is a strongly connected component of dummy nodes and arcs.⁴

Making the Dependency Graph Strongly Connected

If we force the data dependency graph to be strongly connected, the corresponding Petri net will

³ A strongly connected component of a directed graph is a set of nodes such that there is a directed path from every node in the set to every other node in the set.

⁴ They are termed dummy in that they do not correspond to actual operations to be scheduled.


Figure 2: Data Dependency Graph and Corresponding Petri Net Showing Concurrency: Transitions T1 and T2 are independent of each other and can fire simultaneously. (a) Data Dependency Graph (b)-(d) Successive States of the Petri Net (e) Schedule of Nodes

coordinate the scheduling of all parts of the graph. Since arcs in the dependence graph constrain the order in which nodes execute, we can add other constraints by modeling them as dependencies. As arcs are added, any cycle formed should have a dif value assigned to the new arcs so that the cycle has a min/dif ratio as close as possible to the lower bound for ℓ without exceeding it.

The slowest cycle (largest min/dif) in the dependence graph determines the ideal rate at which operations can be scheduled, as the operations can be scheduled no faster than the slowest operation.
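This bound is directly computable. The sketch below (our own encoding, not the paper's) enumerates the simple cycles of a dependence graph given as (src, dst, dif, min) arcs and returns the maximum min_π/dif_π; enumeration is exponential in general, but loop-body graphs are small.

    from fractions import Fraction

    def slope_lower_bound(arcs):
        """Lower bound on the initiation interval: the maximum of
        sum(min)/sum(dif) over the simple cycles of the dependence graph.
        arcs: (src, dst, dif, min) tuples with integer node labels."""
        graph = {}
        for s, d, dif, mn in arcs:
            graph.setdefault(s, []).append((d, dif, mn))
        best = Fraction(0)

        def dfs(start, node, dif_sum, min_sum, visited):
            nonlocal best
            for nxt, dif, mn in graph.get(node, ()):
                if nxt == start:                 # closed a cycle at its root
                    if dif_sum + dif > 0:
                        best = max(best, Fraction(min_sum + mn, dif_sum + dif))
                elif nxt > start and nxt not in visited:
                    dfs(start, nxt, dif_sum + dif, min_sum + mn, visited | {nxt})

        for v in list(graph):
            dfs(v, v, 0, 0, {v})   # each cycle is found from its smallest node
        return best

    # Figure 1's graph: the only cycle is operation 1's (dif=1, min=1)
    # self-dependence, so the bound (and the schedule's slope) is 1.
    print(slope_lower_bound([(1, 1, 1, 1), (1, 2, 0, 1), (2, 3, 0, 1), (3, 4, 0, 1)]))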

Let S be a dummy start node. By cyclically connecting the pieces of the graph to the start node, the rates of all nodes are synchronized.

Creating Petri Net

Once the data dependence graph has been modified so that it is strongly connected, the corresponding Petri net is created as follows:

1. For each node i in the data dependence graph, a transition Ti is created.

2. For each arc in the DDG from node i to node j, a place p is created along with arcs from i to p and from p to j.

3. The initial marking of the Petri net is such that for an arc from i to j with dif value d, the corresponding place p has d tokens assigned to it. This allows node j to be fired up to d times before node i is fired.

4. For each resource, a place pr is created. The place is assigned the same number of tokens as the number of instances of that particular resource. Resource usage is controlled by requiring that all nodes which need the resource be connected to this place.

5. If a node uses a particular resource, then there is a self loop from that transition to the resource place and back. The number of instances of that resource that are used by that transition is the weight of the input and output arcs from that resource place. Since a node needs a token on each of its input places before it can fire, the arc from the resource place requires that the resource be available before scheduling the node. The arc to the resource place allows the node to return the resource after use.

At each point in time, the transitions which fire form a parallel instruction of the schedule.
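The five steps above are mechanical enough to sketch. The data layout here is our own (the paper does not prescribe one): DDG arcs are (i, j, dif, min) tuples, resource usage is assumed to be one unit per node, and min times are not modeled (unit latency). The result plugs into the PetriNet sketch of Section 2.

    from collections import Counter

    def build_petri_net(ddg_arcs, resources, uses):
        """Steps 1-5: a transition per node, a place per DDG arc with dif
        initial tokens, and a resource place self-looped to its users."""
        transitions = {}
        marking = Counter()
        def t(i):  # step 1: one transition per DDG node
            return transitions.setdefault(f"T{i}", ([], []))
        for k, (i, j, dif, _min) in enumerate(ddg_arcs):
            p = f"p{k}"            # step 2: a place per arc, i -> p -> j
            t(i)[1].append(p)
            t(j)[0].append(p)
            marking[p] = dif       # step 3: dif tokens let j run dif
                                   # iterations ahead of i
        for r, count in resources.items():
            marking[f"res_{r}"] = count        # step 4: a place per resource
        for node, needed in uses.items():
            for r in needed:                   # step 5: self loop claims the
                t(node)[0].append(f"res_{r}")  # resource on firing and
                t(node)[1].append(f"res_{r}")  # returns it afterwards
        return transitions, marking

    # A two-node cycle sharing one ALU; feed the result to PetriNet above.
    transitions, marking = build_petri_net(
        [(0, 1, 0, 1), (1, 0, 1, 1)], {"alu": 1}, {0: ["alu"], 1: ["alu"]})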

Determining the Pace

The schedule which is produced using this technique will be valid. However, we can decrease the amount of time it takes to find a schedule (i.e., return to the same state) by forcing certain nodes to execute at the estimated rate. Because the dependence graph is strongly connected, no node can execute any faster (on the average) than another node. It may take several time steps before the cyclic dependencies force the correct pace.

The pacemaker attempts to force the estimated pace so a pattern is formed quickly. In the absence


of resource conflicts, the pacemaker schedules instructions at the optimal pace and produces an optimal schedule. Figure 4(a) shows the Petri net, and the schedule is shown in Figure 4(b). Row 0 of Figure 4(b) indicates the initial marking of each place in the Petri net; the number of tokens at each place is shown in parentheses. Each row indicates the transitions which fire given the current marking. For this schedule, it is assumed that there are no resource conflicts. Since the state at instruction 2 and the state at instruction 4 are identical, the pattern starts at instruction 2 and ends at instruction 3. We get a pattern with an initiation interval of two which is marked with a box.

The pacemaker always contains the minimum number of nodes required to achieve an initiation interval of ℓ. For example, if the initiation interval is three, then the pacemaker will have three nodes. If the initiation interval is 2.67, corresponding to a cyclic min of 8 and a cyclic dif of 3, then the pacemaker will have eight nodes. The arcs in the pacemaker are labeled with a min value of one, while a number of arcs equal to dif, starting at S, are labeled with a dif of one. The remaining arcs have a dif value of zero.
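The sizing rule reduces to arithmetic on the slowest cycle's (min, dif) pair. A sketch follows; the gcd reduction is our assumption about what "minimum number of nodes" requires when the pair is not in lowest terms.

    from fractions import Fraction
    from math import gcd

    def pacemaker_shape(cyclic_min, cyclic_dif):
        """Node count and dif-one arc count for a pacemaker pacing min/dif."""
        g = gcd(cyclic_min, cyclic_dif)
        m, d = cyclic_min // g, cyclic_dif // g
        return {"nodes": m, "dif_one_arcs": d, "pace": Fraction(m, d)}

    print(pacemaker_shape(8, 3))  # 8 dummy nodes, dif of one on 3 arcs, pace 8/3
    print(pacemaker_shape(3, 1))  # 3 dummy nodes for an integral interval of 3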

4 Extensions

Resource Conflicts

The competition for resource tokens is used to model resource conflicts. The transition which has not fired for the longest period of time is allowed to fire.

Zero Min Times

In some cases, the min time for an arc may be zero [Veg92]. Edges in the dependence graph which have a min value of zero require special treatment. The transitions which make up the edge are treated as special in that the two transitions are allowed to fire concurrently if all other data dependences are met and if there are no resource conflicts. However, the second transition cannot be fired before the first has fired. This has the effect of scheduling the second node as soon as possible after the first node has been scheduled.
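Continuing the simulator sketch from Section 2, this special treatment can be layered on top of step(); the pair encoding (tail transition, head transition of a zero-min arc) is our own.

    def step_with_zero_min(net, zero_min_pairs):
        """One time step honoring min-zero arcs: after the normal firings,
        the head of a zero-min arc also fires in this step if its tail just
        fired and its remaining input places are now marked."""
        fired = net.step()
        for tail, head in zero_min_pairs:
            if tail in fired and head not in fired and net.enabled(head):
                net.fire(head)   # consumes the token the tail just produced
                fired.append(head)
        return fired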

For example, in Figure 5 there is an arc between nodes 3 and 5 which has a min value of zero. Figure 5(c) shows the corresponding Petri net while Figure 5(d) gives the schedule. Transitions T5 and T3 are treated as special. When T3 fires, it passes a token to each of its output places p3 and p2. Since T5 is a special transition, it is allowed to fire immediately if the input place p4 also has a token. It should be

Figure 4: (a) Petri Net (b) Schedule Assuming Infinite Resources. Dummy nodes 7 and 8 (which are part of the pacemaker) are not shown in the schedule.


noted that in the schedule the place p3 is never shown to be marked, as the token it receives is immediately consumed by the firing of T5. For this example, this has the effect of scheduling node 3 and node 5 at the same time.

Predicates within Loops

Predicates within the loop body are handled by passing the token along nodes representing both of the branches and by having a merge node which does not fire until it receives the token from both branches. For example, in Figure 6, node 5 is added as a merge node. The control dependence itself is modeled as a data dependence.

Predicated Execution

Predicated execution is a technique to handle conditionals using basic block techniques. First, predicates in the loop are replaced with a statement which stores the result of the test in a predicate register, R. Instructions in the true branch such as a = b + c are replaced by predicated instructions such as a = b + c if R_t, which specifies that the operation will actually be completed only if the predicate R is true. Instructions on the false branch such as d = e * f would be replaced by d = e * f if R_f. Mahlke et al. propose an architecture which supports predicated execution in which the conditionally executed operations may execute concurrently with the statement which assigns the predicate [MLC+92]. This is possible because the operation is executed regardless of the value of the predicate, but is only allowed to change the target if the predicate has the appropriate value. Since the stores are performed in a later part of the execution cycle than the computation of the value, this is feasible.
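If-conversion of this kind is mechanical; the following sketch uses our own (dest, expr, guard) triples rather than any real instruction set.

    def if_convert(test_expr, true_stmts, false_stmts):
        """Replace a branch with straight-line predicated code: R holds the
        test result, and each operation commits only if its guard matches R."""
        code = [("R", test_expr, None)]                  # R = test
        code += [(d, e, "R_t") for d, e in true_stmts]   # completes if R true
        code += [(d, e, "R_f") for d, e in false_stmts]  # completes if R false
        return code

    # if (d > e) { a = b + c } else { d = e * f }  becomes:
    for dest, expr, guard in if_convert("d > e", [("a", "b + c")], [("d", "e * f")]):
        print(f"{dest} = {expr}" + (f" if {guard}" if guard else ""))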

Predicated execution can be modeled by assigning min values of zero to the control dependent arcs of the dependence graph. In Figure 6, arcs 1 → 2 and 1 → 3 would be (0,0) arcs. Table 1 shows the effects of predicated execution on the length of the initiation interval on ten sample loops. The column marked Effort measures the number of time steps required before a pattern is formed. It is not really a measure of the execution time of the scheduling algorithm: when arcs with a min time of zero are encountered, two scheduling steps are required but only one is counted. The Effort does give a measure of how often min times of zero are encountered.

Speculative Execution

Speculative execution refers to the execution of operations before it is known that they will be useful. It is similar to predicated execution except that instead of allowing the operation to be performed in the

Figure 5: (a) Dependence graph with arcs with a minimum delay of zero. (b) Dependence graph with the pacemaker added to it. (c) Petri net. (d) Schedule assuming infinite resources.


Figure 7: (a) Original dependence graph. (b) After renaming f = c - 1. (c) Dependence graph after forward substituting f. (d) After renaming g = f + e. (e) After renaming a = g + e.


Figure 6: (a) Original Dependence Graph (b) Dependence Graph After Adding Merge Node

Table 1: Predicated and Non-Predicated Execution Compared

            Non-Predicated        Predicated
    Loop    Effort     ℓ        Effort     ℓ
      1       12      1.5          9      1.5
      2       16      5.0         13      4.0
      3        8      6.0          7      5.0
      4       15      1.5         12      1.5
      5       10      2.5         10      2.5
      6       13      2.3          8      2.0
      7        6      5.0          5      4.0
      8       20      3.3         14      2.7
      9       16      2.0         11      1.3
     10       17      3.0         16      3.0

(The %Change column of the original table is not fully legible in this copy; its last entry is 5.88.)

same time step as the predicate, the operation can be performed many time steps before its usefulness is known. Renaming and forward substitution are used by Ebcioglu and Nakatani to move operations past predicates [EN90]. The length of a dependence cycle containing a control dependence is reduced when the true dependencies in the cycle are collapsed by forward substitution. Renaming is a technique which replaces the original operation by two new operations, one of which is a copy operation while the other is the original operation but whose result is assigned to another variable. The copy operation copies the value from this new variable to the original variable. Since the new variable is used only in the copy operation, it is free to move out of the predicate. Figure 7(b) shows how renaming is used to move the operation f = c - 1 past the predicate. A new variable f' is created and is assigned the result of the operation c - 1. In order to preserve the original semantics of the program, the value assigned to f' is copied back into f. Since f' is used only by the copy operation, it can move past the predicate.

Forward substitution is a technique which is used to collapse true dependencies. If there is an assignment statement which simply copies the value from one variable to another variable, then all subsequent uses of the left hand side of the assignment can be replaced by the right hand side of the assignment, eliminating true dependencies between these statements. In Figure 7(b), there is a true dependence between the operation in node 4 and that in node 5. This dependence is eliminated by replacing the value of f in node 5 by f'. Thus, the copy operation in node 4 can now be executed at the same time as the assignment to g in node 5. Figure 7(c) shows the result after forward substituting the value of f. The result in Figure 7(e) shows only copy operations on the true branch, and the length of the cycle has been reduced from 6 to 4. Notice that the forward substitution of the value of g in node 7' eliminates the true dependence between nodes 4 and 7'. Thus the length of the dependence chain is reduced by one.
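Both transformations are simple rewrites of a statement list. The sketch below applies them to the f = c - 1, g = f + e fragment of Figure 7; the naive textual replace stands in for a real pass that would match whole identifiers.

    def rename(stmts, target):
        """Split an assignment to target into an assignment to a new primed
        variable plus a copy back into target, freeing the computation to
        move past a predicate."""
        out = []
        for dst, expr in stmts:
            if dst == target:
                out.append((dst + "'", expr))   # new variable does the work
                out.append((dst, dst + "'"))    # copy restores the original
            else:
                out.append((dst, expr))
        return out

    def forward_substitute(stmts, copy_dst, copy_src):
        """After the copy 'copy_dst = copy_src', rewrite later uses of
        copy_dst to copy_src, collapsing the true dependence on the copy."""
        out, seen_copy = [], False
        for dst, expr in stmts:
            if (dst, expr) == (copy_dst, copy_src):
                seen_copy = True
            elif seen_copy:
                expr = expr.replace(copy_dst, copy_src)  # naive textual rewrite
            out.append((dst, expr))
        return out

    stmts = rename([("f", "c - 1"), ("g", "f + e")], "f")   # f' = c - 1; f = f'
    print(forward_substitute(stmts, "f", "f'"))             # g reads f' directly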

5 Results

Gao's model was compared with the PNP model using twenty single basic block loops from [Jon91], assuming no resource conflicts. The loops did not have loop carried dependencies which span more than one iteration, as Gao's model cannot handle them. The loops also did not have any predicates in them. The two models were compared in terms of the initiation


Table 2: Comparison of PNP with Gao's model on single basic block loops. (Initiation intervals for twenty loops under each model, with %Change; the individual entries are not legible in this copy. The average improvement is 17.9%.)

interval of the resulting pipeline. Table 2 gives a comparison of the two models. The PNP model shows an improvement of 17% over Gao's model. There is not a single loop where Gao's model does better than the PNP model. The reason for this is that (in the absence of resource conflicts) the PNP model always produces the optimal initiation interval. Also, the acknowledgement arcs in Gao's model create extra cycles, some of which may be slower than the optimal pace, thereby producing a pipeline with a less efficient initiation interval.

The PNP algorithm is compared with Lam's algorithm on loops with both low and high resource conflicts [Jon91]. Table 3(a) shows the results on loops with low resource conflicts while Table 3(b) shows the same loops with high resource conflicts. With low resource conflicts, the PNP algorithm does marginally better. With high resource conflicts, the PNP algorithm shows a significant improvement of 9.2% over Lam's algorithm.

Lam's algorithm cannot achieve fractional rates without unrolling, while the PNP model is able to do much better in a number of cases as it achieves fractional rates. Since Lam's algorithm schedules each strongly connected component separately, a fixed ordering of the nodes belonging to each strongly connected component in the schedule is used. This fixed ordering, together with the resource conflicts between the nodes of the strongly connected component and other nodes of the schedule, restricts the overlap between them. This forces Lam's algorithm to use a higher value of ℓ which in many cases is not optimal.

Also, the PNP model does not bind the nodes of a strongly connected component together. These nodes

are free to float independently of each other and are scheduled at the earliest instruction for which the resource and the data dependency requirements of that operation are met.

The PNP algorithm is compared with Vegdahl's technique using 15 test cases from [Veg92]. The PNP model is extended to accommodate some of the constraints imposed on the schedule by the CNAPS architecture used by Vegdahl. These constraints include (1) a(bcd), which means that operation a must be placed in an instruction that contains either operation b, operation c, or operation d, and (2) a → b, which means that operation a must precede operation b by exactly one instruction. The second constraint involves reshuffling the priorities of the operations so that operation b can be scheduled whenever a is scheduled in the previous instruction. This reduces the effectiveness of the pacemaker as operations are no longer scheduled at the pace dictated by the pacemaker.

Table 4 compares Vegdahl's technique with the PNP algorithm. Vegdahl's technique performs an exhaustive search of all the possible schedules to look for a schedule with the shortest length. PNP compares quite favorably with this exhaustive technique. PNP actually performs better than Vegdahl's technique on code which requires a non-integral initiation interval, as Vegdahl's technique is limited to integral initiation intervals. The algorithmic effort of PNP is negligible compared to Vegdahl's exhaustive technique.

References

[AN88] A. Aiken and A. Nicolau. Optimal Loop Parallelization. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 308-317, Atlanta, GA, June 1988.

[ARL93] V.H. Allan, M. Rajagopalan, and R.M. Lee. Software Pipelining: Petri Net Pacemaker. In Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, Orlando, FL, January 20-22, 1993.

[EN90] K. Ebcioglu and T. Nakatani. A New Compilation Technique for Parallelizing Loops with Unpredictable Branches on a VLIW Architecture. In D. Gelernter, editor, Languages and Compilers for Parallel Computing, pages 213-229. MIT Press, Cambridge, MA, 1990.

[GWN91] G.R. Gao, W-B. Wong, and Q. Ning. A Timed Petri-Net Model for Fine-Grain Loop Scheduling. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 204-218, June 26-28, 1991.

[Jon91] R.B. Jones. Constrained Software Pipelining. Master's thesis, Department of Computer Science, Utah State University, Logan, UT, September 1991.


Table 3: Comparison of PNP with Lam's algorithm on loops with (a) low resource conflicts and (b) high resource conflicts. (Per-loop initiation intervals for 44 loops and the %Change column are not legible in this copy; the average improvement under high resource conflicts is 9.2%.)

Table 4: Comparison with Vegdahl's method. (The per-loop entries are not legible in this copy.)

[Lam88] M.S. Lam. Software Pipelining: An Effective Scheduling Technique for VLIW Machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 318-328, Atlanta, GA, June 1988.

[MLC+92] S.A. Mahlke, D.C. Lin, W.Y. Chen, R.E. Hank, and R.A. Bringmann. Effective Compiler Support for Predicated Execution Using the Hyperblock. In Proceedings of the 25th International Symposium and Workshop on Microarchitecture (MICRO-25), Portland, OR, December 1-4, 1992.

[Raj93] M. Rajagopalan. A New Model for Software Pipelining Using Petri Nets. Master's thesis, Department of Computer Science, Utah State University, Logan, UT, July 1993.

[RST92] B.R. Rau, M.S. Schlansker, and P.P. Tirumalai. Code Generation Schema for Modulo Scheduled Loops. In Proceedings of MICRO-25, The 25th Annual International Symposium on Microarchitecture, December 1992.

[SDWX87] B. Su, S. Ding, J. Wang, and J. Xia. GURPR - A Method for Global Software Pipelining. In Proceedings of the 20th Microprogramming Workshop (MICRO-20), pages 97-105, Colorado Springs, CO, December 1987.

[Veg92] S. Vegdahl. A Dynamic-Programming Technique for Compacting Loops. In Proceedings of MICRO-25, The 25th Annual International Symposium on Microarchitecture, December 1992.

[Zak89] A.M. Zaky. Efficient Static Scheduling of Loops on Synchronous Multiprocessors. PhD thesis, Department of Computer and Information Science, Ohio State University, Columbus, OH, 1989.
