10
ELSEVIER Microprocessors and Microsystems 21 (I 998) 523-532 MICROPROCESSORS AND MICROSYSTEMS Register-file allocation via graph coloring Faridah Ali*, Imtiaz Ahmad Department of Electrical and Computer Engineering, Kuwait University, P.O. Box 5969, Safat 13060, Kuwait Received 10 March 1996; received in revised form 12 January 1998; accepted 12 January 1998 Abstract Storage allocation is an important task in data path synthesis. In this paper, we present a heuristic technique, designated as register-file allocation (RFA), for mapping variables into single-port register files. The technique is based on coloring the conflict graph, a graph generated from the read/write conflict analysis of the variables of a scheduled data flow graph. The coloring heuristic was originally developed by Chaitin for allocating variables to registers in software compilers. We modified the heuristic to perfbrm variables grouping to register files while taking into consideration the size of each register file and the interconnections for data transfer between register files and functional units simultaneously. The RFA heuristic is simple and fast, it takes O(nmk) time, where n is the size, m is the degree of the conflict graph, and k is the number of control steps of a scheduled data flow graph. Experimental results on an elliptic filter benchmark design example show that the same quality of results as those obtained from existing synthesis systems can be obtained, with a significant improvement in the CPU time. The RFA heuristic is generic and can be applied to other coloring problems in high-level synthesis. © 1998 Elsevier Science B.V. Keywords: Data-path; Synthesis; Allocation; Register-file 1. Introduction High-level synthesis is a design automation process that generates a register transfer level (RTL) hardware from a behavioral description of a digital system. High-level synthesis is commonly achieved by dividing the task into a data path design and a control unit design. Data path synthesis consists of two major tasks: scheduling and allo- cation. Scheduling binds operations to appropriate control steps and determines the resource requirement. Allocation is divided into three subtasks: functional units (FUs) alloca- tion, register allocation and interconnection allocation [1]. Register allocation is an important step in data path synth- esis. The starting point for register allocation is lifetimes of variables achieved from a scheduled data flow graph. A variable is said to be alive between the time it is generated and the last use of it. The register allocation binds variables to registers (register files) in such a way that the lifetimes of variables bound to each register will not overlap, with the objective of minimizing the number of registers used. Variables can be allocated to register files to reduce the interconnection and to have more structured design. *Corresponding author. Tel: (965) 4811188, ext. 5826; Fax: (965) 4817451 ; e-mail: [email protected] 0141-9331/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved PH S0141-933 1 (98)00047-7 Several different approaches, which use single port register files to reduce interconnections, have been proposed to solve the allocation problem. FACET [2] used a heuristic clique-partitioning technique to group variables into registers, which are then grouped into register files based on the disjoint access time. SPAID [3,4] maps the problem of grouping registers to register files to an edge-coloring problem on a bipartite multigraph. Grant and Denyer [5] first construct a square-graph between variables to record the access clash between variables of the same type of func- tional unit. Then use a hardest first heuristic to map vari- ables to single port register files. EASY [6] uses a maximal clique-partitioning approach to group registers into register files based on disjoint access time. The weight assigned to each edge is the number of times during the execution of the algorithm that values to be stored in the registers have a common origin or destination. ESC [7] first builds a state graph from a scheduled data flow graph. In a state graph a node is created for each control step and an edge is drawn fbr each storage value between its write control step and its read control step. Then it uses a general edge-coloring tech- nique on the state graph to group variables into a minimum number of register files. The problem of multiple-reads is resolved by parallel-reads or serial reads in the graph model. FLORA [8] and STAR [9] use a branch and bound techni- que to solve the allocation problem. Gebotys [10] models

Register-file allocation via graph coloring

Embed Size (px)

Citation preview

Page 1: Register-file allocation via graph coloring

E L S E V I E R Microprocessors and Microsystems 21 (I 998) 523-532

MICROPROCESSORS AND

MICROSYSTEMS

Register-file allocation via graph coloring

Faridah Ali*, Imtiaz Ahmad

Department of Electrical and Computer Engineering, Kuwait University, P.O. Box 5969, Safat 13060, Kuwait

Received 10 March 1996; received in revised form 12 January 1998; accepted 12 January 1998

Abstract

Storage allocation is an important task in data path synthesis. In this paper, we present a heuristic technique, designated as register-file allocation (RFA), for mapping variables into single-port register files. The technique is based on coloring the conflict graph, a graph generated from the read/write conflict analysis of the variables of a scheduled data flow graph. The coloring heuristic was originally developed by Chaitin for allocating variables to registers in software compilers. We modified the heuristic to perfbrm variables grouping to register files while taking into consideration the size of each register file and the interconnections for data transfer between register files and functional units simultaneously. The RFA heuristic is simple and fast, it takes O(nmk) time, where n is the size, m is the degree of the conflict graph, and k is the number of control steps of a scheduled data flow graph. Experimental results on an elliptic filter benchmark design example show that the same quality of results as those obtained from existing synthesis systems can be obtained, with a significant improvement in the CPU time. The RFA heuristic is generic and can be applied to other coloring problems in high-level synthesis. © 1998 Elsevier Science B.V.

Keywords: Data-path; Synthesis; Allocation; Register-file

1. Introduct ion

High-level synthesis is a design automation process that generates a register transfer level (RTL) hardware from a behavioral description of a digital system. High-level synthesis is commonly achieved by dividing the task into a data path design and a control unit design. Data path synthesis consists of two major tasks: scheduling and allo- cation. Scheduling binds operations to appropriate control steps and determines the resource requirement. Allocation is divided into three subtasks: functional units (FUs) alloca- tion, register allocation and interconnection allocation [1]. Register allocation is an important step in data path synth- esis. The starting point for register allocation is lifetimes of variables achieved from a scheduled data flow graph. A variable is said to be alive between the time it is generated and the last use of it. The register allocation binds variables to registers (register files) in such a way that the lifetimes of variables bound to each register will not overlap, with the objective of minimizing the number of registers used. Variables can be allocated to register files to reduce the interconnection and to have more structured design.

*Corresponding author. Tel: (965) 4811188, ext. 5826; Fax: (965) 4817451 ; e-mail: [email protected]

0141-9331/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved PH S 0 1 4 1 - 9 3 3 1 ( 9 8 ) 0 0 0 4 7 - 7

Several different approaches, which use single port register files to reduce interconnections, have been proposed to solve the allocation problem. FACET [2] used a heuristic clique-partitioning technique to group variables into registers, which are then grouped into register files based on the disjoint access time. SPAID [3,4] maps the problem of grouping registers to register files to an edge-coloring problem on a bipartite multigraph. Grant and Denyer [5] first construct a square-graph between variables to record the access clash between variables of the same type of func- tional unit. Then use a hardest first heuristic to map vari- ables to single port register files. EASY [6] uses a maximal clique-partitioning approach to group registers into register files based on disjoint access time. The weight assigned to each edge is the number of times during the execution of the algorithm that values to be stored in the registers have a common origin or destination. ESC [7] first builds a state graph from a scheduled data flow graph. In a state graph a node is created for each control step and an edge is drawn fbr each storage value between its write control step and its read control step. Then it uses a general edge-coloring tech- nique on the state graph to group variables into a minimum number of register files. The problem of multiple-reads is resolved by parallel-reads or serial reads in the graph model. FLORA [8] and STAR [9] use a branch and bound techni- que to solve the allocation problem. Gebotys [10] models

Page 2: Register-file allocation via graph coloring

524 F. Ali, 1. Ahmad/Microprocessors and Microsystems 21 (1998) 523-532

the problem as a linear integer programming problem. Chen and Sobelman [11 ] used a 0-1 matrix partitioning technique to group registers to register files. Other approaches have been implemented in systems like Cathedral-II [12], HYPER [13] and PARBUS [14].

In this paper, we are concerned about only foreground memory management [7], not about background memory management [ 12,18]. In most previous approaches register file allocation is divided into two basic steps: variable grouping and register allocation followed by interconnec- tion allocation. In variable grouping, variables are grouped according to their read and write times such that each group can share the same register file. After value grouping, the registers are allocated to each register file separately. In this paper, we use a graph coloring technique to bind variables to a minimum number of register files and at the same time minimize the size of each register file and allocate interconnections simultaneously. The proposed heuristic is very fast, generic in nature and can be applied to other coloring problems in high-level synthesis. The remainder of the paper is organized as follows: in Section 2 basic definitions and problem formulation are given. An overview of graph coloring algorithms in software com- pilers is given in Section 3. Details of the presented heuristic are given in Section 4. Experimental results are reported in Section 5 and conclusions and future work are presented in Section 6.

2. Problem formulation

2.2. Definition 2

For every cyclic variable v, let P(v) be the last read pro- ceeding the control step in which v is written. Therefore if r'(v) ---- {c E r(v)]c < w(v)}, then P(v) = max r'(v). The lifetime cycle for the cyclic variable v is [1, P(v)] u [w(v), K], and v is called alive in this cycle.

2.3. Definition 3

The conflict graph Go(V, Ef) is defined where (vi, vj) E E f if both variables vi, vj are to be read or written at the same control step. Thus (vi, vj) E E f if

(r(vi) O r(vj)) v (w(vi) n w(vj)) ~ 0

2.4. Definition 4

A proper coloring of the graph is assigning a color to each node such that no two connected nodes have the same color.

Throughout this paper we will assume a two-phase clock- ing system. Data transfer from register files to functional units (FUs) is executed in the first phase while data transfer from FUs back to register files take place in the second phase. Therefore, no conflict will arise between reading a variable and writing another variable in the same control step.

As a walk-through example, we will adopt the scheduled data flow graph as given in Fig. l(a). Fig. l(b) shows the corresponding conflict graph for this example.

Given a set V = {vl, v2, " ' , vn} of variables of a sched- uled data flow graph and a set RF = {RFI, RF2, ..., RFr} of single-port register files, the objective is to find a mapping function F: V --* RF such that the number of register files, the size of each register file and the interconnects between register files and functional units are minimized. It is assumed that scheduling and functional unit allocation have been done for a scheduling length of K control steps. The mapping function F should guarantee that all variables grouped in a single register file have disjoint read and write times. This means that they are never accessed simulta- neously and therefore can share interconnections and ports. Moreover, the mapping function should guarantee that all variables mapped to the same register in a register file have non-overlapping lifetime intervals. For each variable v that belongs to V, let w(v) be the control step in which v is written and r(v) the set of control steps in which v is read.

2.1. Definition 1

For a non-cyclic variable v, the function L(v) denotes that the last control step v is read, thus L(v) = max r(v). The half- open interval [w(v), L(v)] is called the lifetime interval for v, and v is called alive in this interval.

3. Overview of graph coloring in register allocation

In this research, we are using the well-known graph coloring algorithm introduced by Chaitin [16]. This algo- rithm was designed originally for the register allocation phase in optimizing compilers, where the problem was for- mulated as coloring the interference graph, a graph built from the Virtual Assembly Language, with R colors, where R is the number of physical registers in the target machine. Chaitin's algorithm divides the coloring process into two phases:

1. Simplify phase, where the graph is reduced by removing all the nodes of degree less than R. Each time a node is removed, all its edges are disconnected. This process is repeated until the graph is empty.

2. Coloring phase, now the nodes are added back in reverse order and each node is given a color (from 1 to R) which is different from the colors of its neighbors.

Chaitin's work became popular because it is simple and fast, taking time to be linear with the size of the graph. The coloring algorithm described has to be modified in order to be used for register file allocation purposes in the following aspects:

Page 3: Register-file allocation via graph coloring

F. Ali, L Ahmad/Microprocessors and Microsystems 21 (1998) 523-532 525

t o t I t 2 t 3 t4 t 5

2 adl a~

3

3 4

m2

5 m3

ad 9 6

®

Co)

(a)

Fig. 1. (a) A scheduled data flow graph; (b) conflict graph.

1. In the register file allocation problem we do not know the number of colors off hand, therefore our problem is stated as coloring the conflict graph with a minimum number of colors. Consequently, we will always be able to color the conflict graph, a situation which is not guaranteed when coloring the interference graph in soft- ware compilers.

2. More attention is needed for color selection in our approach, since color selection has a strong impact on colorability, and thus on minimizing the number of reg- ister files. Assigning a variable to a particular register file may increase the number of registers depending on the alive variables in the register file during the lifetime interval of the variable to be mapped. Moreover, the mapping function will greatly influence the amount of the interconnects needed for data transfer. Therefore, a careful coloring strategy should be adopted that will not introduce any unnecessary extra cost of hardware (registers and interconnections).

4. Register files and interconnection allocation

4.1. Allocation o f register files

The coloring algorithm for register file allocation will also use two phases: simplification and coloring. The entire coloring process will take a single pass only. The major deviation in our work from global register allocation will be in the coloring phase.

The simplification phase is almost the same as that discussed by Briggs et al. [17] which is again based on Chaitin's work with some improvement. It can be summarized as follows:

1. Represent the nodes of the conflict graph in an array N, where N[j] is the first element of a linked list of nodes that have a degree o f j (i.e. have j neighbors). If there are no such nodes, N[j] is null. Represent the edges of the graph by an adjacency list.

2. At each stage, start from the first non-null element N[i], remove the node at the head of the list N[i] and push it onto a temporary stack; also move each of its neighbors one cell down in N. This reflects disconnecting the edges between the removed node and its neighbors in the graph and, therefore, the degree of each neighbor node will be decremented by one. Step 2 will be repeated until N is empty.

Fig. 2 shows a possible initialization of N and stack contents; and changes after the first two iterations for the example shown in Fig. 1. The sequence of the node removals in this case would be: t 2, t 3, ad l, t 4, t 5, ad2, to, tl, ad3, ml, m2, m3, ad9, Out, as shown in Fig. 2 (stack contents). The search in step (2) will not take more than the sum of the degrees of the nodes, which is exactly equal twice the number of the edges of the graph. Therefore, the total cost of the simplification phase is linear in the size of the graph [17].

For the coloring phase, Algorithm 1 is developed as shown in Fig. 3. Here we start by marking all the nodes of the conflict graph as being uncolored. The initial set of the colors is the empty set. At each stage, a node v is removed off the stack, and given the first encountered color (register file) RFi that satisfies the following:

1. The color is not used by any neighbor node of v. 2. The register file RFi can accommodate v without increas-

ing the number of its registers.

Condition (1) ensures that the conflict graph is properly

Page 4: Register-file allocation via graph coloring

526 F. Ali, 1. Ahmad/Microprocessors and Microsystems 21 (1998) 523-532

N[0] ~ Initial N

NIU

Initially empty stack

N after first iteration N[0]

N[I] ~ 1 - 7 7 - ] - ~

N[2] [ ~ 4 b ~ ~ d ~ 4 ~ [ ~ ~ ~ " ]

Stack contents

U N[0] ~ N after second iteration

Stack contents

N[0] N[I] N[2]

After 14th iteration N is empty

Stack contents

ad2 tO tl ad 3 m 1 m2

Out

Fig. 2. Linked list representation and stack contents for the example given in

colored, therefore it has to be always satisfied. We will identify the set of colors which is not used by the neighbors of node v as the set of missing colors {MRF}. If we run out of colors during the coloring process, then a new color has to be introduced to the set of available colors. The maximum number of alive variables in a control step gives the number of registers [15] in a register file. Condition (2) is applied by examining the function: Density (RFi, k), which gives the number of alive variables in register file i at control step k. If the value of the function is less than the size of RFi through- out the lifetime intervals of v, then the node can be mapped into this register file without any extra hardware cost. There- fore, the algorithm makes no distinction between allocating cyclic and non-cyclic variables, the main concern is the relation between the density and the size of the register file during all the control steps in which v is alive. If this condition (2) cannot be satisfied, then any color that satisfies condition (1) will be selected even though it will increase the size of the selected register file. This coloring process is repeated until we color all the nodes of the graph.

Fig. 1.

For the graph of Fig. 2, first of all node t2 is popped off the stack and is assigned the first color (RFo), similarly other nodes are popped off in a sequence and are assigned colors according to allocation criteria (1) and (2). After applying the heuristic the mappings are RFo = {t2, t4, tl, ad3, m2, m3, Out}, RFt = {t3, t5, ml, ad9 }, and RF2 = {adl, ad2, to). The nodes are colored in the reverse order of the removal sequence. The final result shows that the variables have been mapped into three register files with a total of six registers.

4.1.1. Definition 3 The degree of a graph G(V, E), is the max{dill --< i --< n},

where n is the number of nodes and di is the degree of the node i in the graph.

Since our approach is based on Chaitin's work, it can be easily seen that a conflict graph can be colored with at most m + 1 colors, where m is the degree of the graph. The complexity of Algorithm 1 is O(n.(m + (m + l)k)), since n nodes have to be colored, and for each node we examine

Page 5: Register-file allocation via graph coloring

F. Ali, L Ahmad/Microprocessors and Microsystems 21 (1998) 523-532

all the neighbors (at most m) to find the missing colors. The search among the missing colors to find the first color that does not increase the cost of the hardware may extend to (m + 1).k steps. Therefore, the complexity of Algorithm 1 is O( nmk ).

4.2. Interconnection allocation

The interconnection allocation assigns interconnections units such as multiplexers and tristate buffers that enable data transfers between register files and functional units. The goal is to minimize the number of switches (multiplexer inputs and tristate buffers). An optimal data path allocation technique should combine storage allocation, interconnec- tion allocation and functional units allocation. In this sec- tion, we will show how the interconnection allocation can be efficiently added to the register files allocation technique discussed in the previous section without increasing the complexity of the algorithm. Our focus will be on one bus per register file architecture, similar to the one used by SPAID [2,3].

Algorithm 1: Coloring Heuristic

Procedure min-size-color( )

{~} = 0; n_color = O;

for j = 1, n color(vj) = uncolored;

while stack not empty

v = pop-stack ( );

527

In our approach, switches will be attached to buses incrementally as we color the nodes. During the coloring process of each node, we tentatively allocate the node to all the candidate colors in {MRF}, and compute the needed additional number of switches for each color. We select the color which requires the least number of additional switches for transferring the variable between the register files and the functional units. Ties in the number of switches are broken by selecting the color with the best storage allocation~ If there are still more ties, then random selection is made. Eventually, when a color is selected, all required interconnections for transferring the variable to and from the functional units will be added to the circuit. So to consider interconnection allocation in the coloring algorithm previously shown in Fig. 3, all what we need to do is to compute the number of additional switches (multiplexer inputs + tristate buffers) for each element in {MRF}. Colors which resulted in a minimal additional switches are retained in {MRF}, while the other colors are discarded. The rest of the algorithms remains as it is.

{MRF} = {RFI - {colors used by neighbors of v };/* find the missing color */

if {MRF} = 0 then

n color = n_color + 1; /* add a new color */

{RF} = {RF} u {RFn_color};

color (v) = RFn_colo r ;

size (RFn_color) = 1;

else RF i = I'st element of {MRF};

Found = False; L=I;

While not (found) & L _< IMRFI

found = True;

for k = 1, K-control-steps

if (v is alive) and (Density (RFi, k) = size (RF i )) then

found = False;

endif;

if not (found) then RF i = next-element of {MRF}; L++;

endif;

endwhile;

if not (found) then

RF i = element of {MRF}; size (RF i )++;

color(v) = RF i ;

endif;

endif;

endwhile;

end Procedure.

Fig. 3. Coloring heuristic for allocation of register files.

Page 6: Register-file allocation via graph coloring

528 F. Ali, 1. Ahmad/Microprocessors and Microsystems 21 (1998) 523-532

RF2 RF1 RFO

T C o n s t a n t s

ml is popped of the stack

Stack contents ~

• Multiplexer input

0 Tristate buffer

Fig. 4. The partial circuit for the example in Fig. 1 when node m t is popped out of the stack.

Fig. 4 shows a stage in the data path synthesis for the example of Fig. 1. The circuit shown was achieved after allocating the nine top elements of the stack. For allocating the next element, variable rnl, we have MRF = {RF o, RFI}. Each one of these colors requires one additional switch to transfer m i between the register file and the functional unit. Since RFo can accommodate m t without any increase in the size of the register file, then it can be selected. Conse- quently, a tristate buffer is to be inserted from the output of the multiplier to the bus of RFo. This was the missing switch. Fig. 5 shows the final synthesized data path for this

RF2

aa 1

ad 2

ad 3

RF1 RF0

t 5 t2

tl t4

m2 to

ad 9 ml c~,, m3

()--

• Multiplexer input O Tristate buffer

Fig. 5. Final synthesized data path for the example in Fig. 1.

example. The variables have been mapped into three register files with a total of seven registers. Comparing this with the results obtained when only storage allocation was considered, we find that one extra register has been added as an expense for reducing the interconnections.

5. Experimental results

The proposed coloring heuristic, designated as RFA, has been implemented in C language on IBM 486 SX machine. The target architecture is similar to the SPAID [3] with one bus per register file and uses a two-phase clocking scheme. We are reporting results from three design examples: the elliptical wave filter (EWF), the auto regression (AR) filter and the 16-point finite impulse response (FIR) filter to demonstrate the effectiveness of the RFA coloring heuristic. The results for the EWF have been compared with SPAID [3], ESC [7] and STAR [9] to show the feasibility of the RFA coloring heuristic. The RFA coloring heuristic is very fast and compares favorably over other synthesis techniques.

Table 1 Comparison of results with ESC

C-steps RFA ESC [7]

RFs Registers RFs Registers

17 5 13 7 16 19 5 14 6 15 28 3 11 5 17 29 3 14 5 14 30 3 13 5 14

Page 7: Register-file allocation via graph coloring

F. Ali, L Ahmad/Microprocessors and Microsystems 21 (1998) 523-532

Table 2 Comparison of results for elliptical wave filter example

529

RFA SPAID [3] STAR [9]

C-steps 17 19 28 29 30 17 19 29 17 Multipliers 1 1 1 1 1 2 1 1 1 Adders 2 2 1 1 1 3 2 1 2 ROMs 1 1 1 1 1 1 1 1 1 Register files 5 5 3 3 3 6 5 4 5 Registers 13 14 11 14 13 17 19 19 14 MUX inputs 21 19 10 9 10 26(32) 20(25) 9(13) 17 Tristate buffers 14 15 8 8 9 N/A N/A N/A 14 Run time Less than a second on IBM 486 machine Order of minutes on SUN 3/160

(includes scheduling time)

19 28 1 1

2 1 1 1

4 3 13 12 17 9 16 8

Order of minutes on SUN 3/60

I

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

t2 t13 t18 t26 x15

Fig. 6. EWF scheduled into 17 control steps.

t39 out

Page 8: Register-file allocation via graph coloring

530 F. Ali, 1. Ahmad/Microprocessors and Microsystems 21 (1998) 523-532

I--.,- ,,-tp-

l..q~

- I¢-

• Multiplexer input

O Tristate buffer

Fig. 7. A synthesized data path for EWF.

5.1. Elliptical wave filter example

The EWF consists of 34 operations (eight multiplications and 26 additions) and it is the most extensively studied benchmark example in high-level synthesis. We have gen- erated different schedules for the EWF using non-pipelined adders and a pipelined multiplier. We are making two kinds of comparisons: one without interconnection allocation and the other with interconnection allocation. For without inter- connection allocation, we compare our results with ESC [7]. ESC first builds a state graph from a scheduled data flow graph. In a state graph, a node is created for each control step and an edge is drawn for each storage value between its write control step and its read control step. Then it uses a general edge-coloring technique on the state graph to group variables into a minimum number of register files. The problem of multiple-reads is resolved by parallel-reads or serial-reads in the graph model. ESC is independent of the number of functional units used and depends only on the number of control steps. The comparison of results with ESC is reported in Table 1. The results for ESC are taken from p. 245 (Fig. 7) from ref. [7], which also uses a two- phase clocking scheme. The RFA technique results in both fewer register files and registers in almost all cases.

Table 3 Allocation results for the AR and FIR filter

FIR filter AR filter

C-steps 10 11 15 14 20 Multipliers 2 1 1 3 1 Adders 3 2 1 3 1 Register files 7 5 3 9 4 Registers 17 16 16 17 17 MUX inputs 21 13 8 37 14 Tristate buffers 14 12 8 23 10

With interconnection allocation Table 2 illustrates the results of RFA against SPAID [2,3], and STAR [9]. We compute the number of multiplexer inputs by summing the number of register files to the points of interconnection from buses to the inputs of functional units. We compute the number of tristate buffers by summing the number of reg- ister files to the number of interconnection from the outputs of the functional units to the buses. Since our calculations are different from SPAID (it did not count the number of multiplexer inputs between the register files and buses, which is equivalent to the number of register files in the SPAID architecture), to have a fair comparison, we recalculate the results of SPAID and put them in parentheses in Table 2. STAR uses the same criteria as ours to compute the number of multiplexer inputs and tristate buffers. It is clear that our approach compares favorably over other synthesis techniques on this benchmark example. STAR uses a different interconnection topology, it does not use one bus per register file, which gives it freedom to exploit the data transfers. The RFA coloring technique requires fewer register files, registers and interconnections as compared with the SPAID. The results are also compared favorably with STAR. The RFA coloring technique has a few more multiplexer inputs but the number of registers is smaller or at least the same. The significant improvement is in the CPU time. The RFA heuristic produces almost the same quality results in less than 1 sec on an IBM 486 SX machine, while both STAR and SPAID take of the order of minutes on SUN 3/60 and SUN 3/160 machines, respect- ively. A scheduled and functional units allocated DFG for EWF is shown in Fig. 6 and the synthesized data path is shown in Fig. 7.

5.2. FIR and AR filter example

The FIR filter [20] consists of 23 operations (15 additions and eight multiplications) and the AR filter [19] consists of 28 operations (10 additions and 18 multiplications). Differ- ent schedules are generated using pipelined multipliers and non-pipelined adders. Since noone has reported the results previously for register files based architectures, we are just reporting the data path allocation results for these filters in Table 3. We are showing the complete data path result for FIR filter for 15 control step schedule. The 15 control step schedule for FIR filter is given in Fig. 8 and the data path is given in Fig. 9.

6. Conclusions and future work

In this paper, we presented a polynomial time graph coloring heuristic, designated as RFA, to perform variable grouping, register allocation and interconnection alloca- tion simultaneously. SPAID and ESC both use coloring heuristics for allocating the register files. In this regard,

Page 9: Register-file allocation via graph coloring

to tl t2

ad

F. Ali, 1. Ahmad/Microprocessors and Microsystems 21 (1998) 523-532

t3 t4 L5 t6 t7 t8 ~ rio Ill f19 tlq 11a i1~

531

10

12

• Multiplexer input

O Tristate buffer

Fig. 9. A synthesized data path for FIR filter.

References

13

14

15

Fig. 8. FIR filter scheduled into 15 control steps.

the RFA technique outperforms both these techniques and gives fewer register files. Regarding the interconnection allocation results, the RFA technique results are compared favorably with the other techniques. The significant improvement was in the CPU time from the order of minutes to less than a second to produce almost the same quality results compared with the existing techniques. We need to perform functional units allocation in order to improve the interconnections. Moreover, internal forwarding will minimize the storage requirement.

Acknowledgements

The authors would like to thank the Kuwait University for providing the financial support for this work under research grant no. EPE-060. Thanks are also due to the reviewers for their constructive comments to improve the quality of this manuscript.

[1] D.D. Gajski, N.D. Dutt, A.C. Wu, S.Y. Lin, High-Level Synthesis: Introduction to Chip and System Design, Kluwer Academic, Dor- drecht, 1992.

[2] C.J. Tseng, D.P. Siewiorek, Automated synthesis on data path in digital systems, IEEE Trans. Computer-Aided Des. CAD-5 (3) (1986) 379-395.

[3] B.S. Haroun, M.I. Elmasry, Architectural synthesis for DSP silicon compiler, IEEE Trans. Computer-Aided Des. CAD-8 (4) (1989) 431- 447.

[4] B.S. Haroun, M.I. Elmasry, SPAID: an architectural synthesis tool for DSP custom application, IEEE J. Solid-State Circuits 24 (2) (1989) 426-435.

[5] D.M. Grant, P.B. Denyer, Memory, control and communication synth- esis for scheduled algorithms, in: Proceedings of the 27th ACMflEEE Design Automation Conference, June 1990, pp. 162-167.

[6] L. Stok, V.D. Born, EASY: multiprocessor architecture optimization, in: G. Saucier, P.M. McLellan, Proceedings of the International Work- shop on Logic and Architecture Synthesis for Silicon Compilers, May 1988, pp. 313-328.

[7] L. Stok, J.A.G. Jess, Foreground memory management in data path synthesis, Int. J. Circuit Theory Applic. 20 (1992) 235-255.

[8] T.Y. Liu, Y.L. Lin, FLORA: a data path allocator based on branch- and-bound search, J. VLSI Integr. 11 (1) (1991) 43-66.

[9] F.S. Tsai, Y.C. Hsu, Data path construction and refinement, in: IEEE International Conference on Computer-Aided Design, ICCAD-90, November 1990, pp. 308-31 I.

[10] C.H. Gebotys, Synthesizing optimal register file architecture for FPGA technology, in: Proceedings of the IEEE 1994 Custom Inte- grated Circuits Conference, 1994, pp. 233-236.

[11] C.H. Chen, G.E. Sobelman, Single port/multiport memory synthesis in data path design, in: IEEE International Symposium on Circuits and Systems, vol. 2, May 1990, pp. 1110-1113.

[12] I. Verbauwhede, F. Catthoor, J. Vandewalle, H.D. Man, Background memory synthesis for algebraic algorithms in multi processor DSP chips, in: VLSI 89, 1989, pp. 209-218.

[13] M. Patkonjak, J. Rabey, A scheduling and resource allocation

Page 10: Register-file allocation via graph coloring

532 F. Ali, L Ahmad/Microprocessors and Microsystems 21 (1998) 523-532

algorithm for hierarchical signal flow graph, in: Proceedings of the 26th ACM/IEEE Design Automation Conference, June 1989, pp. 7-12.

[14] C. Ewering, Automatic high level synthesis of partitioned buses, in: Proceedings of the IEEE International Conference on Computer- Aided Design, ICCAD-90, November 1990, pp. 304-307.

[ l 5] A. Hashimoto, J. Stevens, Wire routing by optimizing channel assign- ment within large aperture, in: 8th Design Automation Workshop, 1971, pp. 155-159.

[16] G.J. Chaitin, Register allocation and spilling via graph coloring, in: Proceedings of SIGPLAN'82 Symposium on Compiler Construction, SIGPLAN Notices 17(6), June 1982, pp. 98-105.

[17] P. Briggs, K. Cooper, K. Kennedy, L. Torezon, Coloring heuristics for register allocation, in: SIGPLAN'89 Conference on Programming Language Design and Implementation, 1989, pp. 275-284.

[18] F. Balasa, F. Catthoor, H.D. Man, Background memory area estima- tion for multidimensional signal processing systems, IEEE Trans. VLSI systems 3 (2) (1995) 157-172.

[19] R. Jain, A.C. Parker, N. Park, Predicting system-level area and delay for pipelined and non-pipelined designs, IEEE Trans. Computer- Aided Des. CAD-I 1 (8) (1992).

[20] N. Park, A.C. Park, SEHWA: a software package for synthesis of pipelines from behavioral specifications, IEEE Trans. Computer- Aided Des. CAD-7 (3) (1988) 356-370.

. . . . . Imtiaz Ahmad received his B.Sc. in Electrical Engineering from the University to Engineering and Technology, Lahore, Pakistan, an M.Sc. in Electrical Engineering from King Fahd Univer- sity of Petroleum and Minerals, Dhahran, Saudi Arabia, and a Ph.D. from Syracuse University, New York, in 1984, 1988and 1992, respectively. Since September 1992, he has been with the department of Electrical and Computer

'~.~ ;~L" Engineering at Kuwait University, where he is currently an associate professor. His research interests include design automation of digital

syatems, high-level synthesis, and parallel processing.

Faridah Ali is an assistant professor in the department of electrical and computer engineering at Kuwait University. Her research interests are in the area of high-level synthesis and computer architecture. She received her B.Sc. from Kuwait University in 1980, M.S.E. from University of Michigan, Ann Arbor in 1983, and Ph.D. from Virginia Polytechnic Institute in 1988, all in electrical engineering.