
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 23, NO. 9, SEPTEMBER 2004 1277

Operation-Centric Hardware Description and Synthesis

James C. Hoe, Member, IEEE, and Arvind, Fellow, IEEE

Abstract—The operation-centric hardware abstraction is useful for describing systems whose behavior exhibits a high degree of concurrency. In the operation-centric style, the behavior of a system is described as a collection of operations on a set of state elements. Each operation is specified as a predicate and a set of simultaneous state-element updates, which take effect only if the predicate is true on the current state values. The effect of an operation's state updates is atomic, that is, the legal behaviors of the system constitute some sequential interleaving of the operations. This atomic and sequential execution semantics permits each operation to be formulated as if the rest of the system were frozen, and thus simplifies the description of concurrent systems. This paper presents an approach to synthesize an efficient synchronous digital implementation from an operation-centric hardware-design description. The resulting implementation carries out multiple operations per clock cycle and yet maintains semantics consistent with the atomic and sequential execution of operations. The paper defines, and then gives algorithms to identify, conflict-free and sequentially composable operations that can be performed in the same clock cycle. The paper further gives an algorithm to generate the hardwired arbitration logic that coordinates the concurrent execution of conflict-free and sequentially composable operations. Lastly, the paper evaluates synthesis results based on the TRAC compiler for the TRSPEC operation-centric hardware-description language. The results from a pipelined processor example show that an operation-centric framework offers a significant reduction in design time while achieving implementation quality comparable to traditional register-transfer-level design flows.

Index Terms—Conflict-free, high-level synthesis, operation-centric, sequentially composable, term-rewriting systems (TRS).

I. INTRODUCTION

MOST hardware-description frameworks, whether schematic or textual, make use of concurrent state machines (or processes) as their underlying computation model. For example, in a register-transfer-level (RTL) language, the design entities are the individual registers (or other state primitives) and their corresponding next-state equations. We refer to this style of description as state-centric because the description of behavior is organized by the individual state element's next-state logic.

Manuscript received March 11, 2003; revised December 19, 2003. This work was supported in part by the Defense Advanced Research Projects Agency, Department of Defense, under Ft. Huachuca Contract DABT63-95-C-0150 and in part by the Intel Corporation. J. C. Hoe was supported in part by an Intel Foundation Graduate Fellowship during this research. This paper was recommended by Associate Editor R. Gupta.

J. C. Hoe is with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail: [email protected]).

Arvind is with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139-4307 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCAD.2004.833614

Alternatively, the behavior of a hardware system can be described as a collection of operations, where each operation can atomically update all or some of the state elements. In this paper, we refer to this style of description as operation-centric because the description of behavior is organized into individual operations. Our notion of operation-centric models for hardware description is inspired by the term rewriting systems (TRS) formalism [1]. The operation-centric hardware model of computation is also analogous to Dijkstra's guarded commands [7]. Similar models of computation can be found in some parallel programming languages (e.g., UNITY [5]), hardware-description languages for synchronous and asynchronous design synthesis (e.g., synchronized transitions (ST) [13], [16], [17]), and languages for hardware-design verification (e.g., st2fl [12], SINC [14]). In Section VII, we compare our description and synthesis approach to synchronized transitions.

The operation-centric model of computation consists of a set of state variables and a set of state transition rules. Each atomic transition rule represents one operation. An execution is interpreted as a sequence of atomic applications of the state transition rules. A rule can optionally be guarded by a predicate condition such that the rule is applicable only if the system's state satisfies the predicate condition. If several rules are enabled in the same state, any one of the enabled rules can be nondeterministically selected to update the state in one step, and afterwards, a new step begins on the updated state. With predicated/guarded transition rules, an execution is interpreted as a sequence of atomic rule applications such that each rule application produces a state that satisfies the predicate condition of the next rule to be applied. The atomicity of operation-centric state transition rules simplifies the task of hardware description by permitting the designer to formulate each rule/operation under the assumption that the system is not simultaneously affected by other potentially conflicting operations: the designer does not have to worry about race conditions between different operations. This permits an apparently sequential description of potentially concurrent hardware behaviors.
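To make these semantics concrete, the following minimal sketch (ours, not from the paper; Python stands in for hardware, and the rule names are hypothetical) interprets two guarded rules by repeatedly selecting one enabled rule at random and committing its updates atomically. Every run prints one legal sequential interleaving.

    import random

    # Each rule is (name, guard, update): the guard tests the current state;
    # the update returns the complete set of state changes for one operation.
    rules = [
        ("inc_x", lambda s: s["x"] < 3, lambda s: {"x": s["x"] + 1}),
        ("inc_y", lambda s: s["y"] < 3, lambda s: {"y": s["y"] + 1}),
    ]

    def step(state):
        """One atomic step: nondeterministically apply one enabled rule."""
        enabled = [(n, u) for (n, g, u) in rules if g(state)]
        if not enabled:
            return False                       # no rule applicable: halt
        name, update = random.choice(enabled)  # nondeterministic selection
        state.update(update(state))            # the rule's updates commit atomically
        return True

    state = {"x": 0, "y": 0}
    while step(state):
        print(state)                           # one legal sequential interleaving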

It is important to note that the simplifying atomic and sequential semantics of operations does not prevent a legal implementation from executing several operations concurrently. Implicit parallelism between operations can be exploited by an optimizing compiler that produces a synchronous implementation that executes several operations concurrently in a clock cycle. However, the resulting concurrent implementation must not introduce new behaviors, that is, behaviors that are not producible under the atomic and sequential execution semantics. This paper presents an approach to synthesize such a synchronous hardware implementation from an operation-centric description. The paper provides algorithms for detecting opportunities for the concurrent execution of operations and for generating hardwired arbitration logic to coordinate the concurrent execution in an implementation.

This paper is organized as follows. Section II first presents the abstract transition system (ATS), a simple operation-centric hardware formalism used to develop and explain the synthesis procedures. Section III next describes the basic synthesis procedure to create a reference implementation that is functionally correct but inefficient. The inefficiencies of the reference implementation are addressed in Sections IV and V using two optimizations based on the concurrent execution of conflict-free and sequentially composable operations, respectively. The resulting optimized implementation incorporates hardwired arbitration logic that enables the concurrent execution of conflict-free and sequentially composable operations. Section VI presents a comparison of designs synthesized from an operation-centric description versus a state-centric RTL description. Section VII discusses related work, and Section VIII concludes with a summary.

II. OPERATION-CENTRIC TRANSITION SYSTEMS

We have developed a source-level language called TRSPEC to support operation-centric hardware design specification [9]. TRSPEC's syntax and semantics are adaptations of a well-known formalism, TRS [1]. To avoid the complications of a source language, we instead present a low-level operation-centric hardware formalism called ATS. ATS was developed as the intermediate representation in our TRSPEC compiler. In this paper, we use ATS to explain how to synthesize an efficient implementation from an operation-centric hardware specification.

A. ATS Overview

The structure of an ATS is summarized in Fig. 1. At the top level, an ATS is defined as a triple ⟨S, S₀, T⟩. S is a list of explicitly declared state elements, including registers (R), arrays (A), and first-in-first-out (FIFO) queues (F). S₀ is a list of initial values for the elements in S. T is a set of operation-centric transitions, where each transition is a pair ⟨π, α⟩. In a transition, π is a boolean predicate expression, and α is a list of simultaneous actions, exactly one action for each state element in S. (An acceptable action for all state-element types is the null action ε.) In an ATS, if π of T = ⟨π, α⟩ is enabled in a state s [abbreviated as π(s) = true, or simply π(s)], then the simultaneous actions prescribed by α are applied atomically to update s to a new state s′. We use the notation δ(s) to mean the resulting state s′. In other words, δ is the functional equivalent of α.

B. ATS State Elements and Actions

In this paper, we concentrate on the synthesis of ATS with three types of state elements: registers (R), arrays (A), and FIFOs (F). The three types of state elements are depicted in Fig. 2 with the signals of their supported interfaces. Below, we describe the three types of state elements and their usage.

Fig. 1. ATS summary.

Fig. 2. Synchronous state primitives.

A register R can store an integer value up to a specified maximum word size. The value stored in a register can be referenced using the side-effect-free get() query and set to v using the set(v) action. An entry of an array A can be referenced using the side-effect-free a-get(idx) query and set to v using the a-set(idx, v) action. For brevity, we abbreviate A.a-get(idx) as A[idx] in our examples. The oldest value in a FIFO F can be referenced using the side-effect-free first() query, and can be removed by the deq() action. A new value can be added to a FIFO using the enq(v) action. The enq-deq(v) action is a compound action that conceptually first dequeues the oldest entry before enqueuing a new entry. In addition, the contents of a FIFO can be cleared using the clear() action. The status of a FIFO can be queried using the side-effect-free notfull() and notempty() queries. Conceptually, an ATS FIFO has a bounded but unspecified size. The exact size is only set at implementation time, and hence a specification must be valid for all possible sizes. Applying an enq() to a full FIFO or applying deq() to an empty FIFO is illegal. (Applying the enq-deq() action to a full FIFO is legal, however.) Any transition that queries or acts on a FIFO must include the appropriate notfull() or notempty() queries in its predicate condition.
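A software model can make these interfaces and legality rules concrete. The sketch below is ours; the method names mirror the queries and actions just described (reconstructed from context, not a published API).

    class Register:
        def __init__(self, init=0): self.val = init
        def get(self): return self.val                 # side-effect-free query
        def set(self, v): self.val = v                 # action

    class Array:
        def __init__(self, size, init=0): self.data = [init] * size
        def a_get(self, idx): return self.data[idx]    # abbreviated A[idx] in the text
        def a_set(self, idx, v): self.data[idx] = v

    class Fifo:
        def __init__(self, size): self.size, self.q = size, []
        def notfull(self): return len(self.q) < self.size    # status queries belong
        def notempty(self): return len(self.q) > 0           # in a transition's predicate
        def first(self): return self.q[0]                    # oldest entry, side-effect-free
        def enq(self, v):
            assert self.notfull(), "enq on a full FIFO is illegal"
            self.q.append(v)
        def deq(self):
            assert self.notempty(), "deq on an empty FIFO is illegal"
            self.q.pop(0)
        def enq_deq(self, v):
            self.deq()    # conceptually dequeues first, so this compound action
            self.enq(v)   # is legal even when the FIFO is full
        def clear(self):
            self.q = []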


Fig. 3. Simple two-stage pipelined processor example with S = ⟨pc, rf, im, dm, bf⟩.

Besides registers, arrays, and FIFOs, the complete ATS includes register-like state elements for input and output. An input state element (I) is like a register but without the set() action. A get() query on an input element returns the value presented on a corresponding external input port. An output state element (O) supports both get() and set(v), and its content is visible to the outside of the ATS on a corresponding output port. For brevity, we omit the discussion of I and O in this paper; for the most part, I and O are treated as registers during synthesis.

C. Pipelined Processor Example

In this example, we describe a two-stage pipelined processor, where a pipeline buffer is inserted between the fetch and execute stages. In an ATS, we use a FIFO, instead of a register, as the pipeline buffer. The buffering property of a FIFO provides the isolation needed to permit the operations in the two stages to be described independently. Although the resulting description reflects an elastic pipeline, our synthesis can infer a legal implementation that operates synchronously and has stages separated by registers (explained further in Section VI-A).

The key state elements and datapath of the two-stage pipelined processor example are shown in Fig. 3. The processor consists of state elements S = ⟨pc, rf, im, dm, bf⟩. The five state elements are: the program counter (pc), the register file (rf, an array of integer values), the instruction memory (im, an array of instructions), the data memory (dm, an array of integer values), and the pipeline buffer (bf, a FIFO of instructions). In an actual synthesis, the data width of the state elements would be inferred from the type declarations given in the source-level description.

Next, we examine ATS transitions that describe the behavior of this two-stage pipelined processor. Instruction fetching in the fetch stage can be described by the transition fetch = ⟨π_fetch, α_fetch⟩. This transition should fetch an instruction from the current program location in im and enqueue the instruction into bf. This transition should be enabled whenever bf is not full. In ATS notation

    π_fetch:  bf.notfull()
    α_fetch:  pc.set(pc.get() + 1), bf.enq(im[pc.get()])

Notice the specification of the fetch transition is unconcerned with what happens elsewhere in the pipeline (e.g., whether an earlier branch in the pipeline is about to be taken or whether the pipeline is about to encounter an exception).

The execution of the different instruction types in the second pipeline stage can be similarly described as separate atomic transitions. First, consider add = ⟨π_add, α_add⟩ for executing an add instruction. add is enabled only when the next pending instruction in bf is an add instruction. When applied, its actions carry out the semantics of the add instruction. Next, consider two separate transitions, bz-taken = ⟨π_bz-taken, α_bz-taken⟩ and bz-not-taken = ⟨π_bz-not-taken, α_bz-not-taken⟩, that specify the two possible executions of a branch-if-zero (bz) instruction. π_bz-not-taken tests for conditions when the next pending instruction is a bz instruction and the branch condition is not met. The resulting action of bz-not-taken is to leave the pipeline state unmodified other than to remove the executed instruction from bf. On the other hand, when π_bz-taken is satisfied, besides setting pc to the new branch target, α_bz-taken must also simultaneously clear the contents of bf because the fetch transition given earlier actually follows a simple branch speculation by always incrementing pc. Thus, if a branch is taken later, bf could hold zero or more speculatively fetched wrong-path instructions.

In this two-stage pipeline description, fetch in the first stage and a transition in the second stage could become applicable at the same time. Even though conceptually only one transition is to take place in each step, an implementation of this processor description must carry out both fetch and execute transitions in the same clock cycle; otherwise, the implementation does not behave like a pipeline. Nevertheless, the implementation must also ensure that a concurrent execution of multiple transitions produces the same result as a sequentialized execution of the same transitions in some order. In particular, consider the concurrent execution of fetch and bz-taken. Both transitions update pc and bf. In this case, the implementation has to guarantee that these transitions are applied in some sequential order. However, it is interesting to note that the choice of ordering determines how many bubbles are inserted after a taken branch, but it does not affect the processor's ability to correctly execute a program.
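The two-stage pipeline can be prototyped directly on the primitive classes sketched in Section II-B. The encoding below is ours: instructions are tagged tuples, the program is hypothetical, and the bz target is an immediate; these are assumptions for illustration rather than the paper's exact ISA.

    # State elements of Fig. 3 (widths and sizes chosen arbitrarily for the sketch).
    pc, bf = Register(0), Fifo(size=2)
    rf, im, dm = Array(32), Array(64), Array(64)
    im.data[0:2] = [("add", 3, 1, 2), ("bz", 4, 0)]   # hypothetical program

    def fetch():            # pi: bf not full; alpha: speculative fetch, pc := pc + 1
        if bf.notfull():
            bf.enq(im.a_get(pc.get())); pc.set(pc.get() + 1); return True
        return False

    def add():              # pi: next pending instruction is an add
        if bf.notempty() and bf.first()[0] == "add":
            _, rd, r1, r2 = bf.first()
            rf.a_set(rd, rf.a_get(r1) + rf.a_get(r2)); bf.deq(); return True
        return False

    def bz_not_taken():     # pi: next is bz and the condition register is nonzero
        if bf.notempty() and bf.first()[0] == "bz" and rf.a_get(bf.first()[1]) != 0:
            bf.deq(); return True
        return False

    def bz_taken():         # pi: next is bz and the condition register is zero
        if bf.notempty() and bf.first()[0] == "bz" and rf.a_get(bf.first()[1]) == 0:
            pc.set(bf.first()[2])   # redirect the front end to the branch target...
            bf.clear()              # ...and squash speculatively fetched instructions
            return True
        return False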


III. SYNTHESIZING A REFERENCE IMPLEMENTATION

This section introduces a basic synthesis procedure that maps an operation-centric ATS into a state-centric RTL representation. It is assumed that commercial RTL-synthesis tools would complete the synthesis path to a final physical implementation. The synthesis procedure described in this section produces a straightforward implementation that executes only one transition per clock cycle. In this synthesis, the elements of S are instantiated from a design library to constitute the state elements in the implementation. The transitions in T are combined to form the next-state logic for the state elements in a three-step procedure. The resulting reference implementation is functionally correct but contains many inefficiencies, most particularly its lack of concurrency due to the single-transition-per-cycle restriction. This inefficiency is addressed in Sections IV and V by two increasingly sophisticated optimizations that enable concurrent execution of conflict-free and sequentially composable transitions.

A. Extraction of π and δ

In this first step, all value expressions in the ATS are mapped to combinational signals evaluated on the current values of the state elements. In particular, this step creates a set of signals, π₁, ..., π_M, that are the predicate signals of the transitions in an M-transition ATS. The logic mapping in this step assumes all required combinational resources are available. Standard RTL optimizations can be applied later to simplify the combinational logic and to share redundant logic.

B. Arbitration Logic

In this step, we create an arbitrator to sequence the execution of enabled transitions. The arbitrator circuit generates the set of arbitrated enable signals φ₁, ..., φ_M based on π₁, ..., π_M. The block diagram of a generic arbitrator is shown in Fig. 4. Any valid arbitrator must, at least, ensure for any i:

1) φᵢ ⇒ πᵢ (a transition is enabled only if it is applicable);
2) (π₁ ∨ ... ∨ π_M) ⇒ (φ₁ ∨ ... ∨ φ_M) (some transition is enabled whenever any transition is applicable).

For a single-transition-per-cycle reference implementation, the arbitrator is further constrained to assert at most one φ signal in each clock cycle, reflecting the selection of one applicable transition. A priority encoder is a valid arbitrator for the reference implementation.
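For concreteness, a priority encoder meeting these constraints can be sketched as follows (our illustration, not the compiler's actual output).

    def priority_encode(pi):
        """Reference arbitrator: phi_i is asserted only if pi_i is (condition 1),
        some phi is asserted whenever some pi is (condition 2), and at most one
        phi is asserted per cycle (the single-transition restriction)."""
        phi = [0] * len(pi)
        for i, applicable in enumerate(pi):
            if applicable:
                phi[i] = 1        # fixed priority: lowest index wins
                break
        return phi

    assert priority_encode([0, 1, 1]) == [0, 1, 0]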

C. Next-State Logic Composition

Lastly, one conceptually begins by creating M independent versions of the next-state logic, each corresponding to one of the M transitions in the ATS. Next, the M versions of next-state logic are merged, state-element by state-element, using the φ signals for arbitration at each state element. For example, a particular register may be affected by only k of the M transitions, where k ≤ M because some transitions may not affect the register. The register takes on a new value if any of the k relevant transitions is applicable in a clock cycle. Thus, the register's latch enable is the logical-OR of the φ signals from the relevant transitions. The new value of the register is selected from the k

Fig. 4. Monolithic arbitrator for an M-transition ATS.

Fig. 5. Circuits for combining two transitions' actions on the same state element.

next-state values via a multiplexer controlled by the φ signals. Fig. 5 illustrates the merging circuit for a register that can be acted on by two transitions.

This merging scheme assumes at most one transition's action is applied to a particular state element in a clock cycle. Furthermore, all the actions of a selected transition must be selected in the same clock cycle at all affected state elements to ensure the appearance of an atomic transition. The two assumptions above are trivially met in a single-transition-per-cycle implementation. The details of the merging circuit for all three ATS state-element types are given next as RTL equations.

For each register R in S, the set of transitions that update R is U_R = {⟨πᵢ, αᵢ⟩ ∈ T : αᵢ,R ≠ ε}, where αᵢ,R is the action on R by transition i. R's data (R.D) and latch-enable (R.LE) inputs are

    R.LE = φ₁ + φ₂ + ... + φ_k        (over the k transitions in U_R)
    R.D  = φ₁·v₁ + φ₂·v₂ + ... + φ_k·v_k

where vᵢ is the next-state value prescribed by αᵢ,R. For each array A in S, the set of transitions that write A is W_A = {⟨πᵢ, αᵢ⟩ ∈ T : αᵢ,A ≠ ε}. A's write address (A.WA), data (A.WD), and enable (A.WE) inputs are

    A.WE = φ₁ + φ₂ + ... + φ_k        (over the k transitions in W_A)
    A.WA = φ₁·idx₁ + φ₂·idx₂ + ... + φ_k·idx_k
    A.WD = φ₁·v₁ + φ₂·v₂ + ... + φ_k·v_k

The set of transitions that enqueue a new value into F is E_F = {⟨πᵢ, αᵢ⟩ ∈ T : αᵢ,F is enq(v) or enq-deq(v)}; F's enqueue enable is the OR of the φ's of E_F, and F's enqueue data input is the analogous one-hot multiplex of the enqueued values. The set of transitions that dequeue from F is D_F = {⟨πᵢ, αᵢ⟩ ∈ T : αᵢ,F is deq() or enq-deq(v)}; F's dequeue enable is the OR of the φ's of D_F. Similarly, the set of transitions that clear the contents of F is C_F = {⟨πᵢ, αᵢ⟩ ∈ T : αᵢ,F is clear()}; F's clear enable is the OR of the φ's of C_F.

This conceptual construction based on merging M separate versions of next-state logic would result in a large amount of duplicated logic and resources. In reality, our compiler performs extensive common-subexpression elimination and constant-propagation transformations to simplify the synthesized RTL netlist. Furthermore, in the multiple-transitions-per-cycle implementations discussed later, our compiler also analyzes whether two transitions are mutually exclusive to enable time-multiplexed reuse of resources such as arithmetic units and interface ports to the state elements. The two primary inefficiencies of this reference implementation are that: 1) only one transition is permitted per cycle and 2) all transitions must be arbitrated through a centralized arbitration circuit. These two inefficiencies are addressed together by the optimizations in the next two sections.
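To make the merging step concrete, the following sketch (ours, with illustrative signal names) emits the latch-enable and one-hot AND-OR data-mux expressions for a register from its per-transition actions, mirroring the R.LE and R.D equations above.

    def merge_register_updates(updates):
        """`updates` maps an arbitrated-enable signal name (phi) to the next-value
        expression that transition writes; returns (latch_enable, data) strings."""
        latch_enable = " | ".join(updates)
        data = " | ".join(f"({phi} & ({val}))" for phi, val in updates.items())
        return latch_enable, data

    le, d = merge_register_updates({"phi_fetch": "pc + 1", "phi_bz_taken": "target"})
    print("pc.LE =", le)   # phi_fetch | phi_bz_taken
    print("pc.D  =", d)    # one-hot AND-OR mux over the transitions' next values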

D. Correctness of the Reference Implementation

The reference implementation given above is not strictly equivalent to the ATS. In the reference implementation, unless the arbitrator employs true randomization, the implementation is always deterministic. In other words, the implementation can only embody one of the behaviors allowed by the ATS. An implementation is said to implement an ATS correctly if: 1) the implementation's sequence of state transitions corresponds to some execution of the ATS and 2) the implementation maintains liveness. Note that the implementation could enter a livelock if the ATS depends on nondeterminism for forward progress. The reference implementation can use a round-robin priority encoder to ensure weak fairness, that is, a transition is guaranteed to be selected at least once if it remains applicable for a bounded number of consecutive cycles.

IV. CONCURRENT EXECUTION OF CONFLICT-FREE TRANSITIONS

Although the semantics of ATS defines an execution in sequential and atomic steps, a hardware implementation can execute multiple transitions concurrently in one clock cycle. However, in a multiple-transitions-per-cycle implementation, the composite state transition taken in each clock cycle must correspond to some sequentialized execution of the constituent ATS transitions. For example, a necessary condition for an implementation in state s to concurrently execute two applicable transitions T₁ and T₂ is that π₂(δ₁(s)) ∨ π₁(δ₂(s)) holds. In other words, after applying the actions of T₁ to s, the resulting intermediate state must still satisfy π₂, or vice versa. Otherwise, concurrent execution of T₁ and T₂ is not possible since there is not a legal sequential execution of T₁ and T₂ in two consecutive atomic steps.

There are two approaches to carrying out the effects of T₁ and T₂ in the same clock cycle. The first approach adds to the reference implementation in Section III the cascaded combinational logic of the two transitions, in essence introducing a brand new transition that is the composition of T₁ and T₂. However, arbitrary cascading is not always desirable since it leads to an explosion in circuit size and a longer cycle time. In our approach, T₁ and T₂ are executed in the same clock cycle only if the correct final state can be reconstructed from an independent and parallel evaluation of their combinational logic on the same starting state. In other words, the resulting multitransition implementation should only require minimal new hardware and have similar (or even shorter) cycle time as compared to the reference implementation. This section first develops an optimization based on the conflict-free relationship <>CF. Section V next presents another optimization based on the sequential-composability relationship <SC, which can expose additional hardware concurrency.

A. Conflict-Free Transitions

<>CF is a symmetric relationship that imposes a stronger-than-necessary requirement for executing two transitions concurrently. However, the symmetry of <>CF permits a more straightforward implementation than <SC. Given T₁ and T₂ that are both applicable in state s, T₁ <>CF T₂ implies that applying T₁ first does not revoke the applicability of T₂, and vice versa (i.e., π₂(δ₁(s)) and π₁(δ₂(s)) both hold). T₁ <>CF T₂ further implies that the two transitions can be applied in either order in two successive steps to produce the same final state, that is, δ₂(δ₁(s)) = δ₁(δ₂(s)). In order to permit a straightforward implementation, T₁ <>CF T₂ further requires that an implementation could produce the same final state by applying the parallel composition of α₁ and α₂ to the same initial state s. The parallel composition function pc() takes two action lists α₁ and α₂ and returns a new action list that is their pairwise union; pc(α₁, α₂) is undefined if α₁ and α₂ perform conflicting actions on any state element. The conflict-free relationship and the parallel composition function are defined formally below.

Definition 1—Conflict-Free Relationship: Two transitions T₁ = ⟨π₁, α₁⟩ and T₂ = ⟨π₂, α₂⟩ are said to be conflict-free (T₁ <>CF T₂) if and only if

    ∀s: π₁(s) ∧ π₂(s) ⇒
        π₂(δ₁(s)) ∧ π₁(δ₂(s)) ∧
        (δ₂(δ₁(s)) = δ₁(δ₂(s)) = δ_pc(s))

where δ_pc is the functional equivalent of pc(α₁, α₂).

Definition 2—Parallel Composition: pc(α₁, α₂) is the action list obtained by composing, state element by state element, the pair of actions a₁ and a₂ drawn from α₁ and α₂


where the per-element composition functions pc_R, pc_A, and pc_F are

    pc_R(a₁, a₂) = case ⟨a₁, a₂⟩ of
        ⟨ε, a₂⟩ → a₂
        ⟨a₁, ε⟩ → a₁
        otherwise → undefined

    pc_A(a₁, a₂) = case ⟨a₁, a₂⟩ of
        ⟨ε, a₂⟩ → a₂
        ⟨a₁, ε⟩ → a₁
        otherwise → undefined

    pc_F(a₁, a₂) = case ⟨a₁, a₂⟩ of
        ⟨ε, a₂⟩ → a₂
        ⟨a₁, ε⟩ → a₁
        ⟨enq(v), deq()⟩ → enq-deq(v)
        ⟨deq(), enq(v)⟩ → enq-deq(v)
        otherwise → undefined

Below, Theorem 1 extends the composition of two <>CF transitions to multiple pairwise <>CF transitions. Theorem 1 states that it is safe to execute in the same cycle, by parallel composition, any number of applicable transitions that are all pairwise <>CF. Each pair from the transitions selected for concurrent execution must be <>CF, because <>CF is not transitive. The resulting composite state transition corresponds to the sequential execution of the selected applicable transitions in any order.

Theorem 1—Composition of <>CF Transitions: Given a collection of transitions T₁, ..., T_n applicable in state s, if all transitions are pairwise <>CF, then the following holds for any ordering (j₁, j₂, ..., j_n):

    δ_jn( ... δ_j2(δ_j1(s)) ... ) = δ_pc(s)

where δ_pc is the functional equivalent of the parallel composition of α₁, ..., α_n, composed in any order. (A proof for Theorem 1 is given in [9].)

B. Static Determination of <>CF

The arbitrator-synthesis algorithm given later in this section is compatible with a conservative test for <>CF, that is, if the test fails to identify a pair of transitions as <>CF, the algorithm generates a less optimal but still correct implementation. A conservative static determination of <>CF can be made by comparing the read-set and write-set of the transitions. The read-set of a transition is the set of state elements in S read by the expressions in either π or α. The write-set of a transition is the set of state elements in S that are acted on by α. For this analysis, the head and the tail of a FIFO are considered to be separate elements. Using RS(T) and WS(T) to extract the read-set and write-set of a transition T, a sufficient condition that ensures T₁ <>CF T₂ is

    (RS(T₁) ∩ WS(T₂) = ∅) ∧ (RS(T₂) ∩ WS(T₁) = ∅) ∧ (WS(T₁) ∩ WS(T₂) = ∅)

In other words, conservatively, two transitions are <>CF if they do not have data (read-after-write) or output (write-after-write) dependence between them.

If two transitions T₁ and T₂ never become applicable in the same state (i.e., ∀s: ¬(π₁(s) ∧ π₂(s))), then they are said to be mutually exclusive (T₁ <>ME T₂). Two transitions that are <>ME also satisfy the definition of <>CF. An exact test for <>ME requires determining the satisfiability of the expression π₁ ∧ π₂. Fortunately, the π expressions are usually conjunctions of relational constraints on the current values of the state elements. A conservative test that scans two π expressions for contradicting constraints on any one state element works well in practice.

C. Arbitration of Conflict-Free Transitions

By applying Theorem 1, instead of selecting a single transition per clock cycle, an arbitrator (like the one shown in Fig. 4) can select a number of applicable transitions that are all pairwise <>CF. In other words, in each clock cycle, the φ signals should satisfy the condition

    ∀i, j: (φᵢ ∧ φⱼ) ⇒ (Tᵢ <>CF Tⱼ)

where φᵢ is the arbitrated enable signal for transition Tᵢ. Given a set of applicable transitions in a clock cycle, many different subsets of pairwise <>CF transitions could exist. Selecting the optimum subset would require weighing the relative importance of the transitions, which is impossible for the arbitrator-synthesis algorithm to discern without user feedback or annotation. An objective selection metric is to simply maximize the number of transitions executed in each clock cycle.

Below, we describe the construction of an efficient arbitrator that selects multiple pairwise <>CF transitions per clock cycle. In a partitioned arbitrator, the transitions in T are first partitioned into as many disjoint arbitration groups, G₁, ..., G_B, as possible such that

    ∀ Tᵢ ∈ G_a, Tⱼ ∈ G_b, a ≠ b: Tᵢ <>CF Tⱼ

This partitioning ensures that a transition is <>CF with any transition not in the same group and, hence, each arbitration group can be arbitrated independently of other groups. Below, we first describe the formation of <>CF groups and then the arbitration within individual groups.

Step 1) <>CF-Group Partitioning: T can be partitioned into <>CF-arbitration groups by finding the connected components of an undirected conflict graph whose nodes are the transitions and whose edges connect pairs that are not <>CF. Each connected component in the conflict graph is a <>CF-arbitration group. For a given conflict graph, the partitioning into <>CF-arbitration groups is unique. For example, the undirected graph (a) in Fig. 6 depicts the <>CF relationships in an ATS with six transitions. Graph (b) in Fig. 6 gives the corresponding conflict graph. The conflict graph has three connected components, corresponding to the three <>CF-arbitration groups.

Fig. 6. Analysis of conflict-free transitions: (a) conflict-free graph and (b) corresponding conflict graph and its connected components.
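The partitioning in Step 1 is a standard connected-components computation; a sketch of ours follows, parameterized by any conflict-free test (for example, the read/write-set test above).

    def cf_groups(transitions, conflict_free):
        """Partition transitions into CF-arbitration groups: the connected
        components of the conflict graph, whose edges join non-CF pairs."""
        groups, assigned = [], set()
        for t in transitions:
            if t in assigned:
                continue
            group, stack = set(), [t]
            while stack:                      # depth-first search over conflict edges
                u = stack.pop()
                if u in group:
                    continue
                group.add(u)
                stack += [v for v in transitions
                          if v != u and v not in group and not conflict_free(u, v)]
            groups.append(group)
            assigned |= group
        return groups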

Step 2) <>CF-Group Arbitrator: For a given <>CF-arbitration group containing transitions T₁, ..., T_m, the arbitrated enables φ₁, ..., φ_m can be generated locally within the group from π₁, ..., π_m using a priority encoder. However, additional concurrency within a <>CF-arbitration group is possible.

D. Enumerated Group Arbitrator

For example, Arbitration Group 1 in Fig. 6 contains three transitions, T₁, T₂, and T₃, such that T₁ <>CF T₂, but neither T₁ nor T₂ is <>CF with T₃. Although the three transitions cannot be arbitrated independently of each other, T₁ and T₂ can be selected together as long as T₃ is not selected in the same clock cycle. In general, for a given state, each group arbitrator can independently select multiple pairwise-<>CF transitions within its arbitration group.

For a <>CF-arbitration group with m transitions T₁, ..., T_m, φ₁, ..., φ_m can be computed locally by a combinational function that is equivalent to a lookup table indexed by π₁, ..., π_m. The binary data value d₁ ... d_m at the table entry with binary index b₁ ... b_m can be determined by finding a maximal clique in an undirected graph whose nodes N and edges E are defined as follows:

    N = {Tᵢ : bᵢ is asserted}
    E = {(Tᵢ, Tⱼ) : Tᵢ <>CF Tⱼ}

This is the conflict-free graph of the subset of transitions that corresponds to the asserted positions in the binary index. For each Tᵢ that is in the selected clique, assert dᵢ. For example, Arbitration Group 1 from Fig. 6 can be arbitrated using the enumerated lookup table in Fig. 7. The lookup table selects T₁ and T₂ when they are applicable together. This lookup table also reflects an arbitrary decision that T₁ should be selected instead of T₃ if T₁ and T₃ are both applicable in the same cycle. A similar decision is imposed between T₂ and T₃. This freedom in <>CF-group arbitrator generation highlights the fact that, whereas the partitioning into independent <>CF arbitration groups is unique, the maximal asserted subset of d's for a given combination of

Fig. 7. Enumerated arbitration lookup table for Arbitration Group 1 in Fig. 6.

π₁, ..., π_m is not unique. In general, the generation of a <>CF-group arbitrator can benefit from user input in selecting the most "productive" subset (which may not be the maximal subset).
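The lookup-table construction can be sketched with a greedy maximal-clique selection (ours; the greedy order stands in for the arbitrary priority decisions discussed above).

    def build_lut(n, cf):
        """Enumerated group arbitrator: for every combination of asserted pi
        bits, select a maximal pairwise-CF subset by greedily growing a clique
        in the conflict-free graph. cf[i][j] says whether Ti and Tj are CF."""
        lut = {}
        for index in range(2 ** n):
            applicable = [i for i in range(n) if (index >> i) & 1]
            clique = []
            for i in applicable:
                if all(cf[i][j] for j in clique):
                    clique.append(i)
            lut[index] = sum(1 << i for i in clique)
        return lut

    # Arbitration Group 1 of Fig. 6: T1 CF T2, but T3 conflicts with both.
    cf = [[True, True, False],
          [True, True, False],
          [False, False, True]]
    lut = build_lut(3, cf)
    assert lut[0b011] == 0b011   # T1 and T2 selected together
    assert lut[0b111] == 0b011   # T3 deferred when all three are applicable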

E. Performance Gain

When T can be partitioned into arbitration groups, the individual partitioned arbitrators are smaller and faster than the single monolithic encoder in the reference implementation from Section III. The partitioned arbitrators also reduce wiring cost and delay since the π and φ signals of unrelated transitions are not brought together for arbitration.

The property of the parallel composition function ensures that transitions are <>CF only if their actions do not conflict at any state element. Hence, the state-update logic from the reference implementation can be used unchanged in <>CF-arbitrated implementations. Consequently, the combinational delay of the next-state logic is not increased by the <>CF optimization. All in all, a <>CF-arbitrated implementation should achieve better performance than its corresponding reference implementation by allowing more transitions to execute in a clock cycle without increasing the cycle time.

V. CONCURRENT EXECUTION OF SEQUENTIALLY COMPOSABLE TRANSITIONS

As noted in Section IV-A, the <>CF relationship is a sufficient but not necessary condition for concurrent execution of multiple transitions in the same clock cycle. This section presents a more exact requirement to extract additional concurrency from within a <>CF-arbitration group.

A. Sequentially Composable Transitions

Consider the following ATS example with three registers, r₁, r₂, and r₃, and three transitions:

    T₁ = ⟨true, r₁.set(r₂.get())⟩
    T₂ = ⟨true, r₂.set(r₃.get())⟩
    T₃ = ⟨true, r₃.set(r₁.get())⟩

In this ATS, all transitions are always applicable. Furthermore, sc(αᵢ, αⱼ) and its functional equivalent δ_sc are well-defined for any choice of two transitions Tᵢ and Tⱼ. Nevertheless, the <>CF arbitrator proposed in the previous section would not permit any Tᵢ and Tⱼ to execute in the same clock cycle because


δⱼ(δᵢ(s)) ≠ δᵢ(δⱼ(s)) in general. A more careful reexamination should reveal that, for all s, the result of a concurrent execution is always consistent with at least one ordering of sequentially executing Tᵢ and Tⱼ. Hence, the concurrent execution of any two transitions above can be supported efficiently in an implementation. On the other hand, the concurrent execution of all three transitions in a parallel composition is not possible due to circular data dependencies among the three transitions. These considerations are captured by the sequential composability relationship <SC.

<SC is a relaxation of <>CF. In particular, <SC is not symmetric. The intuition behind <SC is that the concurrent execution of a set of transitions does not need to produce the same result as all possible sequential executions of those transitions, just one such sequence. Hence, given T₁ and T₂ that are applicable in state s, T₁ <SC T₂ only requires the concurrent execution of T₁ and T₂ on s to correspond to δ₂(δ₁(s)), but not necessarily to δ₁(δ₂(s)). Similarly, T₁ <SC T₂ only implies π₂(δ₁(s)) and does not require π₁(δ₂(s)).

Concurrent execution of sequentially composable transitions requires an asymmetric composition function sc() to combine the action lists from two transitions. Like pc(), sc() returns a new action list by composing actions on the same element from two action lists. The per-element composition functions sc_R, sc_A, and sc_F are the same as pc_R, pc_A, and pc_F except in two cases where sc() allows conflicting actions to be sequentialized. First, sc_R(set(v₁), set(v₂)) is set(v₂) since the effect of the first action is overwritten by the second in a sequential application. Second, sc_F(a, clear()) returns clear() since, regardless of a, applying clear() leaves the FIFO emptied.

Below, the sequential composability relationship and the sequential composition function are defined formally.

Definition 3—Sequential Composability: Two transitions T₁ = ⟨π₁, α₁⟩ and T₂ = ⟨π₂, α₂⟩ are said to be sequentially composable (T₁ <SC T₂) if and only if

    ∀s: π₁(s) ∧ π₂(s) ⇒ π₂(δ₁(s)) ∧ (δ₂(δ₁(s)) = δ_sc(s))

where δ_sc is the functional equivalent of sc(α₁, α₂).

Definition 4—Sequential Composition: sc(α₁, α₂) is the action list obtained by composing, state element by state element, the pair of actions a₁ and a₂ drawn from α₁ and α₂

where the per-element composition functions sc_R, sc_A, and sc_F are

    sc_R(a₁, a₂) = case ⟨a₁, a₂⟩ of
        ⟨ε, a₂⟩ → a₂
        ⟨a₁, ε⟩ → a₁
        ⟨set(v₁), set(v₂)⟩ → set(v₂)

    sc_A(a₁, a₂) = case ⟨a₁, a₂⟩ of
        ⟨ε, a₂⟩ → a₂
        ⟨a₁, ε⟩ → a₁
        otherwise → undefined

    sc_F(a₁, a₂) = case ⟨a₁, a₂⟩ of
        ⟨ε, a₂⟩ → a₂
        ⟨a₁, ε⟩ → a₁
        ⟨a₁, clear()⟩ → clear()
        ⟨enq(v), deq()⟩ → enq-deq(v)
        ⟨deq(), enq(v)⟩ → enq-deq(v)
        otherwise → undefined

Based on the above definitions, a sufficient condition for T₁ <SC T₂ is

    (RS(T₂) ∩ WS(T₁) = ∅) ∧ (sc(α₁, α₂) is defined)

where RS(T) and WS(T) extract the read-set and write-set of a transition as explained in Section IV-B. Notice that, unlike in the conservative condition for <>CF, <SC permits WS(T₁) ∩ WS(T₂) to be nonempty as long as sc(α₁, α₂) is defined.

Like in the discussion of <>CF, the <SC relationship for two

transitions can be extended to enable multiple transitions to be composed. More specifically, Theorem 2 below extends sequential composition to multiple transitions that are all pairwise <SC and whose transitive closure on <SC is ordered. The ordering requirement is necessary to ensure the selected transitions are not circularly dependent as described in the opening paragraphs of this section.

Theorem 2—Composition of <SC Transitions: Given a sequence of transitions, T₁, ..., T_n, that are all applicable in state s, if Tᵢ <SC Tⱼ for all i < j, then

    δ_n( ... δ₂(δ₁(s)) ... ) = δ_sc(s)

where δ_sc is the functional equivalent of the nested sequential composition sc(sc( ... sc(α₁, α₂) ... , α_n-1), α_n). (A proof for Theorem 2 is given in [9].)

B. Arbitration of <SC Transitions

By applying Theorem 2, the arbitrator of a <>CF-arbitration group (from Step 2, Section IV-C) can select in each state the set of applicable transitions such that there exists an ordering of those transitions, T₁, ..., T_n, where Tᵢ <SC Tⱼ if i < j.

Similar to the formation of the <>CF-arbitration groups in Section IV-C, the construction of a partitioned <SC arbitrator involves finding arbitration subgroups as the connected components in the further relaxed (with fewer edges) conflict graph of a <>CF-arbitration group. In the conflict graph for <SC analysis, the nodes are the transitions of the group, and the edges are

    E = {(Tᵢ, Tⱼ) : ¬(Tᵢ <>CF Tⱼ) ∧ ¬(Tᵢ <SC Tⱼ) ∧ ¬(Tⱼ <SC Tᵢ)}


Fig. 8. Analysis of sequentially composable transitions. (a) Directed <SC graph. (b) Corresponding acyclic directed <SC graph. (c) Corresponding conflict graph and its connected components.

where Tᵢ and Tⱼ range over the transitions of the group. Two nodes are connected if their corresponding transitions are related by neither <>CF nor <SC. As an added complication, the <SC relation considered in this analysis must be acyclic to satisfy the SC-ordering requirement in Theorem 2 (to avoid the concurrent execution of circularly dependent transitions). Therefore, the conflict graph must be based on a version of <SC that has been made acyclic by removing cycle-inducing edges. Given an acyclic <SC, the ordering assumed in Theorem 2 agrees with any topological sort of the corresponding acyclic directed <SC graph. (We refer to this ordering as the SC-ordering below.)

Fig. 8 provides an example of <SC analysis. The directed graph (a) in Fig. 8 depicts the original sequential-composability relationships in a <>CF-arbitration group with five transitions. A directed edge from Tᵢ to Tⱼ implies Tᵢ <SC Tⱼ. The edges between T₁, T₂, and T₃ form a cycle. Graph (b) in Fig. 8 shows the acyclic graph when the edge from T₃ to T₁ is removed. Graph (b) yields the SC-ordering of T₁, T₂, T₃, T₄, and T₅. (The order of T₄ and T₅ can also be reversed.) For a given cyclic graph, multiple acyclic derivations are possible depending on which edge is removed from a cycle. Other possibilities in this example would be to remove the edge from T₁ to T₂ or from T₂ to T₃. Graph (c) in Fig. 8 gives the corresponding conflict graph.

The connected components in the conflict graph form the <SC-arbitration subgroups. Transitions in different <SC-arbitration subgroups are either conflict-free or sequentially composable, and each subgroup can be arbitrated independently. The φ's for the transitions in a <SC-arbitration subgroup can be generated locally within the subgroup using a priority encoder. For example, conflict graph (c) in Fig. 8 has two connected components, corresponding to two <SC-arbitration subgroups. In the unary Arbitration Subgroup 2, the single transition's φ is simply its π, without any arbitration. The φ signals for transitions in Arbitration Subgroup 1 can be generated using a priority encoding of their corresponding π's. More transitions in Arbitration Subgroup 1 can be selected concurrently using enumerated combinational lookup-table logic similar to the one described in Section IV-D.

Notice that, if the cycle in the original <SC graph (a) had not been broken, T₃ would be a third separate component in the conflict graph; this could lead to an incorrect condition where T₁, T₂, and T₃ are enabled together.

C. Complications to the State-Update Logic

When <SC transitions are allowed in the same clock cycle, the register-update logic cannot assume that at most one transition acts on a register in each clock cycle. When multiple actions are enabled for a register, the register-update logic should ignore all except the latest action with respect to the SC-ordering of the <SC-arbitration group. (It can be shown that all transitions that could update the same register in the same cycle are in the same <SC-arbitration group.) For each register R in S, the set of transitions that update R is still U_R = {⟨πᵢ, αᵢ⟩ ∈ T : αᵢ,R ≠ ε}. The register's latch-enable signal remains

    R.LE = φ₁ + φ₂ + ... + φ_k        (over the k transitions in U_R)

However, a new data-input signal must be used with a <SC arbitrator to observe the SC-ordering. The new signal is

    R.D = φ′₁·v₁ + φ′₂·v₂ + ... + φ′_k·v_k

where φ′ᵢ = φᵢ ∧ ¬(φ_j1 ∨ φ_j2 ∨ ...), and the negated expression contains the φ's from the set of transitions in U_R that come after transition i in the SC-ordering. In essence, the register's data input is selected through a prioritized multiplexer that gives higher priority to transitions later in the SC-ordering. Under the current definition of sc(), the update logic for arrays and FIFOs can remain unchanged from Section III.

VI. EVALUATION AND RESULTS

The synthesis procedures outlined in the previous sections have been implemented in the term architectural compiler (TRAC). TRAC accepts TRSPEC descriptions and outputs synthesizable RTL descriptions in the Verilog hardware-description language. This section discusses the results from applying TRSPEC and TRAC to the design and synthesis of a five-stage pipelined implementation of the MIPS R2000 ISA [10].

A. Synchronous Pipeline Synthesis

As in the processor from Section II-C, the MIPS processor is described as a pipeline whose five stages are separated by FIFOs. The exact depth of the FIFOs is not specified in the description and, hence, TRAC is allowed to instantiate one-deep FIFOs (basically registers) as pipeline buffers. Flow-control logic is added to ensure the single-register FIFOs are not overflowed or underflowed by enqueue and dequeue actions.


Fig. 9. Synchronous pipeline with combinational multistage feedback flow control.

In a naive construction, the single-register FIFO is full if its register holds valid data; the FIFO is empty if its register holds a bubble. With only local flow control between neighboring stages, the overall pipeline would contain a bubble in every other stage in the steady state. For example, if pipeline buffers b_i-1 and b_i are occupied and buffer b_i+1 is empty, the operation in stage i+1 would be enabled to advance at the clock edge, but the operation in stage i is held back because buffer b_i appears full during the same clock cycle. The operation in stage i is not enabled until the next clock cycle, when buffer b_i is empty.

In TRAC-synthesized implementations, to allow simultaneous enqueue and dequeue actions, a single-register FIFO is considered empty both when it is actually empty and when it is enabled for dequeuing at the next clock edge. Simultaneous enqueue and dequeue actions are permitted as a sequential composition where the dequeue action is considered to have taken place before the enqueue action. In hardware terms, this creates a combinational multistage feedback path for FIFO flow control that propagates from the last stallable pipeline stage to the first pipeline stage. The cascaded feedback scheme shown in Fig. 9 allows stage i to advance both when pipeline buffer b_i is actually empty and when buffer b_i is going to be dequeued at the coming clock edge. This scheme allows the entire pipeline to advance synchronously on each clock edge. A stall in an intermediate pipeline stage causes all upstream stages to stall at once. A caveat to this scheme is that the multistage feedback path could become the critical path, especially in a deeply pipelined design. In this case, one may want to break the feedback path at select stages by inserting two-deep FIFOs with local flow control. A cyclic feedback path can also be broken by inserting a two-deep FIFO with local flow control.
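The cascaded "appears empty" condition can be sketched in a few lines (ours, with hypothetical stage indexing): stage i may advance if the downstream buffer is actually empty or will be dequeued at the coming clock edge, a condition that ripples combinationally back from the last stage.

    def can_advance(valid, last_stage_deqs, i):
        """valid[k]: single-register buffer k currently holds data;
        last_stage_deqs: the last stallable stage dequeues this cycle.
        Models the multistage feedback chain of Fig. 9."""
        if i == len(valid) - 1:
            return last_stage_deqs               # base case of the feedback chain
        return (not valid[i + 1]) or can_advance(valid, last_stage_deqs, i + 1)

    full = [True, True, True]                    # every pipeline buffer occupied
    print(can_advance(full, last_stage_deqs=True,  i=0))   # True: all stages move
    print(can_advance(full, last_stage_deqs=False, i=0))   # False: all stall at once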

B. Synthesis Results

The synthesis results are presented for a TRSPEC description that implements the MIPS R2000 integer ISA with the exception of multiply/divide, partial-word or nonaligned loads/stores, coprocessor interfaces, privileged instructions, and exception modes. Memory load instructions and branch/jump instructions also deviate from the ISA by not obeying the required delayed-execution semantics. (The complete TRSPEC description of the five-stage pipelined processor model is given in [9].)

The processor description is compiled by TRAC into a synthesizable Verilog RTL description, which is subsequently

TABLE I
SUMMARY OF MIPS CORE SYNTHESIS RESULTS

compiled by the Synopsys Design Compiler to target both the Synopsys CBA and LSI Logic 10K Series technology libraries. Table I summarizes the prelayout area and speed estimates reported by Synopsys. The row labeled "TRSPEC" characterizes the implementation synthesized from the TRSPEC description. The row labeled "Hand-coded RTL" characterizes the implementation synthesized from a hand-coded Verilog description of the same five-stage pipelined microarchitecture. The results indicate that the TRSPEC description produces an implementation that is competitive in size and speed with the implementation resulting from the hand-coded Verilog description. This similarity should not be surprising because, after all, both descriptions describe the same microarchitecture, albeit following very different design abstractions and methodologies. The same conclusion has also been reached in comparisons of other designs and when we targeted the designs for implementation on FPGAs.

The TRSPEC and the hand-coded Verilog descriptions are similar in length (790 versus 930 lines of source code), but the TRSPEC description was developed in less than one day (eight hours), whereas the hand-coded Verilog description required nearly five days to complete. The TRSPEC description can be translated in a literal fashion from an ISA manual, whereas the hand-coded Verilog description faces a greater representation gap between the ISA specification and RTL. The RTL designer also needs to manually inject implementation-level information that is not part of the ISA-level specification. In a TRSPEC design flow, the designer can rely on TRAC to correctly complete the design with suitable implementation decisions.

C. Current Synthesis Limitations

The synchronous hardware-synthesis procedure presented in this paper has two important limitations. First, the procedure always maps the entire effect of an operation into a single clock cycle. Second, the procedure always maximizes hardware concurrency. In many hardware applications, there are restrictions on the amount and the type of resources available. Under those constraints, it may not be optimal or even realistic to execute an operation in one clock cycle. In future work, we are developing another synthesis approach, where the effect of an operation can be executed over multiple clock cycles, while also being overlapped with the execution of other multiple-cycle operations. The key issue in this future work is, again, how to ensure the resulting implementation correctly maintains the atomic and sequential execution semantics of the operation-centric hardware abstraction.

VII. RELATED WORK

In comparison to operation-centric abstractions, traditional state-centric abstractions more closely reflect the true nature


(e.g., explicit concurrency and synchronization) of the underlying hardware implementation. Hence, they are relatively simpler to synthesize into hardware implementations, and they afford designers greater control over the details of the synthesized outcome. On the other hand, the additional details exposed by state-centric abstractions also place a greater design burden on the designers. In a synchronous state-centric framework, a designer must explicitly manage the exact cycle-by-cycle interactions between multiple concurrent state machines. Design mistakes are common in coordinating interactions between two state machines because one cannot directly couple related transitions in different state machines. It is also difficult to design or modify one state machine in a system without considering its, often implicit, interactions with the rest of the system. The atomic and sequential execution semantics of the operation-centric abstraction removes these low-level design issues from the designer; instead, the abstraction allows these low-level decisions to be offloaded to an optimizing compiler.

The operation-centric ATS formalism in this paper is similar in semantics to guarded commands [7], synchronized transitions with asynchronous combinators [13], and the UNITY language [5]. ATS extends these earlier models with support for hardware state primitives, such as arrays and FIFOs. In particular, the use of FIFOs simplifies the description of pipelined designs (as explained in Section VI-A). ATS serves as an intermediate representation for the compilation of source-level descriptions in the TRSPEC language [9]. TRSPEC is an adaptation of TRS [1] for describing a finite-state transition system. In addition to TRSPEC, other language syntax can also be layered on top of the basic ATS abstraction. For example, the Bluespec language is an operation-centric hardware-description language that has a Haskell-like syntax [15]. In addition to a functional-language syntax, Bluespec also leverages sophisticated functional-language features in the declaration of state, operations, and module interfaces.

Staunstrup and Greenstreet describe in [17] the synthesis of both synchronous and asynchronous (delay-insensitive) circuits from a synchronized-transitions program. In their synchronous synthesis, the entire effect of a transition is mapped into a single clock cycle, and all enabled transitions are allowed to execute in the same clock cycle. To guarantee that the atomic semantics of synchronized transitions is not violated, they enforce the CREW (concurrent-read-exclusive-write) requirement on a synthesizable set of synchronized transitions. Under the CREW requirement, a pair of transitions must be mutually exclusive in their predicate conditions if they have data (read-after-write) or output (write-after-write) dependence between them. If initially two transitions violate CREW, one of their predicate conditions must be augmented by the negation of the other so the two transitions become mutually exclusive. In this way, the final synthesized system exhibits a deterministic behavior that is acceptable according to the original nondeterministic system. Dhaussy et al. describe a similar approach in their synthesis of UCA, a language derived from UNITY, where a program with conflicts is transformed into a program without conflicts prior to synthesis [6].

Staunstrup and Greenstreet's CREW relationship is similar to our conflict-free relationship, but it is more restrictive than both the conflict-free and the sequential-composability requirements given in Theorems 1 and 2. In particular, the sequential-composability optimization enables significantly more parallelism to be exploited in an implementation than CREW. However, in the case when only the conflict-free optimization is enabled in the TRAC compiler, TRAC's conservative static test for conflict-free relationships (discussed in Section IV-B) yields essentially the same scheduled behavior as CREW. Nevertheless, our approach further differs from Staunstrup and Greenstreet's in how the statically determined conflict resolutions are realized in hardware. Resolving conflicting transitions by statically transforming the predicate conditions is less efficient than the partitioned-arbitrator approach in this paper because augmenting predicate conditions can lead to a lengthy final predicate expression if a transition conflicts with many transitions. The complexity of the predicate expressions can directly impact an implementation's size and critical-path delay.

There is also some tangential relationship between the ATS formalism and synchronous languages [2], [3], exemplified by Esterel [4], Lustre [8], and Signal [11]. In both models, state is updated atomically at discrete time steps, but the abstract concept of time is different in the two formalisms. Synchronous languages are based on a calculus of time and require explicit expression of concurrency with deterministic behavior. On the other hand, an operation-centric abstraction employs a sequential execution semantics with nondeterminism. Operation-centric hardware synthesis must automatically discover and exploit the implicit parallelism that exists in a sequentially conceived and interpreted description. Hence, as presented in Sections IV and V, the key to synthesizing an efficient implementation from an operation-centric description lies precisely in how to take advantage of the nondeterminism to select an appropriate subset of enabled transitions for concurrent execution in each cycle. Furthermore, synchronous languages have been used primarily for describing the control part of a design; thus, it is difficult to compare the two models using a processor-like example, which employs rich datapaths.
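
To make the per-cycle selection concrete, the following Python sketch greedily picks a set of pairwise compatible enabled rules, reusing the Rule sketch shown earlier. It is only a software stand-in for intuition: the paper's actual approach compiles this choice into static, hardwired arbitration logic (Sections IV and V), and the pairwise conflict relation here is assumed to be precomputed.

    # Illustrative only: a greedy per-cycle selection of pairwise
    # compatible enabled rules. "conflicts" is an assumed precomputed
    # set of conflicting (name, name) pairs; "rules" are Rule objects
    # as in the earlier sketch.
    def schedule_cycle(rules, state, conflicts):
        chosen = []
        for r in rules:
            if r.pred(state) and all((r.name, c.name) not in conflicts and
                                     (c.name, r.name) not in conflicts
                                     for c in chosen):
                chosen.append(r)
        return chosen   # these rules may all fire in the same clock cycle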

VIII. CONCLUSION

Ultimately, the goal of a high-level description is to provide a design representation that is easy for a designer to comprehend and reason about. Although a concise notation is helpful, the utility of a high-level description framework has to come from the elimination of some lower-level details. It is in this sense that an operation-centric design framework can offer an advantage over traditional RTL design frameworks. In RTL languages, hardware concurrency must be expressed and managed explicitly, whereas in an operation-centric language, parallelism and concurrency are implicit at the source level and are later discovered and managed by an optimizing compiler.

This paper develops the theories and algorithms necessary to synthesize an efficient synchronous hardware implementation from an operation-centric description. To facilitate analysis and synthesis, this paper defines the ATS formalism, which is an operation-centric intermediate representation when mapping TRSPEC and other operation-centric source-level languages to hardware implementations. This paper explains how to implement an ATS as a synchronous finite-state machine. The crux of the synthesis problem lies in finding a valid composition of the ATS transitions in a coherent finite-state machine that carries out as many ATS transitions concurrently as possible. This paper first presents a straightforward reference implementation and then offers two optimizations based on the parallelization of conflict-free and sequentially composable transitions. The synthesis results from a pipelined processor design example show that an operation-centric framework allows a significant reduction in design time and effort while achieving comparable implementation quality as traditional register-transfer-level design flows.

REFERENCES

[1] F. Baader and T. Nipkow, Term Rewriting and All That. Cambridge, U.K.: Cambridge Univ. Press, 1998.

[2] A. Benveniste and G. Berry, “The synchronous approach to reactive and real-time systems,” Proc. IEEE, vol. 79, pp. 1270–1282, Sept. 1991.

[3] A. Benveniste, P. Caspi, S. Edwards, N. Halbwachs, P. Le Guernic, and R. de Simone, “The synchronous languages twelve years later,” Proc. IEEE, vol. 91, pp. 64–83, Jan. 2003.

[4] G. Berry, “The foundations of Esterel,” in Proof, Language and Interaction: Essays in Honour of Robin Milner. Cambridge, MA: MIT Press, 2000, pp. 425–454.

[5] K. M. Chandy and J. Misra, Parallel Program Design. Reading, MA: Addison-Wesley, 1988.

[6] P. Dhaussy, J.-M. Filloque, and B. Pottier, “Global control synthesis for an MIMD/FPGA machine,” in Proc. IEEE Workshop FPGAs Custom Comput. Mach., 1994, pp. 72–81.

[7] E. W. Dijkstra, “Guarded commands, nondeterminacy, and formal derivation of programs,” Commun. ACM, vol. 18, no. 8, pp. 453–457, 1975.

[8] N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud, “The synchronous data flow programming language LUSTRE,” Proc. IEEE, vol. 79, pp. 1305–1320, Sept. 1991.

[9] J. C. Hoe, “Operation-centric hardware description and synthesis,” Ph.D. thesis, Dept. Elect. Eng. Comput. Sci., Massachusetts Inst. Technol., Cambridge, June 2000.

[10] G. Kane, MIPS R2000 RISC Architecture. Englewood Cliffs, NJ: Prentice-Hall, 1987.

[11] P. Le Guernic, T. Gautier, M. Le Borgne, and C. Le Maire, “Programming real-time applications with SIGNAL,” Proc. IEEE, vol. 79, pp. 1321–1336, Sept. 1991.

[12] T. W. S. Lee, C. J. Seger, and M. R. Greenstreet, “Automatic verification of asynchronous circuits,” IEEE Design Test Comput., vol. 12, pp. 24–31, Spring 1995.

[13] A. P. Ravn and J. Staunstrup, “Synchronized transitions,” Dept. Comput. Sci., Univ. Aarhus, Aarhus, Denmark, Tech. Rep. AAR-219, 1987.

[14] V. M. Rodrigues and F. R. Wagner, “Synchronous transitions and their temporal logic,” in Proc. Workshop Métodos Formais, 1998, pp. 84–89.

[15] A brief description of Bluespec, Sandburst Corporation. [Online]. Available: http://www.bluespec.org/description.html

[16] J. Staunstrup and M. R. Greenstreet, “From high-level descriptions to VLSI circuits,” BIT, vol. 28, no. 3, pp. 620–638, 1988.

[17] J. Staunstrup and M. R. Greenstreet, “Synchronized transitions,” in Formal Methods for VLSI Design. Amsterdam, The Netherlands: Elsevier, 1990, pp. 71–128.

James C. Hoe (S’91–M’00) received the B.S. degree in electrical engineering and computer science from the University of California, Berkeley, in 1992 and the M.S. and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, in 1994 and 2000, respectively.

Since 2000, he has been an Assistant Professor of Electrical and Computer Engineering at Carnegie Mellon University, Pittsburgh, PA. His research interests include many aspects of computer architecture and digital hardware design. His present focus is on developing high-level hardware description and synthesis technologies to simplify hardware development. He is also working on innovative processor microarchitectures to address issues in security and reliability.

Arvind (SM’85–F’95) received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Kanpur, in 1969 and the M.S. and Ph.D. degrees from the University of Minnesota, Twin Cities, in 1972 and 1973, respectively.

He is the Johnson Professor of Computer Science and Engineering at the Massachusetts Institute of Technology, Cambridge, where he has taught since 1979. He has contributed to the development of dynamic dataflow architectures, and together with Dr. R. S. Nikhil published the book Implicit Parallel Programming in pH (San Francisco, CA: Morgan Kaufmann, 2001). His current research interest is in high-level specification, modeling, and the synthesis and verification of architectures and protocols.