
Design And Verification Of Lazy And Hybrid Implementations Of The SELF Protocol

ABSTRACT
Synchronous elasticization converts an ordinary clocked design into a latency-insensitive one. It uses communication protocols such as SELF. Compared to lazy implementations, eager SELF has no combinational cycles and can have a performance advantage. Yet, it consumes more area and power. This paper investigates the lazy SELF specifications, implementations and verification. It also introduces a hybrid implementation that combines the advantages of both eager and lazy implementations.

1. INTRODUCTION
Latency insensitivity (LI) [2] allows designs to tolerate arbitrary latency variations in their computation units as well as communication channels. This can be especially important for interfaces whose actual latency cannot be accurately estimated or is required to be flexible. An example of the former is a system with very long interconnects: interconnect latency is affected by many factors that cannot be accurately estimated before the final layout [3]. On the other hand, some applications require flexible interfaces that tolerate variable latencies. Examples include interfaces to variable-latency ALUs, memories or networks on chip. It has been reported that applying flexible-latency design to the critical block of an Intel SoC (H.264 CABAC) can achieve a 35% performance advantage [8].

Synchronous elasticization [6, 4, 10] is a technique for converting an ordinary clocked design into an LI design. Unlike asynchronous circuits, synchronous elastic circuits can be easily designed with conventional design flows using static timing analysis (STA) [4, 12]. The Synchronous Elastic Flow (SELF) [6] is a communication protocol for synchronous elastic designs. An eager implementation of the SELF protocol [6] has no combinational cycles and offers performance advantages in some designs compared to lazy implementations. However, it is more expensive in terms of area and power consumption. The area and power overheads of the LI control network may become prohibitive in some cases [4]. Measurements of a MiniMIPS processor fabricated in a 0.5 µm node show that elasticization with an eager SELF implementation results in area, dynamic power and leakage power penalties of 29%, 13% and 58.3%, respectively [1]. Hence, minimizing these overheads is a primary concern. An algorithm that minimizes the total number of control steering units (i.e., joins and forks) in the LI control network (and, hence, its area and power overheads) has been proposed in [11].

Due to its lower area and power overheads, a lazy SELF implementation can be an attractive solution. However, it is rarely studied in the literature, because lazy SELF suffers from combinational cycles and runtime disadvantages in some systems [6, 1].

1.1 Contribution
In this paper we formally and exhaustively investigate the lazy SELF protocol specifications, different implementations and verification. This includes introducing new lazy join and fork structures as well as verifying existing ones. We then introduce a novel hybrid implementation that combines the advantages of both eager and lazy implementations. The hybrid SELF essentially avoids some of the redundancy of the eager implementation without any performance loss. Moreover, it is free of combinational cycles. We demonstrate the hybrid SELF on a MiniMIPS processor case study. It shows that our hybrid implementation achieves the same runtime and maximum speed as an all-eager implementation with a reduction of 31.8% in area and up to 32.5% and 32.1% in dynamic and leakage power consumption, respectively.

2. SELF OVERVIEW
An elastic system uses Elastic Buffers (EBs) as synchronization elements, the counterparts of flip-flops in ordinary clocked systems. Fig. 1 shows a block diagram of an EB [1]. EB controllers communicate through control channels. A control channel in the SELF protocol is composed of two signals. 'Valid' (V), in the forward direction, indicates the validity of the data coming from the transmitter. 'Stall' (S), in the backward direction, indicates that the receiver is not ready to receive the incoming data on the channel. SELF identifies three different states on a communication channel [6]:

1. Transfer (T): V&!S. The transmitter provides valid data and the receiver can accept it.

2. Idle (I): !V. The transmitter does not provide valid data. In this paper we differentiate between two substates of the Idle (I) state, namely I0 (!V&!S) and I1 (!V&S).

3. Retry (R): V&S. The transmitter provides valid data, but the receiver cannot accept it. The transmitter will sustain the valid data until the receiver is able to read it.

Hence, the SELF protocol


Figure 1: An EB implementation.

Figure 2: An n-to-1 lazy join.

Figure 3: Sample forks. (a) A 1-to-n EFork. (b) A 1-to-n lazy fork.

prohibits a transition from the R state to the I state. When the design contains more EBs than a single transmitter and a single receiver, a control network is required to connect the different EBs. A control network is composed of control channels connected through control steering units, namely join and fork components. A join element joins two or more incoming control channels into one output control channel. A fork element forks one incoming control channel into two or more output control channels. Fork and join components are each represented in this paper by a dedicated symbol (⊗ for joins). Finally, the SELF protocol used over the control channels can be implemented in an eager or a lazy flavor. For brevity, we will refer to these implementations simply as eager (or lazy) SELF.
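The channel states of this section can be written down as a small classifier. The following is only an illustrative sketch: the names `ChannelState` and `classify` are hypothetical and do not appear in the SELF literature; only the Boolean definitions of T, I0, I1 and R are taken from the text.

```python
from enum import Enum


class ChannelState(Enum):
    """SELF channel states, with Idle split into the I0/I1 substates."""
    TRANSFER = "T"   # V & !S
    IDLE0 = "I0"     # !V & !S
    IDLE1 = "I1"     # !V & S
    RETRY = "R"      # V & S


def classify(valid: bool, stall: bool) -> ChannelState:
    """Map one (valid, stall) sample of a control channel to its state."""
    if valid:
        return ChannelState.RETRY if stall else ChannelState.TRANSFER
    return ChannelState.IDLE1 if stall else ChannelState.IDLE0
```

For example, `classify(True, True)` yields the Retry state, in which the transmitter must keep its data valid.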

3. EAGER SELF PROTOCOL
An eager SELF implementation uses eager forks (EForks) and lazy joins. We defer the discussion of lazy joins to Section 4.2. Fig. 3a [1] shows an n-output extension of the EFork proposed in [6]. Once a (valid) data token is available at an EFork stem, the EFork immediately passes it to all its branches. Meanwhile, the EFork stalls until all its branches receive the data token. This gives an early start to the branches that are ready (i.e., whose stall signals are zero). Hence, an EFork can result in a performance advantage over lazy forks in some systems. However, an EFork incorporates one flip-flop per branch that is triggered every clock cycle, even if there is no activity in the control network. Moreover, eager forks have higher logic complexity than lazy forks. All of this renders the EFork expensive in terms of both area and power consumption.

4. LAZY SELF PROTOCOL

Figure 4: Lazy fork specifications (Vr1).

Figure 5: Lazy fork verification setup.

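Table 1: Mapping between published lazy forks and joins and those of this paper

| Published fork | This paper | Published join | This paper |
|----------------|------------|----------------|------------|
| Fork [10]      | LF0000     | Join [10]      | LJ0000     |
| Fork [6]       | LF0000     | Join [6]       | LJ0000     |
| LFork [1]      | LF0000     | LJoin [1]      | LJ0000     |
| LKFork [1]     | LF0001     | LKJoin [1]     | LJ1111     |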

4.1 Lazy Fork
A sample lazy fork is shown in Fig. 3b [10, 6, 1]. A lazy fork does not propagate a (valid) data token from its stem to its ready branches until all branches are ready to receive.

4.1.1 Specification
We take a 2-output-channel fork as an example:

1. Sl = Sr1|Sr2.

2. (Vr1 = 0) if ((Vl = 0)&(Sr1 = 0)).

3. (Vr1 = 0) if ((Vl = 1)&(Sr1 = 0)&(Sr2 = 1)).

4. (Vr1 = 1) if ((Vl = 1)&(Sr1 = 0)&(Sr2 = 0)).

5. No R → I transition on any channel.

6. No I0 → I1 transition on any channel.

In other words, the stem stalls if any of the branches stalls (Spec. 1). Spec. 2 prevents writing invalid data. Spec. 3 prevents overwriting valid data. Spec. 4 guarantees proper writing of valid data. Specs. 5 and 6 are required for SELF protocol compliance. Note that Spec. 6 is not explicit in the SELF specifications. However, it can be easily inferred from the EB specifications in [6] or the EHB (elastic half buffer) in [9]. Specs. 2 - 4 are illustrated in the Karnaugh map (KM) of Fig. 4. The numbers at some square corners of the KM are the corresponding specification indices. A similar KM can be drawn for Vr2. The KM has 4 don't-care minterms denoted by x0, x1, x2, x3. This allows for 16 different implementations of lazy forks that meet Specs. 2 - 4. However, as we will show shortly, only 2 out of these 16 meet Specs. 5 and 6 as well. We denote each implementation as LFx0x1x2x3 (e.g., LF0101, LF1111, etc.). Table 1 maps all the published lazy forks that we know of to those of this paper.
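The count of don't-care minterms can be reproduced by enumerating the eight input combinations of (Vl, Sr1, Sr2) against Specs. 2 - 4. The sketch below is illustrative only: the function names are hypothetical, and it does not attempt to reproduce the x0..x3 ordering of Fig. 4. The candidate `lf0000_vr1` uses the Vr1 expression of the LF0000 model given in Section 4.1.2.

```python
from itertools import product


def vr1_required(vl, sr1, sr2):
    """Value that Specs. 2-4 force on Vr1, or None where they say nothing."""
    if not vl and not sr1:
        return 0  # Spec. 2: no invalid data is written
    if vl and not sr1 and sr2:
        return 0  # Spec. 3: no valid data is overwritten
    if vl and not sr1 and not sr2:
        return 1  # Spec. 4: valid data is written
    return None   # don't-care minterm


def lf0000_vr1(vl, sr1, sr2):
    """Vr1 of LF0000: every don't-care entry is set to 0."""
    return int(vl and not sr1 and not sr2)


inputs = list(product((0, 1), repeat=3))
dont_cares = [i for i in inputs if vr1_required(*i) is None]
n_implementations = 2 ** len(dont_cares)

# LF0000 must agree with the specification wherever it is constrained.
lf0000_ok = all(vr1_required(*i) in (None, lf0000_vr1(*i)) for i in inputs)
```

All four unconstrained minterms have Sr1 = 1, which gives the 2^4 = 16 candidate implementations mentioned above. Specs. 5 and 6 are sequential properties and are not covered by this combinational check.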

4.1.2 Verification
We use the setup of Fig. 5 to verify the compliance of the different lazy fork implementations with the SELF protocol. The whole structure is modeled and passed to a symbolic model checker, NuSMV [5]. The fork stem and its two branches are connected to three elastic buffers (EBs), respectively. A variable delay element (VDELAY) is added at each input of the fork to model arbitrary input arrival times. The blocks shown, as well as a clock generator, are all connected synchronously in NuSMV.

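For example, we use the following unit-delay LF0000 model:

MODULE LF0000_u(Vl,Sr1,Sr2)

-- Each output follows its inputs after one time unit.
VAR Sl:boolean; VAR Vr1:boolean; VAR Vr2:boolean;

ASSIGN

-- Spec. 1: the stem stalls if any branch stalls.
init(Sl):= 0; next(Sl):= Sr1 | Sr2;

-- A branch is valid only if the stem is valid and no branch stalls.
init(Vr1):= 1; next(Vr1):= Vl & (!Sr1) & (!Sr2) ;

init(Vr2):= 1; next(Vr2):= Vl & (!Sr1) & (!Sr2) ;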

The EB implementation is the same as in [6]; its NuSMV model is omitted due to space limitations. The VDELAY element has an inertial, nondeterministic, yet bounded (from 1 to 3 units) delay. Its NuSMV model is also omitted due to space limitations. Finally, for each of the 16 lazy fork implementations, the following properties are checked:

1. Deadlock freedom. To check that the lazy forks are deadlock free, we follow the same methodology as in [13]. We check that, for every component in Fig. 5, at least two states can be reached from any other reachable state. For example, inside the LF0000_u module we check the following CTL properties:

SPEC AG EF (Vr1 = 1&Vr2 = 1&Sl = 0);

SPEC AG EF (Vr1 = 0&Vr2 = 0&Sl = 0);

Note that a state in LF0000_u is defined by the three state variables: Vr1, Vr2 and Sl.

2. No R → I transition on any of the input/output channels. To check that none of the channels (i.e., channels A, A1, A2 in Fig. 5) goes through an R → I transition, we check the following PSL properties:

DEFINE R_A :=Clk & VA & SA ; -- Retry on channel A

DEFINE I_A :=Clk & !VA ; -- Idle on channel A

...-- We similarly define R_A1, I_A1, R_A2, I_A2

PSLSPEC never {[*];R_A[+]; !Clk[+]; Clk[+]; I_A[+];!Clk[+]};

-- We do the same for the other channels as well.

Note that, by the specification of the EB [6], all the control signals (valid and stall) change after the positive edge of the clock. Nonetheless, glitches at the fork outputs can happen due to different input arrival times (e.g., Sr1 may change before Sr2, causing a glitch on the Vr1 output). Therefore, and without loss of generality, we use a clock generator with a clock period long enough for all possible glitches to settle during its positive phase. Hence, the PSL properties above check that no channel undergoes a transition from a Retry state at the end of the positive phase of one clock cycle (i.e., after all possible glitches settle) to an Idle state at the end of the positive phase of the following clock cycle. The NuSMV model of the clock generator is omitted due to space limitations.

3. No I0 → I1 transition on any of the input/output channels. Similarly, to check that none of the channels (i.e., channels A, A1, A2 in Fig. 5) goes through an I0 → I1 transition, we check the following PSL properties:

DEFINE I0_A :=Clk & !VA & !SA ; -- Idle0 on A

DEFINE I1_A :=Clk & !VA & SA ; -- Idle1 on A

...-- We similarly define I0_A1, I1_A1, I0_A2, I1_A2

PSLSPEC never {[*];I0_A[+]; !Clk[+]; Clk[+]; I1_A[+];!Clk[+]};

-- We do the same for the other channels as well.

Figure 6: Lazy join specification (Sl1).

Figure 7: Lazy join verification setup.

All of the 16 lazy fork implementations fail at least one of the checks except LF0000 and LF0001. The EFork of Section 3 (when verified using the same setup) passes all the checks and is, hence, compliant with the SELF protocol.
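The two transition checks (no R → I, no I0 → I1) can also be illustrated on a recorded simulation trace. This is a minimal sketch and not a substitute for the model checking above: it assumes one settled (valid, stall) sample per clock cycle on a single channel, and the function name `forbidden_transitions` is hypothetical.

```python
def forbidden_transitions(trace):
    """Scan per-cycle (valid, stall) samples of one channel.

    Returns a list of (cycle_index, kind) for every R -> I or I0 -> I1
    transition, where cycle_index is the index of the later sample.
    """
    found = []
    for cycle, (prev, curr) in enumerate(zip(trace, trace[1:]), start=1):
        prev_valid, prev_stall = prev
        curr_valid, curr_stall = curr
        # Retry (V & S) followed by any Idle state (!V)
        if prev_valid and prev_stall and not curr_valid:
            found.append((cycle, "R->I"))
        # Idle0 (!V & !S) followed by Idle1 (!V & S)
        if not prev_valid and not prev_stall and not curr_valid and curr_stall:
            found.append((cycle, "I0->I1"))
    return found
```

A compliant trace such as R, R, T, I0 yields an empty list, whereas a trace that drops valid while stalled is reported at the cycle where valid goes low.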

4.2 Lazy Join
A sample lazy join is shown in Fig. 2 [1]. A lazy join has to wait for all its input branch channels to carry valid data before it asserts its output channel valid signal.

4.2.1 Specification
We take a 2-input-channel join as an example:

1. Vr = Vl1&Vl2.

2. (Sl1 = 1) if ((Sr = 1)&(Vl1 = 1)).

3. (Sl1 = 1) if ((Sr = 0)&(Vl1 = 1)&(Vl2 = 0)).

4. (Sl1 = 0) if ((Sr = 0)&(Vl1 = 1)&(Vl2 = 1)).

5. No R → I transition on any channel.

6. No I0 → I1 transition on any channel.

In other words, the output data is valid only if both input branch data are valid (Spec. 1). Specs. 2 - 3 prevent overwriting valid data. Spec. 4 guarantees proper writing of valid data. Specs. 5 and 6 are required for SELF protocol compliance. Specs. 2 - 4 are illustrated in the Karnaugh map (KM) of Fig. 6. A similar KM can be drawn for Sl2. The KM has 4 don't-care minterms denoted by x0, x1, x2, x3. This allows for 16 different implementations of lazy joins that meet Specs. 2 - 4. However, as we will show shortly, only 6 out of these 16 meet Specs. 5 and 6 as well. We denote each implementation as LJx0x1x2x3 (e.g., LJ0101, LJ1111, etc.). Table 1 maps all the published lazy joins that we know of to those of this paper.
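As a hedged illustration of how Specs. 2 - 4 constrain Sl1, the sketch below checks two candidate joins against every input combination of (Sr, Vl1, Vl2). The Boolean expressions are derived here by setting all four don't-care entries to 0 or to 1, which is assumed to correspond to LJ0000 and LJ1111; the function names are hypothetical and the actual gate-level structures of Fig. 2 are not reproduced.

```python
from itertools import product


def sl1_required(sr, vl1, vl2):
    """Value that Specs. 2-4 force on Sl1, or None where they say nothing."""
    if sr and vl1:
        return 1  # Spec. 2
    if not sr and vl1 and not vl2:
        return 1  # Spec. 3
    if not sr and vl1 and vl2:
        return 0  # Spec. 4
    return None   # don't-care minterm (all have Vl1 = 0)


def lj_all_zero_sl1(sr, vl1, vl2):
    """Don't cares set to 0 (assumed to be LJ0000)."""
    return int(vl1 and (sr or not vl2))


def lj_all_one_sl1(sr, vl1, vl2):
    """Don't cares set to 1 (assumed to be LJ1111)."""
    return int(not (vl1 and vl2 and not sr))


def meets_specs_2_to_4(sl1_fn):
    """True if sl1_fn agrees with Specs. 2-4 on every constrained input."""
    return all(
        sl1_required(*i) in (None, sl1_fn(*i))
        for i in product((0, 1), repeat=3)
    )
```

Both candidates pass this combinational check; whether a candidate also meets Specs. 5 and 6 depends on its sequential behavior in the verification setup of Section 4.2.2.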

4.2.2 Verification
Similar to the lazy fork verification in Sec. 4.1.2, we use the structure of Fig. 7 to verify the different lazy join implementations. We check the following properties: 1. Deadlock freedom. 2. No R → I transition and 3. No I0 → I1 transition on any of the input/output channels. All of the 16 lazy join implementations fail at least one of the checks except LJ0000, LJ0010, LJ0011, LJ1010, LJ1011 and LJ1111.

4.3 Lazy Fork And Join Combinations
Unlike eager forks, lazy forks have no state-holding elements (e.g., flip-flops). Hence, arbitrary connections of lazy joins and forks in a control network typically result in combinational loops. The authors of [1] studied the loops formed under 4 different lazy fork-join combinations (see Table 1). They classified the loops that can occur into three categories:

1. Deadlock Loops - D: A loop is a deadlock loop if, under certain scenarios (sequences of inputs), its internal signals can get stuck at certain values.

2. Logically Unstable Loops - LU: A loop that has an odd number of inverting elements. It behaves as a ring oscillator under certain input combinations.

3. Potentially Unstable Loops - PU: A loop that is not a deadlock loop and has an even number of inverting elements. However, since the loop has more than one input, some input sequences may inject both logic one and logic zero values into the loop simultaneously. This can result in both values oscillating around the loop.

In this paper we presented 2 SELF-compliant forks and 6 SELF-compliant joins. It is easy to show that the resulting 12 fork-join combinations suffer from the same problems. A solution for PU loop oscillation may be found by forcing network-specific timing constraints on the whole control network [1]. Nonetheless, in Section 5 we propose a solution not only for PU loop oscillations, but for all lazy fork-join combinational loop problems.
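The parity criterion that separates LU loops from the other two categories is simple enough to sketch. The helper below is hypothetical and assumes the loop has already been extracted as a list of per-element flags saying whether each element inverts; it cannot distinguish D from PU loops, which depends on the loop's input sequences.

```python
def is_logically_unstable(inverting_flags):
    """True if the loop contains an odd number of inverting elements (LU)."""
    return sum(bool(flag) for flag in inverting_flags) % 2 == 1
```

For instance, a loop with a single inverting element is reported as LU, while a loop with two inverting elements is not.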

5. HYBRID SELF PROTOCOL
We verified that the eager fork presented in [6], as well as the 2 lazy forks and 6 lazy joins presented in this paper, are compliant with the SELF protocol. Therefore, eager and lazy forks (and joins) can be correctly connected together. Eager forks exhibit no cycles and can achieve better runtime in some systems. However, they consume more power and area than lazy forks. Hence, we propose a hybrid SELF implementation that uses both eager and lazy forks, has no cycles and achieves the same runtime as an all-eager implementation. We propose to implement the whole control network with lazy forks and joins, and to use eager forks only in selected locations, for the two purposes described in Sections 5.1 and 5.2.

5.1 Cycle Cutting
Section 4.3 showed that lazy fork-join combinations can result in combinational cycles. Such cycles can cause deadlock or oscillation. These cycles can be completely avoided by replacing lazy forks with eager forks in places where cycles exist. Cycles can be easily identified either by hand analysis of the control network or through synthesis tools (e.g., the report_timing -loops command in Design Compiler).

LF0001 has no internal path that connects the stall of any of its branches to the corresponding valid. This reduces the number of cycles substantially. Similarly, LJ1011 has no internal path that connects the valid of any of its input channels to the corresponding stall. This also reduces the number of cycles substantially. Hence, the LF0001 − LJ1011 fork-join combination results in the minimum number of combinational cycles among all fork-join combinations. This, in turn, minimizes the need to use eager forks to cut the cycles, which minimizes the total area and power consumption of the hybrid control network.
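Hand analysis of combinational cycles amounts to cycle detection on a directed graph whose nodes are control signals and whose edges are combinational paths. The sketch below uses a plain depth-first search; the graph encoding, the signal names such as "c1.V" and the function `find_cycle` are all hypothetical and are not taken from any tool flow mentioned above.

```python
def find_cycle(graph):
    """Return one combinational cycle as a list of nodes, or None.

    graph maps each signal name to the signals it drives combinationally.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for succ in graph.get(node, ()):
            state = color.get(succ, WHITE)
            if state == GREY:  # back edge: succ is on the current path
                return stack[stack.index(succ):] + [succ]
            if state == WHITE:
                cycle = visit(succ)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical channel c1 between a fork branch and a join input.
# A lazy fork with a stall-to-valid path plus a lazy join with a
# valid-to-stall path closes a loop on the channel.
lazy_network = {"c1.S": ["c1.V"], "c1.V": ["c1.S"]}
# Replacing the fork by an eager one removes the stall-to-valid path.
hybrid_network = {"c1.S": [], "c1.V": ["c1.S"]}
```

Here `find_cycle(lazy_network)` returns a loop through both signals of the channel, while `find_cycle(hybrid_network)` returns `None`.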

5.2 Runtime Boosting
Eager forks can perform better than lazy forks due to the early start they provide for ready branches (Sec. 3). However, an analysis of the state table of the eager fork shows that a 2-output eager fork behaves exactly the same as a 2-output lazy fork (LF0000 and LF0001) if the following conditions are met:

1. The eager fork registers are all reset to logic one.

2. The following input combinations are avoided:

(a) (Vl = 1)&(Sr1 = 0)&(Sr2 = 1)

(b) (Vl = 1)&(Sr1 = 1)&(Sr2 = 0)

(c) (Vl = 1)&(Sr1 = 1)&(Sr2 = 1)

The detailed analysis is omitted due to space limitations. The results can be easily extended to n-output forks with n > 2. Hence, all the eager forks in the control network that meet the above conditions can be safely replaced by lazy forks. The result is a hybrid control network (incorporating both eager and lazy forks) that has the same runtime as an all-eager network.
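Condition 2 above states that the stem is never valid while any branch stalls. A simulation-based check of that condition can be sketched as follows; the function name and the trace format (one settled sample of the stem valid and the branch stalls per clock cycle) are assumptions made for illustration, and condition 1 (the reset value of the fork registers) is a design property that this check does not cover.

```python
def avoids_forbidden_inputs(samples):
    """Check condition 2 on a recorded trace of one fork.

    samples is an iterable of (vl, stalls), where vl is the stem valid and
    stalls is a tuple with one stall bit per branch (n >= 2 branches).
    Returns True if no sample has vl = 1 together with any stalled branch.
    """
    return all(not (vl and any(stalls)) for vl, stalls in samples)
```

A fork whose trace passes this check (and whose registers satisfy condition 1) is a candidate for replacement by a lazy fork.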

Algorithms to identify which eager forks in a network can be replaced by lazy forks are currently being developed. For the time being, we use simulation-based analysis. In this approach, the closed eager control network is simulated and all the fork valid and stall patterns are analyzed. An example is shown in the MiniMIPS case study in Section 6. Starting with an elastic control network (generated manually or through automatic tools like CNG [11]), we propose the following flow to generate a hybrid SELF implementation (H) of that network:

1. Define the set of all forks in the control network, Φ.

2. Construct a pure eager implementation of the controlnetwork, E1, such that each fork F ∈ Φ is an eagerfork. Define the set of forks, Φs, that do not meetthe conditions mentioned above. Φs are the forks thatmust be implemented as eager to achieve the same run-time as a purely eager implementation of the controlnetwork.

3. Construct an intermediate hybrid network, H1, suchthat: each fork F ∈ Φ − Φs is a lazy fork, and eachfork F ∈ Φs is an eager fork.

4. In H1, identify the set of forks, Φc, that need to bereplaced by eager forks to cut the combinational cycles.

5. Build a final hybrid network, H, such that each fork F ∈ Φ − Φs − Φc is lazy, and each fork F ∈ Φs ∪ Φc is eager.
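The final assignment of steps 3 - 5 is plain set arithmetic. The sketch below assumes that Φs (from simulation) and Φc (from cycle analysis) have already been determined; the function name and the fork names F1..F5 are hypothetical placeholders, not forks of any network discussed in this paper.

```python
def hybrid_partition(all_forks, runtime_forks, cycle_forks):
    """Split the forks of a control network into lazy and eager sets.

    all_forks     -- the set Phi of all forks
    runtime_forks -- the set Phi_s of forks that must stay eager for runtime
    cycle_forks   -- the set Phi_c of forks made eager to cut cycles
    Returns (lazy, eager) with lazy = Phi - Phi_s - Phi_c.
    """
    eager = (set(runtime_forks) | set(cycle_forks)) & set(all_forks)
    lazy = set(all_forks) - eager
    return lazy, eager


# Hypothetical network with five forks.
phi = {"F1", "F2", "F3", "F4", "F5"}
phi_s = {"F1"}        # violates the conditions of Section 5.2
phi_c = {"F1", "F4"}  # needed to cut combinational cycles
lazy_forks, eager_forks = hybrid_partition(phi, phi_s, phi_c)
```

In this example F1 and F4 are implemented as eager forks and the remaining three as lazy forks.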

6. MINIMIPS CASE STUDY AND RESULTS
MiniMIPS is an 8-bit subset of the 32-bit MIPS processor [14, 7]. The MiniMIPS has been elasticized and its control network was generated by the CNG tool in [11]. We use the same control network, except that we add 0 to 3 bubbles (i.e., EBs that hold no valid data) at the register file output (i.e., at the inputs of the A and B registers simultaneously). In practice, this might be done, for example, to accommodate a high-latency register file without affecting the functionality of the whole system. The resultant control network is shown in Fig. 8. From the control point of view, the register file (R) in a microprocessor can be considered a combinational unit [6]. Hence, we did not incorporate a separate EB for the register file (R) in Fig. 8.

When running the testbench program in [14] on the MiniMIPS and counting the number of clock cycles required to finish the program, an all-lazy implementation of the SELF protocol has a 32.7% and 58.8% longer runtime than an all-eager


Figure 8: Control network of the elastic clocked MiniMIPS. Adapted from [11].

Figure 9: Stall patterns at the branches of FC in the presence of bubbles.

Figure 10: Hybrid implementation of FC.

implementation in the case of 1 and 3 bubbles, respectively [1]. The reason is the performance advantage of eager forks described in Section 3. To achieve the same runtime as the eager implementation using a hybrid implementation, we have to determine which forks should be implemented as eager (see Section 5.2). To do that, we simulate the eager implementation of the network with 0 to 3 bubbles at the register file outputs. We then investigate the valid and stall patterns in all the forks of the design. This analysis shows that all forks except FC and FL receive valid and stall patterns that meet the conditions of Section 5.2. Hence, all the forks (except FC and FL) can be safely implemented as lazy forks without any performance loss. For FC, we observe repetitive stall patterns similar to those shown in Fig. 9. The stall patterns at C1 and C3 meet the conditions of Section 5.2 (they do not stall at all). Hence, branches C1 and C3 can be safely connected through a lazy fork. In contrast, the stall patterns at branches C2 and C4 violate one of the conditions of Section 5.2 (both branches stall at the same time, when the input valid is 1). Hence, branches C2 and C4 should be connected through an eager fork to achieve the same runtime as the eager implementation of the control network. Finally, the stall patterns of these two forks violate the conditions of Section 5.2 (one branch stalls while the other does not). Hence, the two forks should be connected through a third eager fork. This results in the hybrid fork in Fig. 10, where EF and LF refer to eager and lazy forks, respectively. Similarly, to achieve the same runtime as an eager implementation, we had to use a 3-output eager fork to implement FL.

To cut the cycles in the lazy implementations of the MiniMIPS control network, we had to replace more lazy forks with eager forks. Some lazy fork-join combinations exhibit more cycles than others and, hence, require more eager fork replacements. We implemented the MiniMIPS control network using all 12 correct lazy fork-join combinations (with some eager fork replacements). We also implemented the network with an all-eager implementation.

Table 2 shows the synthesis results. The MiniMIPS control network has been synthesized separately from the data path. All combinations passed post-synthesis simulation (with 0 to 3 bubbles). To verify them, we ran the MiniMIPS testbench program in [14]. Column 1 in Table 2 lists the different combinations (sorted by their area). Column 2 lists the eager fork replacements in each implementation. Unsurprisingly, LF0001 − LJ1011 needs the fewest eager fork replacements (see Section 5.1), tying with LF0000 − LJ1011 in this specific network. Column 3 lists the number of combinational cycles in the control network (after eager fork replacements), which is zero for all of them. Column 4 lists the synthesis area. LF0000 − LJ1011 requires the minimum area of all, with a 31.8% reduction compared to an eager implementation. LF0001 − LJ1011 comes second.

Columns 5-7 list the dynamic and leakage power consumption reported by the synthesis tool. Power is calculated with different numbers of bubbles inserted at the output of the register file. To estimate the power accurately, we simulated the synthesized netlist and generated a SAIF file that was read by the synthesis tool to calculate the power. Synthesis and simulation were done at a 4 ns clock period for all the implementations. LF0000 − LJ1011 consumes the least power of all, with up to 32.5% and 32.1% dynamic and leakage power reduction compared to an eager implementation. LF0001 − LJ1011 comes second.
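The quoted reductions follow directly from the Table 2 entries of the F0000 − J1011 row and the all-eager row. The short calculation below is a sketch for checking the arithmetic; the helper name `reduction_percent` is hypothetical, and the numbers are copied from Table 2 (area, then dynamic and leakage power for 0, 1 and 3 bubbles).

```python
def reduction_percent(hybrid, eager):
    """Relative saving of the hybrid value with respect to the eager one."""
    return (eager - hybrid) / eager * 100.0


# (hybrid, eager) pairs taken from Table 2.
area_reduction = reduction_percent(513.0, 752.4)

dynamic_power = {0: (58.187, 86.158), 1: (164.284, 221.921), 3: (122.720, 168.807)}
leakage_power = {0: (1.980, 2.914), 1: (1.990, 2.875), 3: (1.992, 2.842)}

# "Up to" figures: the largest saving over the three bubble counts.
max_dynamic = max(reduction_percent(h, e) for h, e in dynamic_power.values())
max_leakage = max(reduction_percent(h, e) for h, e in leakage_power.values())

print(f"area {area_reduction:.1f}%, dynamic {max_dynamic:.1f}%, leakage {max_leakage:.1f}%")
```

Rounded to one decimal place, the three values are 31.8%, 32.5% and 32.1%, with both power maxima occurring in the 0-bubble case.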

Columns 8-10 list the runtime (in number of clock cycles) required to finish the testbench program in [14]. With the eager fork replacements shown, all the hybrid implementations achieve the same runtime as the eager implementation.

Finally, column 11 lists the maximum speed of the control network in the different implementations. The control network was synthesized separately at a target clock period of 4 ns. For post-synthesis simulation, we used a behavioral (zero-delay) model of the MiniMIPS datapath. This ensures that the speed measurements are limited by the control network and not the data path. All implementations had the same speed limit of 1 GHz, due to maximum frequency limitations of the latches and flip-flops used in the EB controllers and eager forks.

7. CONCLUSION
An eager implementation of the SELF protocol is free of combinational cycles and offers performance advantages in some systems compared to lazy implementations. However, it is more expensive in terms of area and power consumption. In this paper we formally and exhaustively investigated the lazy SELF specifications, different implementations and verification. We also introduced a novel hybrid implementation that combines the advantages of both eager and lazy elements. It is cycle free and has the same performance as an all-eager implementation. A MiniMIPS case study showed that our hybrid implementation achieves the same runtime and maximum speed as the eager implementation with a reduction of 31.8% in area and up to 32.5% and 32.1% in dynamic and leakage power consumption, respectively.


Table 2: Area, Power, Runtime, Maximum Speed of MiniMIPS Control Network Using Different Fork-Join Combinations

B = bubbles. Each power cell lists Pdyn / Pleakage in µW at a 4 ns clock period.

| Combination | Eager Forks Used | nCycles | Area (µm²) | Power, 0 B | Power, 1 B | Power, 3 B | Runtime, 0 B (nCycles) | Runtime, 1 B (nCycles) | Runtime, 3 B (nCycles) | Max Speed (GHz) |
|---|---|---|---|---|---|---|---|---|---|---|
| F0000 − J1011 | Some branches of FC, FL | 0 | 513.0 | 58.187 / 1.980 | 164.284 / 1.990 | 122.720 / 1.992 | 98 | 147 | 245 | 1 |
| F0001 − J1111 | FC, FL, FBCP | 0 | 575.4 | 65.626 / 2.339 | 188.094 / 2.307 | 140.389 / 2.278 | 98 | 147 | 245 | 1 |
| F0001 − J1011 | Some branches of FC, FL | 0 | 588.0 | 58.187 / 2.640 | 183.991 / 2.536 | 134.636 / 2.542 | 98 | 147 | 245 | 1 |
| F0001 − J0000 | FC, FL, FBCP | 0 | 634.2 | 65.626 / 2.739 | 194.001 / 2.663 | 143.822 / 2.599 | 98 | 147 | 245 | 1 |
| F0000 − J1111 | FC, FL, FBCP, FMem, FABCI4P | 0 | 639.0 | 74.475 / 2.525 | 206.882 / 2.514 | 155.145 / 2.499 | 98 | 147 | 245 | 1 |
| F0001 − J0011 | FC, FL, FBCP | 0 | 646.8 | 65.626 / 2.738 | 192.545 / 2.672 | 143.065 / 2.617 | 98 | 147 | 245 | 1 |
| F0001 − J1010 | FC, FL, FBCP | 0 | 649.8 | 64.710 / 2.761 | 197.261 / 2.691 | 145.481 / 2.631 | 98 | 147 | 245 | 1 |
| F0001 − J0010 | FC, FL, FBCP | 0 | 653.4 | 65.635 / 2.685 | 191.208 / 2.642 | 142.149 / 2.598 | 98 | 147 | 245 | 1 |
| F0000 − J000 | FC, FL, FBCP, FMem, FABCI4P | 0 | 683.4 | 74.933 / 2.825 | 196.338 / 2.762 | 148.919 / 2.713 | 98 | 147 | 245 | 1 |
| F0000 − J0011 | FC, FL, FBCP, FMem, FABCI4P | 0 | 695.4 | 74.933 / 2.790 | 198.957 / 2.742 | 150.580 / 2.699 | 98 | 147 | 245 | 1 |
| F0000 − J0010 | FC, FL, FBCP, FMem, FABCI4P | 0 | 698.4 | 74.475 / 2.853 | 202.539 / 2.838 | 152.374 / 2.811 | 98 | 147 | 245 | 1 |
| F0000 − J1010 | FC, FL, FBCP, FMem, FABCI4P | 0 | 704.4 | 73.101 / 2.887 | 205.521 / 2.867 | 153.914 / 2.844 | 98 | 147 | 245 | 1 |
| EFork − LJ0000 | ALL | 0 | 752.4 | 86.158 / 2.914 | 221.921 / 2.875 | 168.807 / 2.842 | 98 | 147 | 245 | 1 |

8. REFERENCES

[1] Reference is omitted for blind review.

[2] L. Carloni, K. McMillan, and A. L. Sangiovanni-Vincentelli. Theory of latency insensitive design. In IEEE Transactions on CAD of Integrated Circuits and Systems, volume 20, pages 1059–1076, Sep 2001.

[3] L. Carloni and A. Sangiovanni-Vincentelli. Coping with latency in SoC design. Micro, IEEE, 22(5):24–35, Sep/Oct 2002.

[4] J. Carmona, J. Cortadella, M. Kishinevsky, and A. Taubin. Elastic circuits. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 28(10):1437–1455, Oct. 2009.

[5] A. Cimatti, E. Clarke, E. Giunchiglia, F. Giunchiglia, M. Pistore, M. Roveri, R. Sebastiani, and A. Tacchella. NuSMV 2: An opensource tool for symbolic model checking. In Proc. of 14th Conf. on Computer Aided Verification (CAV 2002), volume 2404, July 2002.

[6] J. Cortadella, M. Kishinevsky, and B. Grundmann. Synthesis of synchronous elastic architectures. In ACM/IEEE Design Automation Conference, pages 657–662, July 2006.

[7] J. H. et al. The MIPS Machine. In COMPCON, pages 2–7, 1982.

[8] A. Gotmanov, M. Kishinevsky, and M. Galceran-Oms. Evaluation of flexible latencies: designing synchronous elastic H.264 CABAC decoder. In The Problems in design of micro- and nano-electronic systems, 2010.

[9] G. Hoover and F. Brewer. Synthesizing synchronous elastic flow networks. In Design, Automation and Test in Europe, 2008. DATE '08, pages 306–311, 10-14 2008.

[10] H. M. Jacobson, P. N. Kudva, P. Bose, P. W. Cook, S. E. Schuster, E. G. Mercer, and C. J. Myers. Synchronous interlocked pipelines. In 8th International Symposium on Asynchronous Circuits and Systems, pages 3–12, Apr. 2002.

[11] E. Kilada and K. Stevens. Control network generator for latency insensitive designs. In Design, Automation & Test in Europe Conference Exhibition (DATE), 2010, pages 1773–1778, March 2010.

[12] S. Krstic, J. Cortadella, M. Kishinevsky, and J. O'Leary. Synchronous elastic networks. In Formal Methods in Computer Aided Design, 2006. FMCAD '06, pages 19–30, Nov. 2006.

[13] V. Vakilotojar and P. Beerel. RTL verification of timed asynchronous and heterogeneous systems using symbolic model checking. In Design Automation Conference 1997. Proceedings of the ASP-DAC '97. Asia and South Pacific, pages 181–188, 28-31 1997.

[14] N. Weste and D. Harris. CMOS VLSI design: a circuit and systems perspective. 2004.