
Optimization and Design Tips for FPGA/ASIC (How to make the best designs)

Mohit Arora Prashant Bhargava

Amit Srivastava

125, Udyog Vihar, Phase I,

Gurgaon - 122016, India (www.dcmtech.com)

ABSTRACT

Designing can be anyone’s cup of tea, but it is surely not a bed of roses. Developing a good and robust design is what really matters, first on an FPGA and finally on an ASIC. The tips, tricks and ideas presented in this paper are only a small drop in the ocean of efficient design techniques, but they will help designers take the first step towards developing efficient designs. This paper covers state machine coding styles, efficient ways of writing code, portability from FPGA to ASIC, implementation of internal memories in FPGAs, design tips for multiple-clock designs, clock gating, clock management, efficient use of resets, synchronous design and the problems with latches.


Optimization and Design tips for FPGA/ASIC

SNUG India, 2002

Index

1. State Machine Coding Styles
   (a) Types of State Machines
       i. Binary (Binary-Sequential) Encoded State Machine
       ii. Gray Code Encoded State Machine
       iii. One-Hot Encoded State Machine
   (b) Synthesis Issues
       i. Binary-Sequential State Machine code
       ii. Gray Encoded State Machine code
       iii. One-Hot State Machine code
   (c) Choosing a State Machine Encoding Style
   (d) Guidelines
2. Efficient Coding/Design Tips
3. Clock Management
4. Requirement of PLL/DLL
   (a) Delay Lock Loops (DLL)
       i. Working of a DLL
   (b) Phase Locked Loop (PLL)
5. Gated Clocks
6. Problem with latches
7. Using reset correctly
8. Clock skew problem
   (a) Effect of Clock Skew on Max. Frequency
9. Synchronous Design
10. Multiple Clock Domain
11. Designing for Portability (FPGA to ASIC)
12. Implementing Internal Memories in FPGA
   (a) Implementing CAM (Content Addressable Memory)
       i. Resource Usage
       ii. Writing to CAM
       iii. Reading from CAM
   (b) Implementing RAM/ROM
       i. Resource Usage
       ii. Writing to RAM
       iii. Reading from RAM
   (c) Example of a sample DUT
13. Conclusion
14. Acknowledgements
15. References
16. Author & Contact information


List of Figures

Figure 1: FSM Block Diagram
Figure 2: Synthesis results for different state machine encoding styles
Figure 3: A much better State Machine
Figure 4: Test Mode Setup
Figure 5: No Pipelining
Figure 6: Pipelining
Figure 7: Working of a DLL
Figure 8: 180 degree phase difference between Input clock and feedback clock
Figure 9: Feedback clock made in phase with the Input clock
Figure 10: Gated Clocks
Figure 11: Propagation Delay in Gated Clocks
Figure 12: Clock Enable Flip-Flops
Figure 13: Race condition in Latches
Figure 14: Inferred Latch due to incomplete ‘if else’ statement
Figure 15: Combinational loop implemented due to incomplete ‘if-else’ statement
Figure 16: Set Reset Flip Flop used in Design
Figure 17: Set Reset pins of Flip Flop used instead of SR Flip Flop
Figure 18: Problem of Clock Skew
Figure 19: Malfunctioning of Flip Flops due to clock skew
Figure 20: Resolution of Clock Skew Problem
Figure 21: Flip Flop Parameters
Figure 22: Dual Clock Domain
Figure 23: Handshaking signaling
Figure 24: Section of Internal Architecture of ALTERA FPGA
Figure 25: Content Addressable Memory
Figure 26: Single Match Mode (CAM)
Figure 27: Timing diagram for Read/Write into CAM
Figure 28: RAM & ROM in Altera FPGAs
Figure 29: A Sample DUT


Synopsis of Tips in this Document

Tip 1: Choosing an Encoding Style
Tip 2: Resolution from Illegal States
Tip 3: RTL Coding Style for State Machine
Tip 4: Use parameters with symbolic state names for State Assignments
Tip 5: Always use absolutely glitch free State Machine
Tip 6: Use of common expressions
Tip 7: Minimum Use of Operators
Tip 8: Optimizing “case” statements
Tip 9: Use state machines wherever possible
Tip 10: Be aware of Prioritization
Tip 11: Thou shalt use Registers, never Latches
Tip 12: Avoid Combinational Loops in Design
Tip 13: Keep possibility for “Design for Test”
Tip 14: Trust not thy simulator – it may beguile thee when thy design is junk
Tip 15: Use of Blocking & Non-blocking Assignments (Verilog Only)
Tip 16: Fast Integer Multipliers
Tip 17: Pipelining
Tip 18: System Clocks vs. Dedicated Clocks
Tip 19: Bypass the PLL for easy testing & debugging
Tip 20: Close the Gates to Gated Clocks
Tip 21: Combinational Loops make you go in loops !!
Tip 22: Use system wide/global reset
Tip 23: Never use Set-Reset flip-flops (SR Flip Flops) in design
Tip 24: Resolution of Clock Skew Problem
Tip 25: The road from FPGA to ASIC is called “Synchronous Design”
Tip 26: Use synchronizer to pass control signals between different clock domains
Tip 27: Data transfer between two clock domains


1. State Machine Coding Styles:

A state machine is an important and integral part of almost every design. It reduces complexity by providing easy control over the design. Because the state machine is so critical, coding it inefficiently will result in frequency and area problems during synthesis.

Figure 1: FSM Block Diagram

But first we start with the types of state machines known.

(a) Types of State Machines:

There are two types of state machines, classified according to the way their outputs are generated:

  • Moore State Machine
  • Mealy State Machine

In the Mealy model, the outputs are a function of both the present state and the inputs. In the Moore model, the outputs are a function of the present state only. Both models are shown in Figure 1. In the Mealy model, y is a function of both x (the input) and the present state of the state machine. In the Moore model, the outputs are derived only from the flip-flop outputs and therefore depend on the present state alone.

In a Moore model the outputs are synchronized to the clock, but in a Mealy model the outputs may show momentary false values because of the delay between the time the inputs change and the time the flip-flop outputs change. Thus, in a Mealy model, the inputs should be synchronized with the clock and the outputs must be sampled only at the clock transition.

Apart from the above classification, state machines can also be classified according to the state encoding they employ:

[Figure 1 blocks: Next State Decoder (Combo), Present State Flip-Flops (Sequential), Output Logic (Combo). Signals: Inputs (x), Next State, Present State, Clock, Outputs (y). The connection from Inputs (x) to the Output Logic is for the Mealy state machine only.]


  • Binary
  • Gray Code
  • One-Hot
      o One-Hot with zero-idle

This paper deals with synthesizable Verilog code for the above state machines.

i. Binary (Binary-Sequential) Encoded State Machine:

In this type of state machine, each state is assigned a binary value. The number of flip-flops (F) required for a binary encoded state machine with S states is:

    $F = \lceil \log_2 S \rceil$

that is, $\log_2 S$ rounded up to the next integer. For example, a 4-state state machine requires 2 flip-flops to define its four states uniquely. The four states can be encoded as follows:

    State 1 = “00”
    State 2 = “01”
    State 3 = “10”
    State 4 = “11”

Advantage: Less hardware is needed than with one-hot encoding. The number of flip-flops is given by the above equation, whereas a one-hot encoded state machine needs as many flip-flops as it has states.

Disadvantages:

1. The control logic for a binary-sequential encoded state machine is quite complex, as it depends on every state bit as well as the inputs.
2. Unwanted transitions may occur, or a transition may occur to an unpredictable or dead state. For example, take a 3-state state machine whose states are encoded as follows:

    State 1 = “00”
    State 2 = “01”
    State 3 = “10”

   A transition from State 2 to State 3 changes both state bits. If the two bits change at slightly different times and the state bits are used asynchronously, the state machine may temporarily appear to be in a state with value “11”. Such an unassigned state is known as a dead state. Alternatively, it may temporarily appear to be in State 1 (“00”), which gives unwanted or unpredictable outputs from the state machine.
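The flip-flop count for binary encoding can also be computed at elaboration time rather than by hand. The following is a minimal sketch only: the module name state_width_demo, the function ff_count and the parameter names are illustrative and do not come from the paper. It assumes a Verilog-2001 tool that supports constant functions.

```verilog
// Sketch: derive the state register width from the number of states.
// All names here are hypothetical, chosen for illustration.
module state_width_demo;

  // Returns the smallest F such that 2**F >= states (for states >= 2).
  function integer ff_count;
    input integer states;
    integer remaining;
    begin
      remaining = states - 1;
      for (ff_count = 0; remaining > 0; ff_count = ff_count + 1)
        remaining = remaining >> 1;
    end
  endfunction

  localparam NUM_STATES = 4;
  localparam STATE_BITS = ff_count(NUM_STATES);  // 2 for a 4-state machine

  reg [STATE_BITS-1:0] curr;  // binary encoded state register

  initial
    $display("%0d states need %0d flip-flops", NUM_STATES, STATE_BITS);

endmodule
```

With NUM_STATES set to 3 the function also returns 2, which matches the 3-state example above where one of the four codes is left unused.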


ii. Gray Code Encoded State Machine:

This is also a type of binary encoded state machine, but the states are encoded using Gray codes, so that only one bit changes when moving from one state to the next in the sequence. A Gray encoded state machine thus overcomes the problem of transitions to dead or unwanted states. The number of flip-flops required is the same as for the binary encoded state machine. The four states can be encoded as follows:

    State 1 = “00”
    State 2 = “01”
    State 3 = “11”
    State 4 = “10”

Advantage: Gray code encoding is especially beneficial when the state bits are used asynchronously. There are no transitions to dead or unwanted states.

Disadvantage: The control logic for a Gray code encoded state machine is also quite complex, as it depends on every state bit as well as the inputs.
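The Gray codes listed above follow the standard binary-to-Gray conversion, in which each Gray bit is the XOR of a binary bit with its more significant neighbour. A small sketch is given below. The module name bin2gray and its port names are illustrative assumptions, not part of the paper.

```verilog
// Sketch: standard binary-to-Gray conversion (names are hypothetical).
module bin2gray #(parameter WIDTH = 2) (
  input  wire [WIDTH-1:0] bin,   // binary state number
  output wire [WIDTH-1:0] gray   // corresponding Gray code
);
  // The MSB is unchanged; every other bit is XORed with the bit above it.
  assign gray = bin ^ (bin >> 1);
endmodule
```

For WIDTH = 2 this maps 00, 01, 10, 11 to 00, 01, 11, 10, which is the encoding of States 1 to 4 given above. Note that the single-bit-change property holds only for transitions between consecutive codes (including the wrap from the last code to the first). A state machine with arbitrary branching may still have transitions that change more than one bit.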

iii. One-Hot Encoded State Machine:

A one-hot encoding scheme uses one flip-flop for each state. A 4-state state machine therefore uses 4 flip-flops, with only one flip-flop at a high logic level at any time. A 4-state state machine can be encoded using this scheme as follows:

    State 1 = “0001”
    State 2 = “0010”
    State 3 = “0100”
    State 4 = “1000”

Advantage: The control logic for a one-hot encoded state machine is quite simple, as the inputs to the state bits are often simply a function of the other state bits. In addition, a one-hot implementation is quite fast, since there is one dedicated flip-flop per state.

Disadvantage: This type of encoding requires more hardware than the previously described encoding styles. For example, a 100-state state machine requires 100 flip-flops, while a binary/Gray encoded state machine would require only 7 flip-flops.

One-Hot Encoded State Machine with zero idle:

This state machine is similar to the one-hot encoded state machine except that the IDLE state or the “most frequented” state is assigned zero bit position for better results.

(b) Synthesis Issues:

Portions of code (in Verilog) are presented here for ease of understanding. The synthesis results and the issues encountered are explained after the code.

i. Binary-Sequential State Machine code

    parameter [1:0] IDLE   = 2'b00, // Idle state
                    STATE1 = 2'b01, // State 1
                    STATE2 = 2'b10, // State 2
                    STATE3 = 2'b11; // State 3

    reg [1:0] curr; // Current state of state machine
    reg [1:0] next; // Next state of state machine

    // State register with active-low asynchronous reset
    always @(posedge clk or negedge reset)
    begin: State_Changer
      if (~reset)
        curr <= IDLE;
      else
        curr <= next;
    end // State_Changer

    // Next-state logic
    always @(curr or ide_st1 or st1_st2 or st2_st3 or st3_ide)
    begin: next_state_gen
      case (curr)
        IDLE:    if (ide_st1) next = STATE1; else next = curr;
        STATE1:  if (st1_st2) next = STATE2; else next = curr;
        STATE2:  if (st2_st3) next = STATE3; else next = curr;
        STATE3:  if (st3_ide) next = IDLE;   else next = curr;
        default: next = curr;
      endcase
    end

ii. Gray Encoded State Machine code

    parameter [1:0] IDLE   = 2'b00, // Idle state
                    STATE1 = 2'b01, // State 1
                    STATE2 = 2'b11, // State 2
                    STATE3 = 2'b10; // State 3

    reg [1:0] curr; // Current state of state machine
    reg [1:0] next; // Next state of state machine

    // State register with active-low asynchronous reset
    always @(posedge clk or negedge reset)
    begin: State_Changer
      if (~reset)
        curr <= IDLE;
      else
        curr <= next;
    end // State_Changer

    // Next-state logic
    always @(curr or ide_st1 or st1_st2 or st2_st3 or st3_ide)
    begin: next_state_gen
      case (curr)
        IDLE:    if (ide_st1) next = STATE1; else next = curr;
        STATE1:  if (st1_st2) next = STATE2; else next = curr;
        STATE2:  if (st2_st3) next = STATE3; else next = curr;
        STATE3:  if (st3_ide) next = IDLE;   else next = curr;
        default: next = curr;
      endcase
    end


iii. One-Hot State Machine code

    // Each parameter is the bit position of its state in the one-hot vector
    parameter [3:0] IDLE   = 4'd0, // Idle state
                    STATE1 = 4'd1, // State 1
                    STATE2 = 4'd2, // State 2
                    STATE3 = 4'd3; // State 3

    reg [3:0] curr; // Current state of state machine
    reg [3:0] next; // Next state of state machine

    // State register with active-low asynchronous reset
    always @(posedge clk or negedge reset)
    begin: State_Changer
      if (~reset)
        curr <= 4'b0001;
      else
        curr <= next;
    end // State_Changer

    // Next-state logic
    always @(curr or ide_st1 or st1_st2 or st2_st3 or st3_ide)
    begin: next_state_gen
      next = 4'b0000;
      case (1'b1)
        curr[IDLE]:   if (ide_st1) next[STATE1] = 1'b1; else next[IDLE]   = 1'b1;
        curr[STATE1]: if (st1_st2) next[STATE2] = 1'b1; else next[STATE1] = 1'b1;
        curr[STATE2]: if (st2_st3) next[STATE3] = 1'b1; else next[STATE2] = 1'b1;
        curr[STATE3]: if (st3_ide) next[IDLE]   = 1'b1; else next[STATE3] = 1'b1;
        default:      next = curr;
      endcase
    end

Notes on the code: Using the indexed form (next[IDLE] = 1'b1) reduces the size of the case statement from 4 bits to a single bit. In the default or error case, the state machine can go to idle.

Synthesis Results: The above encoding styles were used to code, in Verilog, two state machines having 4 and 25 states. The state machines were synthesized and the results are shown in Figure 2. It is clearly visible that a state machine with fewer states gives better results when encoded using the one-hot style, while a state machine with more states gives better results when encoded using binary/Gray encoding. Figure 2 shows, for each state machine, the LC count and frequency against the type of encoding used.

Figure 2: Synthesis results for different state machine encoding styles

(c) Choosing a State Machine Encoding Style:

Tip 1: Choosing an Encoding Style

One should choose a state machine encoding style depending on the complexity of the state machine and on how it must recover from illegal/dead states.

When choosing a state machine encoding style, one must consider the number of potential illegal states the state machine can enter. A design might end up in a dead state if the setup or hold times of the state-bit flip-flops are violated and not all the states have been defined. For example, a 14-state state machine can be implemented in Binary/Gray encoding using 4 flip-flops, which leaves only 2 possible illegal states. One-hot encoded state machines, however, have many more potential illegal states. Their number is given by $2^n - n$, where n is the number of states in the state machine. A 14-state state machine requires 14 flip-flops in the one-hot encoding scheme and therefore has 16,370 possible illegal states.
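The two counts quoted for the 14-state example can be checked directly. In general, a state register of F flip-flops has $2^F$ codes, of which S are legal, so $2^F - S$ are illegal. For one-hot encoding F equals S, which gives the $2^n - n$ expression above.

```latex
\[
\begin{aligned}
\text{Binary/Gray } (F = 4):\quad & 2^{4} - 14 = 16 - 14 = 2 \\
\text{One-hot } (F = 14):\quad & 2^{14} - 14 = 16384 - 14 = 16370
\end{aligned}
\]
```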

[Figure 2 panels: bar charts of LC Count and Freq (MHz) for a state machine with 4 states and for a state machine with 25 states. Series: Binary, Gray, One-Hot, One-Hot with zero Idle. Bar labels as extracted, in that order: 4 states, LC Count 11, 12, 9, 8 and Freq 196.7, 189.3, 203.5, 201.1 MHz; 25 states, LC Count 32, 35, 108, 102 and Freq 213.3, 202.4, 115.5, 130.4 MHz.]


Tip 2: Resolution from Illegal States

As long as the setup and hold times of the state-bit flip-flops are not violated, the state machine will not enter an illegal state. Recovery from a dead state can be achieved by coding a transition from every illegal state to a legal one.
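One way to code such a recovery transition is through the default branch of the next-state case statement. The sketch below is illustrative only: the module fsm_recover, its three states and its go and busy signals are hypothetical and are not taken from the paper. The active-low asynchronous reset follows the style of the earlier examples.

```verilog
// Sketch: 3-state machine in 2 state bits, so code 2'b11 is illegal.
// Module, state and signal names are hypothetical.
module fsm_recover (
  input  wire clk,
  input  wire reset,  // active-low asynchronous reset
  input  wire go,     // hypothetical start request
  output wire busy    // hypothetical status output
);
  parameter [1:0] IDLE = 2'b00,
                  RUN  = 2'b01,
                  DONE = 2'b10;

  reg [1:0] curr;
  reg [1:0] next;

  always @(posedge clk or negedge reset)
    if (~reset) curr <= IDLE;
    else        curr <= next;

  always @(curr or go)
    case (curr)
      IDLE:    next = go ? RUN : IDLE;
      RUN:     next = DONE;
      DONE:    next = IDLE;
      default: next = IDLE;  // illegal code 2'b11: return to a legal state
    endcase

  assign busy = (curr != IDLE);
endmodule
```

Some synthesis tools may treat the illegal code as unreachable and optimize the recovery logic away unless instructed otherwise, so it is worth checking the synthesized result.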

(d) Guidelines:

Tip 3: RTL Coding Style for State Machine

Always make each state machine a separate module. Keeping each state machine separate from other synthesized logic simplifies the tasks of state machine definition, modification and debug. It is easy to manage a state machine kept as a separate module in the design.

Tip 4: Use parameters with symbolic state names for State Assignments

Defining and using symbolic state names makes the Verilog code more readable and eases the task of redefining states if necessary. State assignment using parameters with symbolic state names is shown in Section (b) “Synthesis Issues”.

Tip 5: Always use absolutely glitch free State Machine

Figure 3 shows a much better design for a state machine. By adding an output register (with cleanly clocked D-type flip-flops) that is reloaded at each clock edge, the outputs of the state machine are guaranteed to be glitch-free. It is suggested that all state machines be implemented in this form, since the quality of the outputs is then independent of the number of states or outputs.

Figure 3: A much better State Machine

[Figure 3 blocks: Next State Decoder (Combo), Present State Flip-Flops (Sequential), Output Generation Logic (Sequential). Signals: Inputs (x), Next State, Present State, Clock (to both sequential blocks), Outputs (y).]
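The registered-output structure of Tip 5 might be coded as in the following sketch. The module fsm_reg_out, its states and its start and done signals are hypothetical. Decoding the output from next rather than curr is one possible design choice, made here so that the registered output lines up with the state it belongs to. The paper does not specify this detail.

```verilog
// Sketch: state machine whose output comes straight from a flip-flop.
// Module, state and signal names are hypothetical.
module fsm_reg_out (
  input  wire clk,
  input  wire reset,  // active-low asynchronous reset
  input  wire start,  // hypothetical input
  output reg  done    // registered output
);
  parameter [1:0] IDLE   = 2'b00,
                  WORK   = 2'b01,
                  FINISH = 2'b11;

  reg [1:0] curr;
  reg [1:0] next;

  // Present-state flip-flops
  always @(posedge clk or negedge reset)
    if (~reset) curr <= IDLE;
    else        curr <= next;

  // Next-state decoder (combinational)
  always @(curr or start)
    case (curr)
      IDLE:    next = start ? WORK : IDLE;
      WORK:    next = FINISH;
      FINISH:  next = IDLE;
      default: next = IDLE;
    endcase

  // Output register: reloaded at every clock edge, so done is driven
  // by a flip-flop and never by combinational decoding of the state bits.
  always @(posedge clk or negedge reset)
    if (~reset) done <= 1'b0;
    else        done <= (next == FINISH);
endmodule
```

Here done is high during the cycle in which curr equals FINISH. Decoding from curr instead would delay the output by one clock.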


2. Efficient Coding/Design Tips:

Tip 6: Use of common expressions

Identify and factor out common sub-expressions in relational operators to reduce component count. Example:

    if (x > a + b + c) statement1;
    if (y > a + b + d) statement2;
    if (z > a + b + e) statement3;

Here the expression (a + b) can be factored out and its sum used in the “if” statements. This reduces the component count and hence the LC count.

    sum = a + b;
    if (x > sum + c) statement1;
    if (y > sum + d) statement2;
    if (z > sum + e) statement3;

Tip 7: Minimum Use of Operators

The number of operators used in a process should be kept to a minimum. Each operator contributes hardware (an adder, subtractor, multiplier or divider), so operators must be used wisely. Example:

    case (sel)
      2'b00:   y = a + b;
      2'b01:   y = a + c;
      2'b10:   y = a + d;
      2'b11:   y = a + e;
      default: y = a + 1'b1;
    endcase

In the above example, depending on “sel”, y is assigned the sum of a with b, c, d or e. This piece of code implements 4 adders. (The accompanying figure shows four adders, a + b, a + c, a + d and a + e, feeding a multiplexer controlled by sel that drives y.) Three of these adders are redundant and can be removed by modifying the code as follows:

    case (sel)
      2'b00:   temp = b;
      2'b01:   temp = c;
      2'b10:   temp = d;
      2'b11:   temp = e;
      default: temp = 1'b1;
    endcase
    y = a + temp;

In this version only one adder is implemented. (In the accompanying figure, a multiplexer controlled by sel selects b, c, d or e, and a single adder adds a to the selected value to produce y.) Thus operators must be used wisely in the code.

Tip 8: Optimizing “case” statements

“Case” statements can be optimized to reduce the size of the multiplexer used. Consider the following example:

    case (sel)
      5'b00010: temp = b;
      5'b01000: temp = c;
      5'b01100: temp = d;
      default:  temp = 1'b0;
    endcase

The above case statement would implement a huge 32:1 mux, although temp is assigned a value for only three values of “sel”. The case statement can therefore be optimized as follows:

    assign sel_00010 = ~sel[4] & ~sel[3] & ~sel[2] &  sel[1] & ~sel[0];
    assign sel_01000 = ~sel[4] &  sel[3] & ~sel[2] & ~sel[1] & ~sel[0];
    assign sel_01100 = ~sel[4] &  sel[3] &  sel[2] & ~sel[1] & ~sel[0];

    case (1'b1)
      sel_00010: temp = b;
      sel_01000: temp = c;
      sel_01100: temp = d;
      default:   temp = 1'b0;
    endcase

In the above piece of code, an 8:1 mux is formed if we consider each signal derived from “sel” as one bit. During synthesis, this piece of code has been found to take significantly fewer LCs and to give a better frequency than the 32:1 mux.

Tip 9: Use state machines wherever possible

Whenever a piece of control logic gets out of hand, it is better to replace it with a state machine. Consider an arbiter that processes incoming data packets and sends them out according to priority. Since the process of arbitration is well defined, it is easier and better to implement it as a state machine than to build the control logic without one. When the arbiter was implemented as a state machine, the control logic was easy to understand and gave the required results in synthesis and simulation on the first attempt.

Tip 10: Be aware of Prioritization

This tip is as simple as it sounds. One should always be clear whether to use an “if-else-if” or a “case” statement, and this can be decided on the basis of the priority of the signals. If there is no priority, a “case” statement should be used. If priority is to be established, it should be done through “if-else-if” statements. When using nested “if” statements, however, one must take care that the chain does not get too long and produce long path delays.
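The difference between the two constructs can be sketched as follows. The module priority_vs_case and all of its signals are hypothetical, chosen only to illustrate the tip. The first block describes a priority chain and the second a selection with no priority among its branches.

```verilog
// Sketch: "if-else-if" for priority, "case" where no priority exists.
// Module and signal names are hypothetical.
module priority_vs_case (
  input  wire [2:0] req,         // req[2] has the highest priority
  input  wire [1:0] sel,
  input  wire [3:0] a, b, c, d,
  output reg  [1:0] grant,
  output reg  [3:0] y
);
  // Priority is intended: a higher request masks all lower ones.
  always @(req)
    if      (req[2]) grant = 2'd3;
    else if (req[1]) grant = 2'd2;
    else if (req[0]) grant = 2'd1;
    else             grant = 2'd0;

  // No priority: the values of sel are mutually exclusive.
  always @(sel or a or b or c or d)
    case (sel)
      2'b00:   y = a;
      2'b01:   y = b;
      2'b10:   y = c;
      default: y = d;
    endcase
endmodule
```

Both blocks assign their output on every path, so neither infers a latch.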

Tip 11: Thou shalt use Registers, never Latches

Latches are always harmful for the design. Use registers instead. Latches are a handicap for the design. An incomplete “if” or “case” statement (not in a clocked process) will always generate a latch. Such a latch is called an Inferred Latch. To avoid a latch, it best to use an incomplete “if” or “case” statement in a clocked process.

always@(cond1 or d) begin if (cond1) q = d; end The above code shall give an “Inferred Latch” warning during synthesis. To avoid latches, either give fully specified “if” statement or use a clocked process.

Incomplete “if” statement… Latch Inferred !!

Page 16: SNUG Design Tips Paper

Optimization and Design tips for FPGA/ASIC

SNUG India, 2002 Page 12

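  // Complete "if" statement: no latch formed.
  always @(cond1 or d)
  begin
    if (cond1)
      q = d;
    else
      q = 1'b0;
  end

OR

  // Incomplete "if" statement in a clocked process: no latch formed.
  always @(posedge clk or negedge reset)
  begin
    if (~reset)
      q <= 1'b0;
    else if (cond1)
      q <= d;
  end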

Tip 12: Avoid Combinational Loops in Design

Combinational loops occur when the generation of a signal depends on itself through one or more combinational always blocks. Combo loops are a hazard to a design, and synthesis tools typically report warnings or errors when they are encountered. The generation of combo loops can be understood from the following bubble diagram. Each bubble represents a combo always block; an arrow going into a bubble represents a signal used in that always block, while an arrow going out of the bubble represents the output signal generated by that always block. The code and the bubble diagram are given below:

  always @(a) begin b = a; end
  always @(b) begin c = b; end
  always @(c) begin d = c; end
  always @(c) begin a = c; end

[Bubble diagram: one bubble per always block; signals a, b and c form the loop a -> b -> c -> a, and d is generated from c.]

In order to remove combo loops, one must change the generation of one of the signals so that the dependency of the signals on each other is removed. Suppose the generation of signal c depends on two conditions, say, condition1 and condition2. Its generation can then be modified as follows:

  always @(posedge clk or negedge reset)
  begin
    if (~reset)
      c <= 1'b0;
    else if (condition1)
      c <= 1'b1;
    else if (condition2)
      c <= 1'b0;
  end

Thus, we have broken the combo path by introducing a flip-flop. This removes the combo loop from the design.

Tip 13: Keep possibility for “Design for Test”

One must always keep test ports or self-test features that help the designer to test the chip once it is fabricated. By keeping test ports, a designer can tap the internals of the ASIC/FPGA and debug any problem that might occur during testing or on-board running of the ASIC/FPGA. A designer can keep a dedicated bus for the test port or re-use some of the existing busses. The chip must be configurable to turn the test mode on or off.

Figure 4: Test Mode Setup (blocks: Register Space, Decoder, Core; signals: Debug Signals, 32-bit bus, Test Mode Pins)


A basic configuration of test mode support in a core is shown in the above figure. Here the 32-bit bi-directional bus is re-used as the output for the debug signals with the help of a tri-state buffer. A similar setup can be used with a dedicated bus for the output of the debug signals.

Tip 14: Trust not thy simulator – it may beguile thee when thy design is junk

It is easy and tempting to say “I’ll just design it quickly, then find the bugs in simulation”. This is a bad idea and is doomed from the start. Simulators are notorious for hiding the quirky details of your design. Examples include:

• Clock Synchronization: Synchronizing flip-flops constantly battle metastability and glitching inputs. The average simulator does not even closely approximate their behavior; all you see is a clean transition at the clock edge. Crossing clock domains must always be correct by design from the earliest stages.

• Asynchronous Logic: In a similar way, asynchronous logic is often simulated poorly. Certainly, fast paths and race conditions may be hidden. Some environments will determine (and optionally correct) hold-time violations, but this is not a universal panacea for correct asynchronous logic.

Note: Correct by Design and Correct by Inspection

When designing logic that is outside the protected realm of clock-to-clock register-to-register implementations, the only solution for robust design is to do it right from the start. Your logic must be:

• Correct by Design: Each gate, each line of VHDL or Verilog, must be understood completely. Don’t hope that some set of simulations will find your bugs; you may neglect to test a part of your design, and if it was designed sloppily, it will fail.

• Correct by Inspection: Disciplined layout will also make your design more robust, comprehensible, and maintainable. It should not be necessary to sort through a mass of ugly code or spaghetti gates to understand the operation of your function. Organized gates, commented code, and thorough accompanying documentation will provide a basis for a reliable design.

Tip 15: Use of Blocking & Non-blocking Assignments (Verilog Only)

Blocking & Non-Blocking Statements: Two well-known Verilog coding guidelines for modeling logic are:

• Use blocking assignments in always blocks that are written to generate combinational logic

• Use non-blocking assignments in always blocks that are written to generate sequential logic


But why? In general, the answer is simulation related. Ignoring the above guidelines can still infer the correct synthesized logic, but the pre-synthesis simulation might not match the behavior of the synthesized circuit.

A Verilog race condition occurs when two or more statements that are scheduled to execute in the same simulation time-step would give different results when the order of statement execution is changed. To understand the reasons behind the above guidelines and to avoid race conditions, one needs a full understanding of the functionality and scheduling of Verilog blocking and non-blocking assignments.

Blocking Assignments: The blocking assignment operator is an equal sign ("="). A blocking assignment gets its name because it must evaluate the RHS arguments and complete the assignment without interruption from any other Verilog statement. The assignment is said to "block" other assignments until the current assignment has completed. The one exception is a blocking assignment with timing delays on the RHS of the blocking operator, which is considered to be a poor coding style.

Execution of blocking assignments can be viewed as a one-step process:

1. Evaluate the RHS (right-hand side equation) and update the LHS (left-hand side expression) of the blocking assignment without interruption from any other Verilog statement. A blocking assignment "blocks" trailing assignments in the same always block from occurring until after the current assignment has been completed.

A problem with blocking assignments occurs when the RHS variable of one assignment in one procedural block is also the LHS variable of another assignment in another procedural block and both equations are scheduled to execute in the same simulation time step, such as on the same clock edge. If blocking assignments are not properly ordered, a race condition can occur. When blocking assignments in different procedural blocks are scheduled to execute in the same time step, the order of execution is unknown. To illustrate this point, look at the Verilog code:

  module fbosc1 (y1, y2, clk, rst);
    output y1, y2;
    input  clk, rst;
    reg    y1, y2;

    always @(posedge clk or posedge rst)
      if (rst) y1 = 0;   // reset
      else     y1 = y2;

    always @(posedge clk or posedge rst)
      if (rst) y2 = 1;   // preset
      else     y2 = y1;
  endmodule

According to the IEEE Verilog Standard, the two always blocks can be scheduled in any order. If the first always block executes first after a reset, both y1 and y2 will take on the value of 1. If the second always block executes first after a reset, both y1 and y2 will take on the value 0. This clearly represents a Verilog race condition.

Non-Blocking Assignments: The non-blocking assignment operator is the same as the less-than-or-equal-to operator ("<="). A non-blocking assignment gets its name because the assignment evaluates the RHS expression of a non-blocking statement at the beginning of a time step and schedules the LHS update to take place at the end of the time step. Between evaluation of the RHS expression and update of the LHS expression, other Verilog statements can be evaluated and updated, and the RHS expressions of other Verilog non-blocking assignments can also be evaluated and LHS updates scheduled. The non-blocking assignment does not block other Verilog statements from being evaluated.

Execution of non-blocking assignments can be viewed as a two-step process:

1. Evaluate the RHS of non-blocking statements at the beginning of the time step.
2. Update the LHS of non-blocking statements at the end of the time step.

Non-blocking assignments are only made to register data types and are therefore only permitted inside procedural blocks, such as initial blocks and always blocks. Non-blocking assignments are not permitted in continuous assignments. To illustrate this point, look at the Verilog code:

  module fbosc2 (y1, y2, clk, rst);
    output y1, y2;
    input  clk, rst;
    reg    y1, y2;

    always @(posedge clk or posedge rst)
      if (rst) y1 <= 0;   // reset
      else     y1 <= y2;

    always @(posedge clk or posedge rst)
      if (rst) y2 <= 1;   // preset
      else     y2 <= y1;
  endmodule

Again, according to the IEEE Verilog Standard, the two always blocks can be scheduled in any order. No matter which always block starts first after a reset, both non-blocking RHS expressions will be evaluated at the beginning of the time step and then both non-blocking LHS variables will be updated at the end of the same time step. From a user's perspective, the execution of these two non-blocking statements happens in parallel.


Conclusion: Coding a sequential always block with blocking assignments can lead to a race condition. To prevent this, one must code sequential logic with non-blocking assignments, no matter how simple it may be, and combinational logic must be coded using blocking assignments. Using non-blocking assignments for combinational logic can work, but if multiple dependent assignments are made using non-blocking assignments, the result is either incorrect simulation results or a need for additional sensitivity list entries and multiple passes through the always block to simulate correctly. Consider the following code:

  module ao4 (y, a, b, c, d);
    output y;
    input  a, b, c, d;
    reg    y, tmp1, tmp2;

    always @(a or b or c or d)
    begin
      tmp1 <= a & b;
      tmp2 <= c & d;
      y    <= tmp1 | tmp2;
    end
  endmodule

Since non-blocking assignments evaluate the RHS expressions before updating the LHS variables, the values of tmp1 and tmp2 used for y are the values these two variables had upon entry to the always block, not the values that will be updated at the end of the simulation time step. The y output will reflect the old values of tmp1 and tmp2, not the values calculated in the current pass of the always block. If we add tmp1 and tmp2 to the sensitivity list, the always block will give the correct output, but only after taking two passes through the always block. This is known as Multiple Passes; it degrades simulation performance and should be avoided. Also, one must not mix blocking and non-blocking assignments in an always block even though Verilog permits this. Some synthesis tools will give an error, and this type of coding practice is discouraged.
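Following the guideline above, the same and-or logic can be written with blocking assignments so that the output is correct in a single pass. This is a sketch of that guideline applied to the ao4 example; the module name ao4_blocking is an assumption.

```verilog
// Same logic as ao4, coded with blocking assignments (module name is hypothetical).
module ao4_blocking (y, a, b, c, d);
  output y;
  input  a, b, c, d;
  reg    y, tmp1, tmp2;

  always @(a or b or c or d)
  begin
    tmp1 = a & b;         // updated immediately
    tmp2 = c & d;         // updated immediately
    y    = tmp1 | tmp2;   // uses the values computed in this pass
  end
endmodule
```

Because each blocking assignment completes before the next statement runs, tmp1 and tmp2 do not need to appear in the sensitivity list.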

Tip 16: Fast Integer Multipliers

Digital multipliers are needed in many system applications, including digital filters, correlators, and neural networks. These multipliers typically are required to handle operands of up to 16 bits, and need to provide results in less than 50 ns (20 MHz systems). Digital multipliers generally are considered too slow when implemented in FPGAs, or too large to make effective use of a programmable part. As an alternative, dedicated multiplier devices are connected to the main system, often resulting in a performance degradation caused by the delays inherent in getting the data to and from the multiplier device. Thus, techniques for implementing a compact and fast digital multiplier in a programmable part are needed. There are three main implementations of digital multipliers:

1. Shift and Add: One operand is shifted to the left by one bit each cycle and applied to an accumulator when the corresponding bit in the second operand is high.

2. Look-up table: The operands are applied as addresses to a pre-programmed memory that outputs the result.

3. Logical tree: Each of the resultant bits is a logic function of the relevant bits of each operand.

Using a 16-bit x 16-bit multiplier as an example, let’s consider each technique. The shift and add implementation is compact but very slow. The result is obtained after 16 clock cycles and the accumulator must be 32 bits wide. The accumulator will determine the maximum clock rate, dependent on the carry logic chain. This implementation also precludes any new operands from being applied until the calculation has been completed, read, and cleared.

The speed of the look-up table solution depends on the speed of the memory used, but it rapidly becomes unwieldy as operand size increases. This 16x16 example requires a 4,294,967,296 x 32-bit memory! Small multipliers work well this way, such as a 4-bit x 4-bit multiply implemented in a byte-wide ROM.

Some very complex implementations of logical trees have been developed employing product sharing, and are to be found in many dedicated arithmetic devices. The gate count is high and can be considered a reduced version of the ROM table. The logic involved tends to have high fan-in requirements (up to 32 inputs).
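The shift and add technique described above can be sketched in Verilog as follows. This is a minimal illustration, not a tuned implementation: the module name, the start/done handshake and the internal register names are all assumptions introduced here. It takes one cycle to load the operands and then 16 add/shift cycles.

```verilog
// Sketch of a 16x16 shift-and-add multiplier; all names are hypothetical.
module shift_add_mult (clk, rst, start, a, b, product, done);
  input         clk, rst, start;
  input  [15:0] a, b;
  output [31:0] product;
  output        done;

  reg [31:0] product;       // 32-bit accumulator
  reg [31:0] multiplicand;  // operand a, shifted left one bit per cycle
  reg [15:0] multiplier;    // operand b, shifted right so bit 0 is the current bit
  reg [4:0]  count;         // add/shift cycles remaining
  reg        busy, done;

  always @(posedge clk or posedge rst)
  begin
    if (rst) begin
      product      <= 32'd0;
      multiplicand <= 32'd0;
      multiplier   <= 16'd0;
      count        <= 5'd0;
      busy         <= 1'b0;
      done         <= 1'b0;
    end
    else if (start && !busy) begin
      // Load operands; new operands are ignored while busy.
      product      <= 32'd0;
      multiplicand <= {16'd0, a};
      multiplier   <= b;
      count        <= 5'd16;
      busy         <= 1'b1;
      done         <= 1'b0;
    end
    else if (busy) begin
      if (multiplier[0])
        product <= product + multiplicand;  // accumulate when the bit is high
      multiplicand <= multiplicand << 1;
      multiplier   <= multiplier >> 1;
      count        <= count - 5'd1;
      if (count == 5'd1) begin              // last of the 16 add/shift cycles
        busy <= 1'b0;
        done <= 1'b1;
      end
    end
  end
endmodule
```

The 32-bit adder feeding product is the carry chain that limits the clock rate, and the busy flag reflects the restriction that no new operands can be applied until the calculation has completed.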

Tip 17: Pipelining

Pipelining can dramatically improve device performance by restructuring long data paths with several levels of logic and breaking them up over multiple clocks. This method allows for a faster clock cycle and increased data throughput at a small expense in latency from the extra register stages. Because FPGAs are register-rich, this is usually an advantageous structure for FPGA design, since the pipeline is created at little or no cost in terms of device resources. However, since the data is now on a multi-cycle path, special considerations must be used for the rest of the design to account for the added path latency. Care must be taken when defining timing specifications for these paths. The ability to constrain multi-cycle paths with a synthesis tool varies based on the tool being used. Check the synthesis tool's documentation for information on multi-cycle paths.

Note: Careful consideration is recommended before trying to pipeline a design. While pipelining can dramatically increase the clock speed, it can be difficult to do correctly. Also, since multi-cycle paths lend themselves to human error and tend to be more troublesome due to the difficulties in analyzing them correctly, they are not generally recommended for reusable modules.


In a design with multiple levels of logic between registers, the clock speed is limited by the clock-to-out time of the source flip-flop, plus the logic delay through the multiple levels of logic, plus the routing associated with the logic levels, plus the setup time of the destination register. Pipelining a design reduces the number of logic levels between the registers. The end result is a system clock that can run much faster.

Verilog Example before Pipelining

  module no_pipeline (a, b, c, clk, out);
    input  a, b, c, clk;
    output out;
    reg    out;
    reg    a_temp, b_temp, c_temp;

    // Non-blocking assignments are used for sequential logic (see Tip 15).
    always @(posedge clk)
    begin
      out    <= (a_temp * b_temp) + c_temp;  // multiply and add in one stage
      a_temp <= a;
      b_temp <= b;
      c_temp <= c;
    end
  endmodule

Figure 5: No Pipelining

Verilog Example after Pipelining

  module pipeline (a, b, c, clk, out);
    input  a, b, c, clk;
    output out;
    reg    out;
    reg    a_temp, b_temp, c_temp1, c_temp2, mult_temp;

    // Non-blocking assignments are used for sequential logic (see Tip 15).
    // Stage 1: register the operands; stage 2: register the product.
    always @(posedge clk)
    begin
      mult_temp <= a_temp * b_temp;
      a_temp    <= a;
      b_temp    <= b;
    end

    // c is delayed by two registers so that it stays aligned with the product.
    always @(posedge clk)
    begin
      out     <= mult_temp + c_temp2;
      c_temp2 <= c_temp1;
      c_temp1 <= c;
    end
  endmodule

Figure 6: Pipelining

3. Clock Management

In system-level design, the generation, synchronization and distribution of clocks is essential. FPGAs support a limited number of pre-designed clock trees, each of which drives a fixed set of flip-flops. An ASIC, on the other hand, supports any number of clocks through clock tree synthesis. Each clock tree is synthesized to drive a specific set of flip-flops, thus giving the best performance with minimal skew and power consumption. FPGAs have a pre-tested clock distribution mechanism built into the device itself. It provides high fanout with low skew throughout the chip. These clocking resources inside the FPGA are roughly equivalent to the high power buffers found in SoC designs. FPGAs designed for SoRC provide, on average, 4 dedicated global clock resources.

Tip 18: System Clocks vs. Dedicated Clocks

It is recommended to keep the number of system clocks equal to or less than the number of dedicated global clock resources available in the targeted FPGA.


4. Requirement of PLL/DLL

In an FPGA, as the design grows in size, the quality of clock distribution becomes poorer. FPGAs therefore normally use a clock management circuit to reduce the impact of clock skew and clock delay. FPGA architectures normally provide either dedicated Phase-Locked Loop (PLL) or Delay-Locked Loop (DLL) circuits. These circuits not only remove the clock delay but also provide additional functionality, such as clock division and clock multiplication, that is quite useful in designs having multiple clocks.

Now let's explore DLLs and PLLs in more detail.

(a) Delay Lock Loops (DLL)

A DLL consists of a programmable delay line and simple control logic, as shown in the figure below. The delay controller samples the input clock CLK_IN as well as the feedback clock CLK_FBK in order to adjust the delay line. The clock distribution network distributes the clock to all the internal flip-flops and to the clock feedback pin CLK_FBK. The delay line produces a delayed version of the input clock.

i. Working of a DLL

A DLL works by inserting delay between the input clock and the feedback clock until the two rising edges align, making the two clocks in phase. When the edges coincide, the DLL “locks”. After the DLL locks, the two clocks have no phase difference. Thus the delayed output clock compensates for the delay in the clock distribution network. This ensures that the clock edges arrive at the internal flip-flops in synchronism with each clock edge arriving at the input.

Figure 7: Working of a DLL (blocks: Programmable Delay Line, Delay Controller, Clock distribution network; signals: CLK_IN, CLK_OUT, CLK_FBK)


Let's see how it works by taking an example. Assume that, due to the delay in the clock distribution network, CLK_FBK (the feedback clock) is 180 degrees out of phase with the CLK_IN clock, as shown below.

Figure 8: 180 degree phase difference between Input clock and feedback clock (waveforms: CLK_IN, CLK_FBK)

As seen above, the clock distribution network produces a 180-degree phase delay. Assume that initially CLK_OUT = CLK_IN, with the delay line bypassed. Now if we program the delay line so that it produces a 180-degree phase delay of the input clock (CLK_IN) at its output (CLK_OUT), we have the following timing. As seen in the figure, the clock distribution network again produces a 180-degree phase delay of CLK_OUT, so that the final feedback clock is in phase with the input clock (CLK_IN).

Figure 9: Feedback clock made in phase with the Input clock (waveforms: CLK_IN, CLK_OUT, CLK_FBK; the input clock CLK_IN and the feedback clock CLK_FBK have zero phase delay)


Note: The delay programmed is equal to the phase difference between the feedback clock and the input clock. A DLL can also provide multiple phases of a single clock to achieve higher frequencies.

(b) Phase Locked Loop (PLL)

A PLL uses a different architecture to accomplish the same task. Instead of a delay line, a PLL uses a programmable oscillator to generate a clock signal that approximates the input clock CLK_IN. The PLL control logic compares the input clock to the feedback clock CLK_FBK and adjusts the oscillator clock until the rising edge of the input clock aligns with the rising edge of the feedback clock. The PLL then “locks”. The Altera FLEX 20KE is an example of an FPGA architecture that contains a clock management system with a phase-locked loop (PLL). Here the control logic consists of a phase detector and filter that adjust the oscillator phase to compensate for the clock distribution delay.

Tip 19: Bypass the PLL for easy testing & debugging

If a phase-locked loop (PLL) is used for on-chip clock generation, then some means of disabling or bypassing the PLL should be provided. This bypass makes chip testing and debug easier.

5. Gated Clocks

Avoid using gated clocks in the design, since they might make your system unstable. A gated clock design might simulate perfectly well, but the problems appear when it is synthesized.

Figure 10: Gated Clocks (a 2-input AND gate combines CLK_IN and CLK_EN to drive CLK_INFF, the clock input of a D flip-flop)

The figure above shows a typical AND gate placed at the clock input (CLK_INFF above) of a flip-flop. Assume CLK_IN is the system’s global clock, which is routed throughout the design. This configuration is very sensitive to both glitches and simultaneously switching inputs on the AND gate. As shown in the timing diagram below, a propagation delay td is introduced in the clock input to the flip-flop (CLK_INFF). So the data output of the F/F will be in synchronism with CLK_INFF instead of the global clock (CLK_IN).


As a result, some flip-flops in the design will be triggered on the global clock and some on CLK_INFF, which leads to synchronization problems.

Figure 11: Propagation Delay in Gated Clocks (waveforms: CLK_IN, CLK_EN, CLK_INFF; td is the delay in the clock due to the propagation delay in the AND gate)

A simple and safe alternative to the above problem is to use clock enable flip-flops as shown below.

Figure 12: Clock Enable Flip-Flops (CLK_IN drives the clock pin of the D flip-flop; CLK_EN drives its Enable pin)

Here the system clock (CLK_IN) is directly connected to the clock pin of the flip-flop. When the clock is enabled via CLK_EN, new data presented to the flip-flop is reflected at its output; otherwise (when CLK_EN is “low”) the flip-flop retains the previous data. This makes the whole design synchronous to a single clock.
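A clock enable flip-flop of this kind can be coded as in the sketch below. The names clk_in and clk_en follow the figure; the module name ce_ff and the active-low asynchronous reset reset_n are assumptions added for completeness.

```verilog
// Clock-enable flip-flop sketch; ce_ff and reset_n are hypothetical names.
module ce_ff (clk_in, reset_n, clk_en, d, q);
  input  clk_in, reset_n, clk_en, d;
  output q;
  reg    q;

  // The global clock goes straight to the clock pin; clk_en only gates the data.
  always @(posedge clk_in or negedge reset_n)
  begin
    if (!reset_n)
      q <= 1'b0;
    else if (clk_en)
      q <= d;        // load new data when enabled
                     // otherwise q keeps its previous value
  end
endmodule
```

The incomplete “if” is inside a clocked process, so it infers an enable on the register rather than a latch.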

Tip 20: Close the Gates to Gated Clocks

Avoid using gated clocks, as they tend to make the design asynchronous and are very sensitive to glitches or simultaneously switching inputs, for example on the AND gate shown above.


6. Problem with latches

Latches should be avoided wherever possible in the design; flip-flops should be used instead. As seen in the figure below, if both X and Y were to go low, then since the latches are level triggered, both latches would be enabled, causing the circuit to oscillate.

Figure 13: Race condition in Latches (two D latches, enabled by X and Y, connected through combinational circuits with delays tc1 and tc2)

So we conclude that latches cannot be used in sequential circuits with feedback paths, since they will cause racing problems.

• Static timing analyzers typically make incorrect assumptions about latch transparency, and either find a false timing path through the input data pin or miss a critical path altogether.

• Latches tend to make circuits less testable. Most design for test (DFT) and automatic test program generator (ATPG) tools do not handle latches very well.

Synthesis tools occasionally infer a latch in a design when one is not intended. Inferred latches typically result from incomplete "if" or "case" statements. Let's see this with an example:

  always @(clk)
  begin
    if (clk)
    begin
      a = 1'b1;
      b = 1'b0;
    end
    else
      b = 1'b1;   // a is not assigned in this branch
  end

Figure 14: Inferred Latch due to incomplete ‘if else’ statement (a mux selected by clk, with inputs '0' and '1', drives b; a latch with D tied to '1' drives a)


In this example, b will be synthesized as straight combinational logic while a latch will be inferred on signal a. A general rule for latch inferring is that if a variable is not assigned in all possible executions of an always statement (for example, when a variable is not assigned in all branches of an if statement), then a latch is inferred.
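One common way to satisfy this rule is to give every variable a default value at the top of the always block, so that it is assigned on every execution. The sketch below applies this to the example above; the module name no_latch and the choice of 1'b0 as the value of a when clk is low are assumptions, since the original code does not say what a should be in that case.

```verilog
// Latch-free version of the example; module name and default for 'a' are assumed.
module no_latch (clk, a, b);
  input  clk;
  output a, b;
  reg    a, b;

  always @(clk)
  begin
    a = 1'b0;      // assumed default: assigned in every pass, so no latch
    b = 1'b1;
    if (clk)
    begin
      a = 1'b1;
      b = 1'b0;
    end
  end
endmodule
```

With the defaults in place, both a and b should synthesize as plain combinational logic.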

• Some FPGA architectures do not support latches. When such a design is synthesized, the synthesis tool creates a combinational feedback loop instead of a latch. The following hardware will be generated for the previous code (Figure 14) if the FPGA architecture does not support latches.

Figure 15: Combinational loop implemented due to incomplete ‘if-else’ statement (a mux selected by clk, with inputs '0' and '1', drives b; a second mux with inputs '1' and its own output fed back drives a)

Combinational feedback loops as shown above are capable of latching data but pose more problems than latches, since they may violate setup and hold requirements that are difficult to determine, whereas the timing requirements of a dedicated latch are at least characterized and can be checked.

Tip 21: Combinational Loops make you go in loops !!

The design should not contain any combinational feedback loops. They should be replaced by flip-flops or latches or be eliminated by fully enumerating RTL conditionals.

7. Using reset correctly

All the register elements in the design should be reset using a system-wide/global reset, because it establishes an initial state and initializes all the register elements. This makes logic simulation significantly easier and avoids finite state machine (FSM) latch-up (dead) states.


Tip 22: Use system wide/global reset

Resets can be either synchronous or asynchronous. Most FPGA devices have a built-in power-on-reset (POR) function and dedicated reset signals for the flip-flops. These should be used efficiently. If your design contains any asynchronous resets, which work independently of the clock, you should use them, since they put the entire design in a known state.

Tip 23: Never use Set-Reset flip-flops (SR Flip Flops) in design

Designers should not use SR (set-reset) flip-flops in their design, since circuit behavior is unpredictable when both set and reset are asserted at the same time. The code below uses an SR flip-flop.

  always @(posedge p_glbclk or negedge p_reset_n)
  begin
    if (~p_reset_n)
      output_ff <= 1'b0;
    else
    begin
      if (set_cond)
        output_ff <= 1'b1;
      if (reset_cond)
        output_ff <= 1'b0;
    end
  end

Figure 16: Set Reset Flip Flop used in Design (set_cond drives S and reset_cond drives R of the F/F; clock p_glbclk, reset p_reset_n, output Q is output_ff)

Use this instead:


  always @(posedge p_glbclk or negedge p_reset_n)
  begin
    if (~p_reset_n)
      output_ff <= 1'b0;
    else
      if (set_cond)
        output_ff <= 1'b1;
      else if (reset_cond)
        output_ff <= 1'b0;
  end

Figure 17: Set Reset pins of Flip Flop used instead of SR Flip Flop (D flip-flop with set_cond on Set and reset_cond on Clear; clock p_glbclk, reset p_reset_n, output Q is output_ff)

In the above example, a D flip-flop is used instead of an SR flip-flop; ‘set_cond’ and ‘reset_cond’ drive the synchronous set and synchronous clear of the flip-flop respectively, with ‘set_cond’ taking priority when both are asserted.

8. Clock skew problem

Tiny differences in propagation delay, when compounded across all the clock nets in a complex digital product, often lead to unacceptable degradations in overall system-timing margins. This generic problem is often referred to as the "clock skew" problem. As shown in the figure below, a small wire delay ‘w’ is inserted so that the clock on which the second flip-flop is triggered (CLKB) is a slightly delayed version of the clock on which the first flip-flop is triggered (CLKA).


Figure 18: Problem of Clock Skew (F/FA, clocked by CLKA, feeds F/FB through combinational delay tc; CLKB is CLKA delayed by wire delay ‘w’; the waveforms mark clock edges A, B and A’)

Let's look in more detail at how this clock skew problem affects the clock period and hold time constraints. If the clock-to-output delay of F/FA (tCQ) is less than the propagation/wire delay (tw), then the data launched by F/FA on edge A, which F/FB is supposed to capture at edge B, is instead captured at A’ (see Figure 19), which is not the intended behavior.


Figure 19: Malfunctioning of Flip Flops due to clock skew (waveforms: CLK A, CLK B, with ts, tCQ and tw marked; here tCQ < tW, so F/FB is triggered at A' instead of B)

This problem of clock skew can be resolved in the following two ways:

Tip 24: Resolution of Clock Skew Problem

1. By inserting a dummy delay (tC) between F/FA and F/FB (Figure 18), so that we have tCQ + tC > tW and F/FB is triggered correctly at B instead of at A’ when the data is presented to F/FA on edge A.

2. By applying the clock in the reverse direction w.r.t. the data, so that the skew problem is automatically eliminated.

Figure 20: Resolution of Clock Skew Problem (two D flip-flops, with DATA and CLK routed in opposite directions)


(a) Effect of Clock Skew on Max. Frequency

Consider the following parameters of the F/FA and F/FB shown in Figure 19:

(ts)A: Setup time for F/FA
(tH)A: Hold time for F/FA
(tCQ)A: Clock to output time for F/FA (this includes Hold time)
(ts)B: Setup time for F/FB
(tH)B: Hold time for F/FB
(tCQ)B: Clock to output time for F/FB (this includes Hold time)
tc: combinational delay between F/FA and F/FB
tw: propagation delay due to wire

Ignoring the delay time due to wire (tw = 0), we have CLKA = CLKB = CLK. The above parameters are shown in the figure below w.r.t. CLK.

Figure 21: Flip Flop Parameters ((ts)A, (tCQ)A, (ts)B and (tCQ)B shown with respect to CLK)

Since the maximum frequency is the inverse of the minimum clock period between two register elements, we have the following relationship (ignoring tw):

Fmax (ignoring tw) = 1 / ((tCQ)A + tC + (tS)B)        Eq. (1)

Now if we consider the propagation delay due to the wire (tw), the clock edge at F/FB arrives tw later, so the setup requirement at F/FB is relaxed by tW and this value is subtracted from the denominator of Eq. (1):

Fmax (with tw) = 1 / ((tCQ)A + tC + (tS)B – tW)

Thus clock skew in this direction results in increasing the max frequency.
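As a worked illustration of the two expressions above, take the following values, which are assumed purely for the arithmetic and are not taken from any device datasheet: (tCQ)A = 2 ns, tC = 5 ns, (tS)B = 1 ns and tW = 0.5 ns.

```latex
\begin{align*}
F_{max}\,(t_W = 0) &= \frac{1}{(t_{CQ})_A + t_C + (t_S)_B}
                    = \frac{1}{(2 + 5 + 1)\ \text{ns}}
                    = \frac{1}{8\ \text{ns}} = 125\ \text{MHz} \\[4pt]
F_{max}\,(t_W = 0.5\ \text{ns}) &= \frac{1}{(t_{CQ})_A + t_C + (t_S)_B - t_W}
                    = \frac{1}{(2 + 5 + 1 - 0.5)\ \text{ns}}
                    = \frac{1}{7.5\ \text{ns}} \approx 133.3\ \text{MHz}
\end{align*}
```

With these assumed numbers the condition from Tip 24, tCQ + tC > tW, also holds (7 ns > 0.5 ns), so F/FB still captures the data on the intended edge.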

9. Synchronous Design

Synchronous designs are characterized by a single master clock and a single master set/reset driving all sequential elements in the design.


Tip 25: The road from FPGA to ASIC is called “Synchronous Design”

It is highly recommended that the design be synchronous if one wants to migrate from FPGA to ASIC, as this leaves maximum clock frequency, input set-up and hold time, and clock-to-output timing as the only timing issues. In a synchronous design, all input signals are synchronized to the clock in such a way that they never violate set-up and hold time requirements.

10. Multiple Clock Domain

Exchanging information between two independent system clock domains can be treacherous. You must assume that the data exported from the first domain will be received asynchronously by the second.

Figure 22: Dual Clock Domain (System X, clocked by xclk, passes DATA to System Y, clocked by yclk)

Tip 26: Use synchronizer to pass control signals between different clock domains

All control signals crossing a clock domain boundary must pass through a set of synchronizers to avoid metastability. Exchanging information between two independent system clock domains must be done very carefully to avoid losing or corrupting data during a transfer. The two clocks may run at different frequencies, and one may even stop. One reliable way to pass data is to use handshake signaling.
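A synchronizer of the kind Tip 26 calls for is commonly built from two back-to-back flip-flops clocked by the receiving domain. The sketch below assumes a single-bit, level-type control signal and an active-low asynchronous reset; the module and port names are illustrative and do not come from this paper.

```verilog
// Two-flip-flop synchronizer for a single-bit control signal.
// Module and port names are illustrative assumptions.
module sync_2ff (
    input  wire clk_dst,   // clock of the receiving domain
    input  wire rst_n,     // active-low asynchronous reset (assumed convention)
    input  wire async_in,  // control signal coming from the other clock domain
    output wire sync_out   // copy of async_in that is safe to use on clk_dst
);
    reg [1:0] sync_ff;

    always @(posedge clk_dst or negedge rst_n) begin
        if (!rst_n)
            sync_ff <= 2'b00;
        else
            // Stage 0 may go metastable; stage 1 gives it a full
            // clock period to settle before the value is used.
            sync_ff <= {sync_ff[0], async_in};
    end

    assign sync_out = sync_ff[1];
endmodule
```

This form suits single-bit control signals only. The bits of a multi-bit bus passed through independent synchronizers may resolve in different cycles, which is why data is better moved with handshake signaling or a FIFO.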


Figure 23: Handshaking signaling (System X, clocked by xclk, drives DATA and xreq to System Y, clocked by yclk; System Y returns yack)

The figure above shows data transfer through handshake signaling. The transmitting clock domain (System X) sends both data and a data-ready signal (“xreq” above). The receiving clock domain (System Y) captures the incoming data when it sees the data-ready signal and, after the data is safely captured, responds with an acknowledgement (“yack” above). The transmitter waits until it sees the acknowledgement before starting the cycle over again.

One disadvantage of handshake signaling is that the latency, in clock cycles, of a single data transfer is quite high. An alternative is to pass the data through an asynchronous FIFO. This method ensures reliable data transfer with comparatively low latency but uses more system resources.
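The request/acknowledge sequence described above can be sketched in Verilog roughly as follows. This is a minimal four-phase handshake under stated assumptions, not the authors' implementation: the signal names xclk, yclk, xreq and yack come from Figure 23, while the module names, the WIDTH parameter, the active-low reset and the send, busy and dout_valid signals are hypothetical. Each side passes the incoming control signal through a two-flop synchronizer before using it.

```verilog
// Transmitting side (System X). Names other than xclk/xreq/yack are assumed.
module hs_sender #(parameter WIDTH = 8) (
    input  wire             xclk,
    input  wire             rst_n,
    input  wire             send,   // one-cycle "transfer din" request
    input  wire [WIDTH-1:0] din,
    input  wire             yack,   // asynchronous to xclk
    output reg              xreq,
    output reg  [WIDTH-1:0] xdata,  // held stable while xreq is high
    output wire             busy
);
    reg [1:0] ack_sync;             // two-flop synchronizer for yack

    assign busy = xreq | ack_sync[1];

    always @(posedge xclk or negedge rst_n) begin
        if (!rst_n) begin
            ack_sync <= 2'b00;
            xreq     <= 1'b0;
            xdata    <= {WIDTH{1'b0}};
        end else begin
            ack_sync <= {ack_sync[0], yack};
            if (send && !busy) begin
                xdata <= din;       // data and data-ready are launched together
                xreq  <= 1'b1;
            end else if (xreq && ack_sync[1]) begin
                xreq  <= 1'b0;      // acknowledgement seen: withdraw the request
            end
        end
    end
endmodule

// Receiving side (System Y).
module hs_receiver #(parameter WIDTH = 8) (
    input  wire             yclk,
    input  wire             rst_n,
    input  wire             xreq,   // asynchronous to yclk
    input  wire [WIDTH-1:0] xdata,
    output reg              yack,
    output reg  [WIDTH-1:0] dout,
    output reg              dout_valid  // one-cycle pulse when dout is updated
);
    reg [1:0] req_sync;             // two-flop synchronizer for xreq

    always @(posedge yclk or negedge rst_n) begin
        if (!rst_n) begin
            req_sync   <= 2'b00;
            yack       <= 1'b0;
            dout       <= {WIDTH{1'b0}};
            dout_valid <= 1'b0;
        end else begin
            req_sync   <= {req_sync[0], xreq};
            dout_valid <= 1'b0;
            if (req_sync[1] && !yack) begin
                dout       <= xdata; // stable for the whole synchronizer delay
                dout_valid <= 1'b1;
                yack       <= 1'b1;
            end else if (!req_sync[1]) begin
                yack       <= 1'b0;  // request withdrawn: complete the cycle
            end
        end
    end
endmodule
```

Because xdata is launched together with xreq and held until the acknowledgement returns, it has been stable for at least a full yclk cycle by the time the synchronized request is seen, so the data bus itself needs no synchronizer. The round trip through a synchronizer in each direction is what makes the latency of a single transfer high.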

Tip 27: Data transfer between two clock domains

Data transfer between different clock domains must be done either through handshake signaling or through an asynchronous FIFO to ensure reliable data transfer.

11. Designing for Portability (FPGA to ASIC)

An engineer should assume that the design may one day go into an ASIC and should design accordingly. This section covers most of the issues that arise when a design is migrated from FPGA to ASIC. The following are some of the points to look at while migrating to ASIC.

1. A successful FPGA design may contain asynchronous components such as loops, delay buffers etc. that are unlikely to work after an ASIC migration. So it is best to keep the whole design synchronous.
2. Use only one external clock and avoid gating it or generating additional internal clocks from it.
3. Use a single external reset.
4. Consider all possible values in your decoding logic and state machines and provide a path from each unused option to a known initial state. (Section 1)
5. Synchronize inputs to the clock. (Section 9)
6. Avoid latches and combinational feedback loops. (Section 6)
7. If the design contains multiple clocks, they should be used with care. All signals crossing a clock domain boundary must pass through synchronizers. (Section 10)
8. Use resets carefully. (Section 7)
9. Use DLL/PLL to improve performance. (Section 4)
10. FPGAs generally contain a dedicated portion of memory that can be used for implementing CAM (Content Addressable Memory), RAM, ROM etc. Avoid using these FPGA-specific features, since they might pose a problem when migrating from FPGA to ASIC.
11. Avoid using proprietary IP.

The above points can be used as a checklist when migrating from FPGA to ASIC. The next section details the implementation of internal memories in an FPGA, taking ALTERA FPGAs as a specific example.

12. Implementing Internal Memories in FPGA

FPGAs are generally organized into Logic Cells (or Logic Elements) and memory portions (Embedded System Blocks, ESBs, in ALTERA FPGAs). Generally, logic elements such as flip-flops, muxes etc. are targeted to LCs, which are scarce in the FPGA. The abundance of ESBs enables implementation of multiple wide memory blocks for high-density designs. The high speed of the ESB ensures that it can implement small memory blocks without any speed penalty.

Note: Using internal memory achieves a significant gain in area and can also ease timing issues.

Figure 24: Section of Internal Architecture of ALTERA FPGA


This section aims to give the reader an insight into how to use the FPGA more efficiently by implementing pieces of logic in the ESBs (internal memory bits in ALTERA FPGAs). We take ALTERA FPGAs as representative and discuss the concepts with regard to those devices. This gives an insight into typical usage and better implementation schemes on FPGAs. With some FPGA/tool-specific tailoring, this approach should work on any FPGA.

The ESBs may be used to implement content-addressable memory (CAM), RAM and ROM in Altera devices. The generic, scalable nature of each of these functions ensures that you can use them to implement any supported type of CAM, RAM, FIFO or ROM. In general, use of these components saves LC (Logic Cell) count, but the access time overhead increases. Hence, time budgeting should be done before using these blocks. It is also recommended that, in the case of RAMs, synchronous components be used. Using asynchronous components in a RAM saves some clock cycles but may cause clock skew problems during synthesis.

(a) Implementing CAM (Content Addressable Memory)

CAM (Content Addressable Memory) is generally used for pattern-matching applications. A CAM may be viewed as the inverse of a RAM. To a RAM we give an address as input and in return get the data stored at that location. To a CAM we give data as input and get the corresponding address in case the data matches any location in the CAM. The output of the CAM is validated by the “match_found” signal.

Figure 25: Content Addressable Memory (a CAM takes DATA, the input pattern, and returns an Address and a Match Flag; a RAM takes an Address and returns DATA)

Note: Not all FPGAs support CAM. We must refer to the Datasheet of the device before choosing one for the application.
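To make the inverse-of-RAM behaviour concrete, here is a minimal behavioural sketch of a single-match CAM written with plain registers, so it would map to logic cells rather than to an ESB. It is not the Altera CAM function: the port names mirror those used for the CAM elsewhere in this paper (pattern, wraddress, wren, inclock, maddress, mfound), but the module name, the parameters, the reset input, the valid bits, the one-cycle write and the lowest-address priority are assumptions made for illustration.

```verilog
// Behavioural single-match CAM built from registers (assumed structure).
module cam_sketch #(
    parameter WIDTH = 8,   // pattern width
    parameter DEPTH = 8,   // number of entries
    parameter AW    = 3    // address width, chosen so that 2**AW >= DEPTH
) (
    input  wire             inclock,
    input  wire             reset,      // synchronous, clears all entries
    input  wire             wren,
    input  wire [AW-1:0]    wraddress,
    input  wire [WIDTH-1:0] pattern,    // data to store or to search for
    output reg  [AW-1:0]    maddress,   // address of the matching entry
    output reg              mfound      // high when maddress is valid
);
    reg [WIDTH-1:0] mem [0:DEPTH-1];
    reg [DEPTH-1:0] valid;              // marks entries that have been written
    integer i;

    always @(posedge inclock) begin
        if (reset) begin
            valid    <= {DEPTH{1'b0}};
            mfound   <= 1'b0;
            maddress <= {AW{1'b0}};
        end else if (wren) begin
            mem[wraddress]   <= pattern;
            valid[wraddress] <= 1'b1;
            mfound           <= 1'b0;
        end else begin
            // Compare the input pattern with every valid entry.
            mfound   <= 1'b0;
            maddress <= {AW{1'b0}};
            // Counting down makes the lowest matching address win,
            // because the last nonblocking assignment takes effect.
            for (i = DEPTH - 1; i >= 0; i = i - 1)
                if (valid[i] && (mem[i] == pattern)) begin
                    mfound   <= 1'b1;
                    maddress <= i[AW-1:0];
                end
        end
    end
endmodule
```

A register-based CAM like this spends a storage element on every bit plus a comparator per entry, which is the cost that an embedded CAM block is meant to avoid.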

i. Resource Usage:

The synthesis tool normally implements the logic in LCs (Logic Cells) instead of ESBs unless told otherwise. The user must set the parameter USE_EAB to “ON” to make the tool implement the logic inside ESBs.

When USE_EAB = “OFF”, it takes one Logic Cell (LC) per memory bit.


When USE_EAB = “ON”, it takes one embedded cell (ESB) per data output bit. One ESB block can implement a 32-word, 32-bit CAM. Wider or deeper CAMs can be implemented by combining multiple CAMs with some ancillary logic implemented in the LEs.

ii. Writing to CAM

Like a ROM, a CAM needs to be initialized before it can be used. The CAM can be initialized in two ways: through the ports, or by specifying a pre-generated Memory Initialization File (MIF or HEX file).

Note: A CAM takes 2 clock cycles to write a single data word.

iii. Reading from CAM

To read patterns/addresses from the CAM, three different modes may be used:
• Single-Match mode
• Multiple-Match mode
• Fast Multiple-Match mode

Single-Match Mode: In single-match mode, only one clock cycle is needed to read the stored data from the CAM. When a match is found, the match flag “mfound” is asserted and the address is present on “maddress[ ]”.

Note: In single-match mode, there should not be multiple stored patterns that match the same input pattern.

This is illustrated below with an example. Here an 8-bit input data pattern is matched against the CAM entries. The CAM returns the address at which the input data pattern matches the corresponding entry (here the entry at location 5 matches, so “101” is returned as the matched address). The “mfound” signal is asserted, validating the output address (“maddress”).

Figure 26: Single Match Mode (CAM) (an 8-bit Data (pattern) input is compared against eight CAM entries at address locations 000 to 111; the outputs are maddress[2:0] = "101" and mfound = '1')


Multiple-Match Mode: In multiple-match mode, the CAM takes two clock cycles to read the stored data. When a match occurs, the match flag “mfound” is asserted and the address is present on “maddress[ ]” after a delay of two clock pulses. The next match can be seen by asserting “mnext” and holding it high for two clocks after “mstart”.

Fast Multiple-Match Mode: This mode is similar to multiple-match mode except that it takes only one clock cycle to read from the CAM and generate valid outputs.

The waveform below shows an example of single-match mode in which the CAM is first initialized (data is written into it) and then matched against an input data pattern.

Figure 27: Timing diagram for Read/Write into CAM (waveforms for clk, pattern, wraddr, wren, wrx, mfound and maddr; annotations mark a don't-care bit and a match due to the don't-care bit)

(b) Implementing RAM/ROM

Dual-port RAM/ROM is supported in ALTERA FPGAs. For RAMs, we strongly recommend using synchronous rather than asynchronous RAM functions. Both ports (read and write) of the RAM should be synchronous. We have seen clock skew problems while synthesizing asynchronous components; making the ports synchronous helps avoid these problems.

For ROMs, either synchronous or asynchronous components can be used as needed. Unlike RAMs, synchronous ports are not necessary in a ROM: we have not seen any clock skew problems while synthesizing asynchronous ROM components.


Figure 28: RAM & ROM in Altera FPGAs (RAM ports: Read Enable, Read Address, Write Enable, Write Address, Input Data, Clock and Data Output; ROM ports: Address, Clock and Data Output)

In general, use of these components saves LC count.

Note: Making the ports synchronous costs extra clock cycles per operation, while asynchronous components add access time overhead. Hence, time budgeting should be done before using these blocks.

i. Resource Usage:

The synthesis tool normally implements the logic in LCs (Logic Cells) instead of ESBs unless told otherwise. The user must set the parameter USE_EAB to “ON” to make the tool implement the logic inside ESBs.

When USE_EAB = “OFF”, it takes one Logic Cell (LC) per memory bit.

When USE_EAB = “ON”, it takes one embedded cell (ESB) per data output bit.

ii. Writing to RAM

The RAM can be initialized in two ways: through the ports, or by specifying a Memory Initialization File (MIF or HEX file).

iii. Reading from RAM

Simple: use the ports.
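A synchronous RAM of the kind recommended in this section can also be described in plain Verilog. The sketch below is an illustration under stated assumptions, not the Altera RAM function: the port names follow the ram instance in the sample DUT of this paper, while the module name, the DW and AW parameters and the INIT_FILE option are hypothetical. Whether a given synthesis tool maps this description to embedded memory, and whether it honours $readmemh initialization, depends on the tool and should be checked in its documentation.

```verilog
// Simple dual-port RAM with synchronous write and synchronous read.
// Module name, parameters and INIT_FILE are assumptions of this sketch.
module ram_sketch #(
    parameter DW        = 32,  // data width
    parameter AW        = 5,   // address width (2**AW words)
    parameter INIT_FILE = ""   // optional hex file with initial contents
) (
    input  wire          clock,
    input  wire          wren,
    input  wire [AW-1:0] wraddress,
    input  wire [DW-1:0] data,
    input  wire          rden,
    input  wire [AW-1:0] rdaddress,
    output reg  [DW-1:0] q
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    // Initialization from a file, the second of the two methods named above.
    initial begin
        if (INIT_FILE != "")
            $readmemh(INIT_FILE, mem);
    end

    always @(posedge clock) begin
        // Initialization or update through the write port.
        if (wren)
            mem[wraddress] <= data;
        // Registered read: q changes one clock after rdaddress is applied.
        // A read of the address being written returns the old contents.
        if (rden)
            q <= mem[rdaddress];
    end
endmodule
```

Dropping the write port and keeping only the initialization gives a synchronous ROM in the same style.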


(c) Example of a sample DUT

The following design is a sample illustrating the use of the LPM CAM/ROM/RAM. A 2x32 CAM, a 32x32 RAM and a 32x32 ROM are instantiated in the top-level entity. Synchronous ports are used in the RAM and ROM. The data flow is as follows: the incoming pattern is first matched against the CAM, and then the CAM match address is decoded. This is a typical use of a CAM for pattern matching. Data is read from the RAM/ROM and, depending on the result of the pattern match, data from the RAM or the ROM is passed to the output port, as shown in Figure 29.

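Figure 29: A Sample DUT (inside the DUT, p_pattern[31:0] feeds the CAM (2x32), which produces camadd and cam_mfound; p_address[4:0] addresses the ROM (32x32) and the RAM (32x32), which produce s_romdataout and s_ramdataout; a data selector chooses between s_romdataout, s_ramdataout and all zeros to drive p_dataout[31:0])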

Shown below is the sample DUT code for the above example

module sample (
    // Inputs
    p_clk,
    p_reset,
    p_address,
    p_pattern,
    // Outputs
    p_dataout
);

    // Inputs
    input         p_clk;      // Clock
    input         p_reset;    // Async Reset signal
    input  [4:0]  p_address;  // Read address for the RAM/ROM
    input  [31:0] p_pattern;  // Pattern to be matched by the CAM

    // Outputs
    output [31:0] p_dataout;  // Output Data
    reg    [31:0] p_dataout;  // reg, because it is assigned in an always block

    // wire declaration
    // (the CAM/RAM write-side signals are left undriven in this sample)
    wire        s_pattern1;
    wire        s_pattern2;
    wire        cam_wraddress;
    wire        cam_wren;
    wire        camadd;
    wire        cam_mfound;
    wire [31:0] s_ramwrdata;
    wire [4:0]  s_ramwraddress;
    wire        s_ramwren;
    wire        s_ramrden;
    wire [31:0] s_ramdataout;
    wire [31:0] s_romdataout;

    // Port Mapping
    cam cam00 (
        .pattern   (p_pattern),
        .wraddress (cam_wraddress),
        .wren      (cam_wren),
        .inclock   (p_clk),
        .maddress  (camadd),
        .mfound    (cam_mfound)
    );

    ram ram00 (
        .clock     (p_clk),
        .data      (s_ramwrdata),
        .wraddress (s_ramwraddress),
        .rdaddress (p_address),
        .wren      (s_ramwren),
        .rden      (s_ramrden),
        .q         (s_ramdataout)
    );

    rom rom00 (
        .clock     (p_clk),
        .rdaddress (p_address),
        .q         (s_romdataout)
    );

    // Decode the CAM match address: entry 0 selects the ROM, entry 1 the RAM
    assign s_pattern1 = (~camadd) && cam_mfound;
    assign s_pattern2 = camadd && cam_mfound;

    // Data selector: ROM data, RAM data, or all zeros when nothing matches
    always @ (s_pattern1, s_pattern2, s_romdataout, s_ramdataout)
    begin : dataout_gen
        if (s_pattern1)
            p_dataout[31:0] = s_romdataout[31:0];
        else if (s_pattern2)
            p_dataout[31:0] = s_ramdataout[31:0];
        else
            p_dataout[31:0] = {32{1'b0}};
    end

endmodule

13. Conclusion

In summary, by following good coding practices and applying the techniques given in this paper, a design engineer can build a robust and efficient design. What matters most is how you proceed with your design. For example, if you are targeting an FPGA, then using its internal memories is good, but for portability to ASIC one must instead describe those memories in code.

14. Acknowledgements

We would like to thank everyone who helped in the successful writing of this paper. We are especially grateful to Mr. Jitendra Puri for reviewing the document and providing valuable feedback and comments at short notice.

15. References

[1] Clifford E. Cummings, “State Machine Coding Styles for Synthesis”, SNUG 1998 (Synopsys Users Group Conference, 1998) User Papers.
[2] Peter Chambers (Engineering Fellow, VLSI Technology), “Ten Commandments of Excellent Design”.
[3] Morris M. Mano, “Digital Design”.
[4] Altera Quartus Software Documentation for Implementation of Internal Memories.
[5] Xilinx, “Xilinx Design Reuse Methodology for ASIC and FPGA Designers”.
[6] AMIS Semiconductor, “Application specifics Newsletter”.
[7] Clifford E. Cummings, “Nonblocking Assignments in Verilog Synthesis, Coding Styles That Kill!”, SNUG 2000 (Synopsys Users Group Conference, 2000) User Papers.

16. Author & Contact information

For any further information please contact:
Mohit Arora ([email protected]), Design Engineer
Prashant Bhargava ([email protected]), Design Engineer
Amit Srivastava ([email protected]), Design Engineer