8/2/2019 Very Good Notes-up2
E&CE 427: Digital Systems Engineering
Course Notes
Mark Aagaard
2006t3Fall
University of Waterloo
Dept of Electrical and Computer Engineering
September 18, 2006
CONTENTS
P10.7.3 Reality
11 Problems on Faults, Testing, and Testability
P11.1 Based on Smith q14.9: Testing Cost
P11.2 Testing Cost and Total Cost
P11.3 Minimum Number of Faults
P11.4 Smith q14.10: Fault Collapsing
P11.5 Mathematical Models and Reality
P11.6 Undetectable Faults
P11.7 Test Vector Generation
P11.7.1 Choice of Test Vectors
P11.7.2 Number of Test Vectors
P11.8 Time to do a Scan Test
P11.9 BIST
P11.9.1 Characteristic Polynomials
P11.9.2 Test Generation
P11.9.3 Signature Analyzer
P11.9.4 Probability of Catching a Fault
P11.9.5 Probability of Catching a Fault
P11.9.6 Detecting a Specific Fault
P11.9.7 Time to Run Test
P11.10 Power and BIST
P11.11 Timing Hazards and Testability
P11.12 Testing Short Answer
P11.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?
P11.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?
P11.13 Fault Testing
P11.13.1 Design test generator
P11.13.2 Design signature analyzer
P11.13.3 Determine if a fault is detectable
P11.13.4 Testing time
Part I
Course Notes
Chapter 1
VHDL: The Language
1.1 Introduction to VHDL
1.1.1 Levels of Abstraction
There are many different levels of abstraction for working with hardware:
Quantum: Schrödinger's equations describe movement of electrons and holes through material.
Energy band: 2-dimensional diagrams that capture essential features of Schrödinger's equations. Energy-band diagrams are commonly used in nano-scale engineering.
Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network. Overall behaviour is defined by differential equations in terms of the resistors and capacitors. Spice is a typical simulation tool.
Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equations are used, rather than differential equations. A rising edge may be modeled as a linear rise over some range of time, or the time between a definite low value and a definite high value may be modeled as having an undefined or rising value.
Gate: Transistors are grouped together into gates (e.g. AND, OR, NOT). Voltages are discrete values such as pure Boolean (0 or 1) or IEEE Standard Logic 1164, which has representations for different types of unknown or undefined values. Time may be continuous or may be discrete. If discrete, a common unit is the delay through a single inverter (e.g. a NOT gate has a delay of 1 and an AND gate has a delay of 2).
numeric_bit defines arithmetic over bit vectors and integers. We won't use bit
signals in this course, so you don't need to worry about this package.
1.1.3 Semantics
The original goal of VHDL was to simulate circuits. The semantics of the language define circuit
behaviour.
[Figure: circuit with signals a, b, and c, and its simulation]
  determine which parts of the library are externally visible
Use clause
  use a library in an entity/architecture or another package
  technically, use clauses are part of entities and packages, but they precede the entity/package keyword, so we list them as top-level constructs
Entity (section 1.3.3)
  define interface to circuit
Architecture (section 1.3.3)
  define internal signals and gates of circuit
1.3.3 Entities and Architecture
Each hardware module is described with an Entity/Architecture pair
Figure 1.1: Entity and Architecture
Entity: interface. Names, modes (in / out), and types of externally visible signals of the circuit.
Architecture: internals. Structure and behaviour of the module.
library ieee;
use ieee.std_logic_1164.all;
entity and_or is
  port (
    a, b, c : in std_logic;
    z : out std_logic
  );
end and_or;
Figure 1.2: Example of an entity
The syntax of VHDL is defined using a variation on Backus-Naur forms (BNF).
[ { use_clause } ]
entity ENTITYID is
  [ port (
      { SIGNALID : (in | out) TYPEID [ := expr ] ; }
    );
  ]
  [ { declaration } ]
  [ begin
      { concurrent_statement } ]
end [ entity ] ENTITYID ;
Figure 1.3: Simplified grammar of entity
architecture main of and_or is
signal x : std_logic;
begin
x
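A minimal sketch of a complete entity/architecture pair for and_or may help here. The entity repeats Figure 1.2; the Boolean function in the architecture (z = (a AND b) OR c) is an assumption made for illustration and is not taken from the notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity and_or is
  port (
    a, b, c : in  std_logic;
    z       : out std_logic
  );
end and_or;

-- Assumed function: z = (a AND b) OR c. The original figure may differ.
architecture main of and_or is
  signal x : std_logic;  -- internal signal, not visible outside the entity
begin
  x <= a and b;          -- concurrent assignment: the AND gate
  z <= x or c;           -- concurrent assignment: the OR gate
end main;
```

The entity fixes only the interface; a different architecture could implement another function behind the same ports.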
1.3.4 Concurrent Statements
Architectures contain concurrent statements.
Concurrent statements execute in parallel (Figure 1.6).
Concurrent statements make VHDL fundamentally different from most software languages.
Hardware (gates) naturally execute in parallel; VHDL mimics the behaviour of real hardware.
At each infinitesimally small moment of time, each gate:
1. samples its inputs
2. computes the value of its output
3. drives the output
architecture main of bowser is
begin
x1
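As a hedged illustration of what "execute in parallel" implies, the sketch below gives two architectures for a hypothetical entity demo_conc (the entity, its ports, and the gate functions are assumptions, not the bowser example from the notes). The same three concurrent assignments appear in opposite orders, and both architectures describe the same hardware.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity demo_conc is   -- hypothetical entity, for illustration only
  port (
    a, b : in  std_logic;
    z    : out std_logic
  );
end demo_conc;

architecture order1 of demo_conc is
  signal x1, x2 : std_logic;
begin
  x1 <= a and b;
  x2 <= not x1;
  z  <= not x2;
end order1;

-- Same statements in reverse textual order: identical behaviour,
-- because each statement re-evaluates whenever its inputs change.
architecture order2 of demo_conc is
  signal x1, x2 : std_logic;
begin
  z  <= not x2;
  x2 <= not x1;
  x1 <= a and b;
end order2;
```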
1.3.5 Component Declaration and Instantiations
There are two different syntaxes for component declaration and instantiation. The VHDL-93 syn-
tax is much more concise than the VHDL-87 syntax.
Not all tools support the VHDL-93 syntax. For E&CE 427, some of the tools that we use do not
support the VHDL-93 syntax, so we are stuck with the VHDL-87 syntax.
1.3.6 Processes
Processes are used to describe complex and potentially unsynthesizable behaviour
A process is a concurrent statement (Section 1.3.4).
The body of a process contains sequential statements (Section 1.3.7)
Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)
process (a, b, c)
begin
y
1.3.8 A Few More Miscellaneous VHDL Features
Some constructs that are useful and will be described in later chapters and sections:
report : print a message on stderr while simulating
assert : assertions about behaviour of signals, very useful with report statements.
generics : parameters to an entity that are defined at elaboration time.
attributes : predefined functions for different datatypes. For example: high and low indices of a
vector.
1.4 Concurrent vs Sequential Statements
All concurrent assignments can be translated into sequential statements. But, not all sequential
statements can be translated into concurrent statements.
1.4.1 Concurrent Assignment vs Process
The two code fragments below have identical behaviour:
architecture main of tiny is
begin
  b <= a;
end main;
architecture main of tiny is
begin
process (a) begin
b
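A self-contained sketch of the two equivalent forms, assuming a one-input, one-output entity for tiny (the port list is an assumption). The architectures are named conc and proc here, rather than main, only so that both can be compiled side by side.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity tiny is          -- port list assumed for illustration
  port (
    a : in  std_logic;
    b : out std_logic
  );
end tiny;

-- Concurrent assignment form.
architecture conc of tiny is
begin
  b <= a;
end conc;

-- Process form: the process resumes whenever a changes,
-- executes the sequential assignment, then suspends again.
architecture proc of tiny is
begin
  process (a) begin
    b <= a;
  end process;
end proc;
```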
    t <= ...;
  when ... =>
    t <= ...;
end case;
1.4.4 Coding Style
Code that's easy to write with sequential statements, but difficult with concurrent:
Sequential Statements
case ... is
  when ... =>
    if ... then
      o <= ...;
    else
      o <= ...;
    end if;
  when ... =>
    . . .
end case;
Concurrent Statements
Overall structure:
with ... select
  t <= ...
1.5 Overview of Processes
Processes are the most difficult VHDL construct to understand. This section gives an overview of
processes. Section 1.6 gives the details of the semantics of processes.
Within a process, statements are executed almost sequentially
Among processes, execution is done in parallel
Remember: a process is a concurrent statement!
entity ENTITYID is
  interface declarations
end ENTITYID ;
architecture ARCHID of ENTITYID is
begin
  concurrent statements
  process begin
    sequential statements
  end process;
  concurrent statements
end ARCHID;
Figure 1.11: Sequential statements in a process
Key concepts in VHDL semantics for processes:
  VHDL mimics hardware
  Hardware (gates) execute in parallel
  Processes execute in parallel with each other
  All possible orders of executing processes must produce the same simulation results (waveforms)
  If a signal is not assigned a value, then it holds its previous value
All orders of executing concurrent statements must
produce the same waveforms
It doesn't matter whether you are running on a single-threaded operating system, on a multi-threaded operating system, on a massively parallel supercomputer, or on a special hardware emulator with one FPGA chip per VHDL process: all simulations must be the same.
These concepts are the motivation for the semantics of executing processes in VHDL (Section 1.6)
and lead to the phenomenon of latch-inference (Section 1.5.2).
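The rule that an unassigned signal holds its previous value is what produces latch inference. The following is a minimal sketch under assumed names (the entity latch_demo and its ports are hypothetical), contrasting a process that leaves its target unassigned on one path with a process that assigns it on every path.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity latch_demo is   -- hypothetical entity, for illustration only
  port (
    en, d   : in  std_logic;
    q_latch : out std_logic;
    q_comb  : out std_logic
  );
end latch_demo;

architecture main of latch_demo is
begin
  -- q_latch is not assigned when en is not '1', so it must hold its
  -- previous value: synthesis tools typically infer a latch here.
  process (en, d) begin
    if en = '1' then
      q_latch <= d;
    end if;
  end process;

  -- q_comb is assigned on every path through the process,
  -- so no storage is needed and the logic is purely combinational.
  process (en, d) begin
    if en = '1' then
      q_comb <= d;
    else
      q_comb <= '0';
    end if;
  end process;
end main;
```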
[Figure: an architecture containing two processes, procA (stmtA1; stmtA2; stmtA3) and procB (stmtB1; stmtB2), with three possible execution sequences of A1, A2, A3, B1, B2: single threaded with procA before procB; single threaded with procB before procA; multithreaded with procA and procB in parallel]
Figure 1.12: Different process execution sequences
Figure 1.13: All execution orders must have same behaviour
Sections 1.5.1 to 1.5.3 discuss the hardware generated by processes.
Sections 1.6 to 1.6.5 discuss the behaviour and execution of processes.
1.5.1 Combinational Process vs Clocked Process
Each well-written synthesizable process is either combinational or clocked. Some synthesizable
processes that do not conform to our coding guidelines are both combinational and clocked. For
example, in a flip-flop with an asynchronous reset, the output is a combinational function of the
reset signal and a clocked function of the data input signal. We will deal only with processes
that follow our coding conventions, and so we will continue to say that each process is either
combinational xor clocked.
Combinational process:
  Executing the process takes part of one clock cycle
  Target signals are outputs of combinational circuitry
  A combinational process must have a sensitivity list
  A combinational process must not have any wait statements
  A combinational process must not have any rising_edge or falling_edge calls
  The hardware for a combinational process is just combinational circuitry
Clocked process:
  Executing the process takes one (or more) clock cycles
  Target signals are outputs of flops
  Process contains one or more wait or if rising_edge statements
  Hardware contains combinational circuitry and flip flops
Note: Clocked processes are sometimes called sequential processes,
but this can be easily confused with sequential statements, so in E&CE 427
we'll refer to synthesizable processes as either combinational or clocked.
Example Processes
Combinational Process
process (a,b,c)
p1
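A hedged sketch of the two kinds of process side by side, using a hypothetical entity proc_kinds; the logic function is made up for illustration and is not the example from the notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity proc_kinds is   -- hypothetical entity, for illustration only
  port (
    clk, a, b, c : in  std_logic;
    y_comb       : out std_logic;
    y_reg        : out std_logic
  );
end proc_kinds;

architecture main of proc_kinds is
begin
  -- Combinational process: sensitivity list names every signal read,
  -- no wait statements, no rising_edge. Target is a gate output.
  comb : process (a, b, c) begin
    y_comb <= (a and b) or c;
  end process;

  -- Clocked process: assignments happen only on the rising clock edge,
  -- so the target is the output of a flip-flop.
  reg : process (clk) begin
    if rising_edge(clk) then
      y_reg <= (a and b) or c;
    end if;
  end process;
end main;
```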
1.6.2.4 Delta-Cycle Definitions
Definition simulation step: Executing one sequential assignment or process mode
change.
Definition simulation cycle: The operations that occur in one iteration of the simulation
algorithm.
Definition delta cycle: A simulation cycle that does not advance simulation time.
Equivalently: A simulation cycle with zero-delay assignments where the assignment
causes a process to resume.
Definition simulation round: A sequence of simulation cycles that all have the same
simulation time. Equivalently: a contiguous sequence of zero or more delta cycles
followed by a simulation cycle that increments time (i.e., the simulation cycle is not a
delta cycle).
Note: Official and unofficial terminology. Simulation cycle and delta cycle
are official definitions in the VHDL Standard. Simulation step and simulation
round are not standard definitions. They are used in E&CE 427 because we need words to associate with the concepts that they describe.
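To make the definition of a delta cycle concrete, here is a small self-checking testbench sketch (the entity delta_demo and its signals are hypothetical). A change on a ripples through two zero-delay assignments; each hop costs one delta cycle, yet simulation time stays at 10 ns throughout.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity delta_demo is   -- hypothetical testbench, no ports
end delta_demo;

architecture sim of delta_demo is
  signal a, b, c : std_logic := '0';
begin
  b <= a;   -- zero-delay assignment: b follows a one delta cycle later
  c <= b;   -- c follows b one further delta cycle later

  stim : process begin
    wait for 10 ns;
    a <= '1';
    wait for 0 ns;                        -- next delta cycle: a is updated
    assert a = '1' and b = '0' report "b should not have changed yet";
    wait for 0 ns;                        -- next delta cycle: b is updated
    assert b = '1' and c = '0' report "c should not have changed yet";
    wait for 0 ns;                        -- next delta cycle: c is updated
    assert c = '1' report "c should now be 1";
    assert now = 10 ns report "delta cycles must not advance time";
    wait;                                 -- suspend forever
  end process;
end sim;
```

In the terminology above, all of these delta cycles belong to a single simulation round at 10 ns.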
1.6.3 Example 1: Process Execution (Bamboozle)
This example (Bamboozle) and the next example (Flummox, section 1.6.4) are very similar. The
VHDL code for the circuit is slightly different, but the hardware that is generated is the same. The
stimulus for signals a and b also differs.
entity bamboozle is
begin
end bamboozle;
architecture main of bamboozle is
  signal a, b, c, d : std_logic;
begin
  procA : process (a, b) begin
    c <= a AND b;
  end process;
  procB : process (b, c, d)
  begin
d
Initial conditions (Shown in slides, not in notes)
Step 1(a): Activate procA(Shown in slides, not in notes)
Begin next simulation cycle (Shown in slides, not in notes)
Step 1(a): Activate procB (Shown in slides, not in notes)
Step 1(b): Provisional assignment to d (Shown in slides, not in notes)
Step 1(b): Provisional assignment to e (Shown in slides, not in notes)
Step 1(c): Suspend procB (Shown in slides, not in notes)
All processes suspended (Shown in slides, not in notes)
Begin next simulation cycle (Shown in slides, not in notes)
Step 1: No postponed processes (Shown in slides, not in notes)
1.6.4 Example 2: Process Execution (Flummox)
This example is a variation of the Bamboozle example from section 1.6.3.
entity flummox is
begin
end flummox;
architecture main of flummox is
  signal a, b, c, d : std_logic;
begin
  proc1 : process (a, b, c) begin
    c <= a AND b;
    d
Answer:
simulation step, delta cycle, simulation cycle, simulation round
Question: What is the order of granularity, from finest to coarsest, amongst the
different granularities related to delta-cycle simulation?
Answer:
Same order as listed just above. Note: delta cycles have a finer granularity than simulation cycles, because delta cycles do not advance time, while simulation cycles that are not delta cycles do advance time.
1.6.5 Example: Need for Provisional Assignments
This is an example of processes where updating signals during a simulation cycle leads to different
results for different process execution orderings.
architecture main of swindle is
begin
  p_c: process (a, b) begin
    c <= a AND b;
  end process;
  p_d: process (a, c) begin
    d <= a XOR c;
  end process;
end main;
Figure 1.18: Circuit to illustrate need for provisional assignments
1. Start with all signals at 0.
2. Simultaneously change to a = 1 and b = 1.
If assignments are not visible within the same simulation cycle (correct: i.e. provisional
assignments are used):
[Figure: two waveform diagrams for signals a, b, c, d and processes p_c, p_d, one per scheduling order]
  If p_c is scheduled before p_d, then d will have a 1 pulse.
  If p_d is scheduled before p_c, then d will have a 1 pulse.
If assignments are visible within the same simulation cycle (incorrect):
[Figure: two waveform diagrams for signals a, b, c, d and processes p_c, p_d, one per scheduling order]
  If p_c is scheduled before p_d, then d will stay constant 0.
  If p_d is scheduled before p_c, then d will have a 1 pulse.
With provisional assignments, both orders of scheduling processes result in the same behaviour
on all signals. Without provisional assignments, different scheduling orders result in different
behaviour.
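The correct (provisional-assignment) behaviour can be checked in a simulator with a small testbench. This is a sketch only: the wrapper entity swindle_tb, the initial values, and the 10 ns stimulus time are assumptions; the two processes are the ones from the swindle architecture.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity swindle_tb is   -- hypothetical testbench wrapper, no ports
end swindle_tb;

architecture sim of swindle_tb is
  signal a, b, c, d : std_logic := '0';   -- start with all signals at 0
begin
  p_c : process (a, b) begin
    c <= a and b;
  end process;

  p_d : process (a, c) begin
    d <= a xor c;
  end process;

  stim : process begin
    wait for 10 ns;
    a <= '1';          -- change a and b in the same simulation cycle
    b <= '1';
    wait for 0 ns;     -- a and b updated; p_c and p_d both see the old c
    wait for 0 ns;     -- c and d updated: d = 1 xor (old c = 0) = 1
    assert c = '1' and d = '1' report "expected start of 1 pulse on d";
    wait for 0 ns;     -- p_d has re-run with the new c: d = 1 xor 1 = 0
    assert d = '0' report "expected end of 1 pulse on d";
    wait;
  end process;
end sim;
```

Because the assignments to c and d take effect only at the end of each simulation cycle, the pulse on d appears regardless of the order in which the simulator happens to run p_c and p_d.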
1.6.6 Delta-Cycle Simulations of Flip-Flops
This example illustrates the delta-cycle simulation of a flip-flop. Notice how the delta-cycle simulation captures the expected behaviour of the flip flop: the signal q changes at the same time (10ns)
as the rising edge on the clock.
p_a : process begin
a
Testbenches and Clock Phases
env : process begin
a
RTL Simulation Technique
1. Pre-processing
(a) Separate processes into combinational and non-combinational (clocked and timed)
(b) Decompose each combinational process into separate processes with one target signal
per process
(c) Sort processes into topological order based on dependencies
2. For each clock cycle or unit of time:
(a) Run non-combinational processes in any order. Non-combinational assignments read
from earlier clock cycle / time step.
(b) Run combinational processes in topological order. Combinational assignments read
from current clock cycle / time step.
1.7.2 Examples of RTL Simulation
Combinational Process Decomposition
proc(a,b,c)
  if a = '1' then
    d <= b;
    e <= c;
  else
d
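Step 1(b) of the technique, decomposing a combinational process into one process per target signal, can be sketched as follows. The entity decomp_demo is hypothetical, and the else-branch values are assumptions, since they are not recoverable from the listing above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity decomp_demo is   -- hypothetical entity, for illustration only
  port (
    a, b, c : in  std_logic;
    d, e    : out std_logic
  );
end decomp_demo;

-- Before: one combinational process with two target signals.
architecture combined of decomp_demo is
begin
  process (a, b, c) begin
    if a = '1' then
      d <= b;
      e <= c;
    else
      d <= c;   -- assumed else-branch values
      e <= b;
    end if;
  end process;
end combined;

-- After: one process per target signal, same behaviour.
architecture decomposed of decomp_demo is
begin
  proc_d : process (a, b, c) begin
    if a = '1' then d <= b; else d <= c; end if;
  end process;

  proc_e : process (a, b, c) begin
    if a = '1' then e <= c; else e <= b; end if;
  end process;
end decomposed;
```

With one target per process, each process becomes a node in the dependency graph, which is what makes the topological sort in step 1(c) well defined.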
8. Run the timed process until suspend at wait for 99 ns;, which takes us from 3ns to
102ns.
9. Run combinational processes in topological order to calculate values on c, d, e from 3ns to
102ns.
Question: Draw the RTL waveforms that correspond to the delta-cycle waveform
below.
[Figure: delta-cycle waveform for signals a, b, c, d, e and processes proc1, proc2, proc3, with rows for delta cycle, sim cycle, and sim round, covering 0ns to 102ns]
Answer:
[Figure: RTL waveform for signals a, b, c, d, e, with the time axis marked at 0ns, 1ns, 2ns, 3ns, and 102ns]
Example: Communicating State Machines
Note: It is easier to do a simulation by hand if you start your clock at 0
and use the first clock phase in the waveform diagram for the first values that
your VHDL code assigns to signals.
Simulate If-Then-Else, Wait Until
huey: process
begin
clk
A Related Simulation
Small changes to the code can cause significant changes to the behaviour.
riri: process
begin
clk
1.8.3.3 Flops with Chip-Enable
The two code fragments below synthesize to identical hardware (flops with chip-enable lines).
If
process (clk)
begin
  if rising_edge(clk) then
    if (ce = '1') then
q
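A hedged sketch of two ways to write a flop with chip-enable, one with if rising_edge and one with a wait statement. The entity ce_flop and its port names are hypothetical, and the wait-based form shown here is one plausible counterpart rather than necessarily the second fragment from the notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity ce_flop is   -- hypothetical entity, for illustration only
  port (
    clk, ce, d : in  std_logic;
    q_if       : out std_logic;
    q_wait     : out std_logic
  );
end ce_flop;

architecture main of ce_flop is
begin
  -- If-style: sensitivity list plus rising_edge test.
  process (clk)
  begin
    if rising_edge(clk) then
      if (ce = '1') then
        q_if <= d;      -- load only when chip-enable is asserted
      end if;           -- otherwise q_if holds its value (flop with CE)
    end if;
  end process;

  -- Wait-style: no sensitivity list; the process suspends at the wait.
  process
  begin
    wait until rising_edge(clk);
    if (ce = '1') then
      q_wait <= d;
    end if;
  end process;
end main;
```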
(a) Flops use if statements
(b) Flops use wait statements
Some examples of these different options are shown in Figures 1.21 to 1.24.
[Figure: schematic with labels S, R, sel, reset, clk, c, and a]
entity and_not_reg is
  port (
    reset,
    clk,
    sel : in std_logic;
    c : out std_logic
  );
end;
Schematic and entity for examples of different code organizations in Figures 1.21 to 1.24
Figure 1.20: Schematic and entity for and_not_reg
One Process, Flops, Wait
architecture one_proc of and_not_reg is
  signal a : std_logic;
begin
  process begin
    wait until rising_edge(clk);
    if (reset = '1') then
a
Two Processes with If-Then-Else
architecture two_proc_if of and_not_reg is
  signal a : std_logic;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
a
1.10.4 Different Widths and Arithmetic
Table 1.2: Different Vector Widths and Arithmetic Operations (+, -)
target   src1/2   src2/1   result
narrow   wide              fails in elaboration
wide     narrow   int      fails in elaboration
wide     wide              OK
narrow   narrow   narrow   OK
narrow   narrow   int      OK
Example vectors
wide     unsigned(7 downto 0)
narrow   unsigned(4 downto 0)
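A short sketch of some rows of Table 1.2, assuming ieee.numeric_std and a hypothetical entity width_demo. The first two assignments are "OK" rows; the last two show one common way to repair a failing row with resize, which is a suggestion here rather than something the notes prescribe.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity width_demo is   -- hypothetical entity, for illustration only
  port (
    wide_in     : in  unsigned(7 downto 0);
    narrow_in   : in  unsigned(4 downto 0);
    wide_ok     : out unsigned(7 downto 0);
    narrow_ok   : out unsigned(4 downto 0);
    wide_fix    : out unsigned(7 downto 0);
    narrow_fix  : out unsigned(4 downto 0)
  );
end width_demo;

architecture main of width_demo is
begin
  -- wide target, wide source: OK. The sum of two unsigned vectors
  -- has the width of the wider operand (8 bits here).
  wide_ok <= wide_in + narrow_in;

  -- narrow target, narrow and integer sources: OK. The sum of a
  -- vector and an integer has the width of the vector (5 bits).
  narrow_ok <= narrow_in + 1;

  -- "wide <= narrow + int" fails (5-bit result, 8-bit target).
  -- Widening the vector first makes the widths agree.
  wide_fix <= resize(narrow_in, 8) + 1;

  -- "narrow <= ... wide ..." fails (8-bit result, 5-bit target).
  -- resize truncates the result, discarding the upper bits.
  narrow_fix <= resize(wide_in + narrow_in, 5);
end main;
```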
1.10.5 Overloading of Comparisons
Table 1.3: Overloading of Comparison Operations (=, /=, >=, >, <, <=)
1.11.1.4 Multiple if rising_edges in Same Process
Multiple if rising_edge statements in a process (UNSYNTHESIZABLE)
process (clk)
begin
  if rising_edge(clk) then
q0
Synthesizable Alternative
A synthesizable alternative to an if rising_edge statement in a for-loop is to put the if rising_edge
outside of the for-loop.
process (clk) begin
  if rising_edge(clk) then
    for i in 0 to 7 loop
q(i)
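A complete sketch of this pattern, with a hypothetical entity loop_reg; the source of each assignment (d(i)) is an assumption, since the right-hand side is not recoverable from the listing above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity loop_reg is   -- hypothetical entity, for illustration only
  port (
    clk : in  std_logic;
    d   : in  std_logic_vector(7 downto 0);
    q   : out std_logic_vector(7 downto 0)
  );
end loop_reg;

architecture main of loop_reg is
begin
  process (clk) begin
    if rising_edge(clk) then      -- single clock-edge test, outside the loop
      for i in 0 to 7 loop        -- loop is unrolled into eight flip-flops
        q(i) <= d(i);             -- assumed source: the matching bit of d
      end loop;
    end if;
  end process;
end main;
```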
1.13 VHDL Problems
P1.1 IEEE 1164
For each of the values in the list below, answer whether or not it is defined in the ieee.std_logic_1164
library. If it is part of the library, write a 2-3 word description of the value.
Values: -, #, 0, 1, A, h, H, L, Q, X, Z.
P1.2 VHDL Syntax
Answer whether each of the VHDL code fragments q2a through q2f is legal VHDL code.
NOTES: 1) ... represents a fragment of legal VHDL code.
2) For full marks, if the code is illegal, you must explain why.
3) The code has been written so that, if it is illegal, then it is illegal for both
simulation and synthesis.
q2a
architecture main of anchiceratops is
  signal a, b, c : std_logic;
begin
  process begin
    wait until rising_edge(c);
a p, b => q);
...
end main;
q2e
architecture main of pachyderm is
function inv(a : std_logic)
return std_logic is
begin
return(NOT a);
end inv;
signal p, b : std_logic;
begin
p a);
...
end main;
q2f
architecture main of apatosaurus is
  type state_ty is (S0, S1, S2);
signal st : state_ty;
signal p : std_logic;
begin
case st is
when S0 | S1 => p p
P1.3 Flops, Latches, and Combinational Circuitry
For each of the signals p...z in the architecture main of montevido, answer whether the signal is a latch, combinational gate, or flip-flop.
entity montevido is
port (
a, b0, b1, c0, c1, d0, d1, e0, e1 : in std_logic;
l : in std_logic_vector (1 downto 0);
p, q, r, s, t, u, v, w, x, y, z : out std_logic
);
end montevido;
architecture main of montevido is
  signal i, j : std_logic;
begin
i
entity bigckt is
port (
a, b : in std_logic;
c : out std_logic
);
end bigckt;
architecture main of bigckt is
begin
  process (a, b)
  begin
    if (a = '0') then
c
P1.6 Delta-Cycle Simulation: Pong
Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram.
INSTRUCTIONS:
1. The simulation is to be done at the granularity of simulation-steps.
2. Show all changes to process modes and signal values.
3. Each column of the timing diagram corresponds to a simulation step that changes a signal or
process.
4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation
round by writing in the appropriate row a B at the beginning and an E at the end of the cycle
or round.
5. End your simulation just before 20 ns.
architecture main of pong_machine is
signal ping_i, ping_n, pong_i, pong_n : std_logic;
begin
reset_proc: process
reset
P1.8 Clock-Cycle Simulation
Given the VHDL code for anapurna and the waveform diagram below, answer what the values of the signals y, z, and p will be at the given times.
entity anapurna is
port (
clk, reset, sel : in std_logic;
a, b : in unsigned(15 downto 0);
p : out unsigned(15 downto 0)
);
end anapurna;
architecture main of anapurna is
type state_ty is (mango, guava, durian, papaya);
signal y, z : unsigned(15 downto 0);
signal state : state_ty;
begin
  proc_herzog: process
  begin
    top_loop: loop
      wait until (rising_edge(clk));
      next top_loop when (reset = '1');
state
P1.10 VHDL VHDL Behavioural Comparison: Ichthyostega
For each of the VHDL architectures q4a through q4c, does the signal v have the same behaviour as it does in the main architecture of ichthyostega?
NOTES: 1) For full marks, if the code has different behaviour, you must explain
why.
2) Ignore any differences in behaviour in the first few clock cycles that are
caused by initialization of flip-flops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.
entity ichthyostega is
port (
clk : in std_logic;
b, c : in signed(3 downto 0);
v : out signed(3 downto 0)
);
end ichthyostega;
architecture main of ichthyostega is
signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx
P1.11 Waveform VHDL Behavioural Comparison
Answer whether each of the VHDL code fragments q3a through q3d has the same behaviour as the timing diagram.
NOTES: 1) Same behaviour means that the signals a, b, and c have the same values at
the end of each clock cycle in steady-state simulation (ignore any irregularities
in the first few clock cycles).
2) For full marks, if the code does not match, you must explain why.
3) Assume that all signals, constants, variables, types, etc are properly defined
and declared.
4) All of the code fragments are legal, synthesizable VHDL code.
[Figure: timing diagram for signals clk, a, b, and c]
q3a
architecture q3a of q3 is
begin
  process begin
    a
P1.12 Hardware VHDL Comparison
For each of the circuits q2a to q2d, answer whether the signal d has the same behaviour as it does in the main architecture of q2.
entity q2 is
port (
a, clk, reset : in std_logic;
d : out std_logic
);
end q2;
architecture main of q2 is
  signal b, c : std_logic;
begin
  b <= '0' when (reset = '1')
       else a;
  process (clk) begin
    if rising_edge(clk) then
      c <= b;
      d <= c;
    end if;
  end process;
end main;
[Figure: circuits q2a, q2b, q2c, and q2d, each drawn with signals clk, a, reset, d, and a constant 0]
P1.13 8-Bit Register
Implement an 8-bit register that has:
  clock signal clk
  input data vector d
  output data vector q
  synchronous active-high input reset
  synchronous active-high input enable
P1.13.1 Asynchronous Reset
Modify your design so that the reset signal is asynchronous, rather than synchronous.
P1.13.2 Discussion
Describe the tradeoffs in using synchronous versus asynchronous reset in a circuit implemented on
an FPGA.
P1.13.3 Testbench for Register
Write a test bench to validate the functionality of the 8-bit register with synchronous reset.
P1.14 Synthesizable VHDL and Hardware
For each of the fragments of VHDL q4a...q4f, answer whether the code is synthesizable. If the code is synthesizable, draw the circuit most likely to be generated by synthesizing the datapath of
the code. If the code is not synthesizable, explain why.
q4a
process begin
  wait until rising_edge(a);
  e <= d;
  wait until rising_edge(b);
e
P1.15 Datapath Design
Each of the three VHDL fragments q4a to q4c is intended to be the datapath for the same circuit. The circuit is intended to perform the following sequence of operations (not all operations are
required to use a clock cycle):
  read in source and destination addresses from i_src1, i_src2, i_dst
  read operands op1 and op2 from memory
  compute sum of operands sum
  write sum to memory at destination address dst
  write sum to output o_result
[Figure: block diagram with ports i_src1, i_src2, i_dst, o_result, and clk]
P1.15.1 Correct Implementation?
For each of the three fragments of VHDL q4a to q4c, answer whether it is a correct implementation
of the datapath. If the datapath is not correct, explain why. If the datapath is correct, answer in
which cycle you need load='1'.
NOTES:
1. You may choose the number of clock cycles required to execute the sequence of operations.
2. The cycle in which the addresses are on i_src1, i_src2, and i_dst is cycle #0.
3. The control circuitry that controls the datapath will output a signal load, which will be '1' when the sum is to be written into memory.
4. The code fragment with the signal declarations, connections for inputs and outputs, and the
instantiation of memory is to be used for all three code fragments q4a to q4c.
5. The memory has registered inputs and combinational (unregistered) outputs.
6. All of the VHDL is legal, synthesizable code.
-- This code is to be used for
-- all three code fragments q4a--q4c.
signal state : std_logic_vector(3 downto 0);
signal src1, src2, dst, op1, op2, sum,
       mem_in_a, mem_out_a, mem_out_b,
       mem_addr_a, mem_addr_b
       : unsigned(7 downto 0);
...
process (clk)
begin
if rising_edge(clk) then
src1 mem_we,
i_data_a => mem_in_a,
o_data_a => mem_out_a,
o_data_b => mem_out_b);
q4a
op1 0);op2 0);
sum 0);
mem_in_a 0);
mem_addr_a
Chapter 2
RTL Design with VHDL: From
Requirements to Optimized Code
2.1 Prelude to Chapter
2.1.1 A Note on EDA for FPGAs and ASICs
The following is from John Cooley's column "The Industry Gadfly" from 2003/04/30. The title of
this article is: "The FPGA EDA Slums".
For 2001, Dataquest reported that the ASIC market was US$16.6 billion while the
FPGA market was US$2.6 billion.
What's more interesting is that the 2001 ASIC EDA market was US$2.2 billion while
the FPGA EDA market was US$91.1 million. Nope, that's not a mistake. It's ASIC
EDA and billion versus FPGA EDA and million. Do the math and you'll see that for
every dollar spent on an ASIC project, roughly 12 cents of it goes to an EDA vendor.
For every dollar spent on a FPGA project, roughly 3.4 cents goes to an EDA vendor.
Not good.
It's the old "free milk and a cow" story according to Gary Smith, the Senior EDA
Analyst at Dataquest. "Altera and Xilinx have fowled their own nest. Their free tools
spoil the FPGA EDA market," says Gary. "EDA vendors know that there's no money
to be made in FPGA tools."
2.2 FPGA Background and Coding Guidelines
2.2.1 Generic FPGA Hardware
2.2.1.1 Generic FPGA Cell
Cell = Logic Element (LE) in Altera
= Configurable Logic Block (CLB) in Xilinx
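[Figure: generic FPGA cell, containing a combinational block (comb) and a flip-flop with CE, S, R, D,
and Q pins. Cell signals: comb_data_in, ctrl_in, carry_in, carry_out, comb_data_out, flop_data_in,
flop_data_out.]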
2.2.2 Area Estimation
We estimate the number of FPGA cells required for a design by counting the number of flip-flops
and primary inputs that are in the fanin of each flip-flop. Only flip-flops count, because
combinational signals are collapsed into the circuitry within an FPGA cell. The circuitry for any
flip-flop signal with up to four source flip-flops can be implemented on a single FPGA cell. If a
flip-flop signal is dependent upon five source flip-flops, then two FPGA cells are required.
Source flops/inputs Minimum cells
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 3
11 4
For a single target signal, this technique gives a lower bound on the number of cells needed. For
example, some functions of seven inputs require more than two cells. As a particular example, a
four-to-one multiplexer has six inputs and requires three cells.
When dealing with multiple target signals, this technique might be an overestimate, because a
single cell can drive several other cells (common subexpression elimination).
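The table above can be captured in a small function. The following is only a sketch: the name min_cells and the testbench around it are hypothetical, and the formula assumes that the first cell absorbs up to four sources and that every additional cell contributes three new sources because one of its four inputs is taken by the output of the previous cell. Under that assumption the function reproduces every row of the table.

```vhdl
-- Hypothetical helper (not part of the course code): lower bound on the
-- number of generic 4-input FPGA cells needed for one flip-flop signal.
entity min_cells_tb is
end entity;

architecture sim of min_cells_tb is
  -- Assumption: the first cell takes up to 4 sources; each further cell
  -- adds 3 new sources, since 1 input carries the previous cell's output.
  function min_cells (num_sources : positive) return positive is
  begin
    if num_sources <= 4 then
      return 1;
    else
      -- 1 + ceiling((num_sources - 4) / 3), using integer division
      return 1 + (num_sources - 4 + 2) / 3;
    end if;
  end function;
begin
  process
  begin
    -- Spot checks against rows of the table
    assert min_cells(1)  = 1 report "1 source"   severity failure;
    assert min_cells(4)  = 1 report "4 sources"  severity failure;
    assert min_cells(5)  = 2 report "5 sources"  severity failure;
    assert min_cells(7)  = 2 report "7 sources"  severity failure;
    assert min_cells(8)  = 3 report "8 sources"  severity failure;
    assert min_cells(10) = 3 report "10 sources" severity failure;
    assert min_cells(11) = 4 report "11 sources" severity failure;
    wait;
  end process;
end architecture;
```

As the text notes, this is a lower bound for a single target signal; a four-to-one multiplexer (six inputs) needs three cells even though the function returns two.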
PLA and Flop for Different Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: generic FPGA cell (comb block and flip-flop with CE, S, R, D, Q) configured so that the
combinational block and the flip-flop serve different functions]
PLA and Flop for Same Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: generic FPGA cell (comb block and flip-flop with CE, S, R, D, Q) configured so that the
combinational block and the flip-flop serve the same function]
PLA and Flop for Same Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: generic FPGA cell (comb block and flip-flop with CE, S, R, D, Q) configured so that the
combinational block and the flip-flop serve the same function]
Estimate Area for Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Map the combinational circuits below onto generic FPGA cells.
[Figure: three example circuits, each shown with its mapping onto generic FPGA cells. The first
(inputs a, b, c, d; output z) is mapped onto one cell, the second (inputs a through i; signals x, y, z)
onto two cells, and the third (which adds signal w) onto three cells.]
2.2.2.1 Interconnect for Generic FPGA
Note: In these slides, the space between tightly grouped wires sometimes
disappears, making a group of wires appear to be a single large wire.
There are two types of wires that connect a cell to the rest of the chip:
General purpose interconnect (configurable, slow)
Carry chains and cascade chains (vertically adjacent cells, fast)
2.2.2.2 Blocks of Cells for Generic FPGA
Cells are organized into blocks. There is a great deal of interconnect (wires) between cells within
a single block. In large FPGAs, the blocks are organized into larger blocks. These large blocks
might themselves be organized into even larger blocks. Think of an FPGA as a bunch of nested
for-generate statements that replicate a single component (cell) hundreds of thousands of
times.
Cells not used for computation can be used as wires to shorten length of path between cells.
2.2.4 Altera APEX20K Information and Coding Guidelines
APEX20K Block Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Chip
  52 Mega Logic Array Blocks (MegaLABs)
    1 Embedded System Block (ESB)
      Memory and wide combinational functions
    16 Logic Array Blocks (LABs)
      10 Logic Elements (LEs)
        4-input lookup table
        Carry and cascade
        Flip-flop
Each level of hierarchy has its own interconnect (wires).
LE Computation and Storage . . . . . . . . .
4-input lookup table (LUT)
Carry-chain computation circuitry
Cascade-chain computation circuitry
Flip-flop with load, clear, clock-enable
LE Interconnect . . . . . . . . . . . . . . . . . . . . . .
4 data inputs, 2 data outputs
Carry in, carry out
Cascade in, cascade out
Clock, clock-enable
Async clear, synch set (load), synch clear (reset)
Global reset
Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The Altera APEX20K chips initialize all flip flops to 0 at startup. To mimic this behaviour in
simulation, you should put an initial value of 0 on all flip flops. If you are doing your own
encoding for a state machine, choose the reset state to be encoded as all zeroes.
You should not put initial values on inputs or combinational signals.
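A minimal sketch of this guideline follows. The entity init_example and its signal names are hypothetical, chosen only for illustration: the flip-flop signal gets an initial value of '0', while the input port and the combinational signal get none.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, used only to illustrate where initial values go.
entity init_example is
  port (
    clk : in  std_logic;
    d   : in  std_logic;   -- input: no initial value
    q   : out std_logic
  );
end entity;

architecture main of init_example is
  signal q_reg  : std_logic := '0';  -- flip-flop: initial value mimics startup
  signal d_comb : std_logic;         -- combinational: no initial value
begin
  d_comb <= not d;                   -- combinational logic

  process (clk)
  begin
    if rising_edge(clk) then
      q_reg <= d_comb;               -- q_reg is the only flip-flop
    end if;
  end process;

  q <= q_reg;
end architecture;
```

In simulation q_reg starts at '0' rather than 'U', which matches the startup behaviour described above.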
2.3 Design Flow
2.3.1 Generic Design Flow
Most people agree on the general terminology and process for a digital hardware design flow.
However, each book and course has its own particular way of presenting the ideas. Here we will
lay out the consistent set of definitions that we will use in E&CE 427. This might be different from
what you have seen in other courses or on a work term. Focus on the ideas and you will be fine
both now and in the future.
The design flow presented here focuses on the artifacts that we work with, rather than the opera-
tions that are performed on the artifacts. This is because the same operations can be performed at
different points in the design flow, while the artifacts each have a unique purpose.
[Figure: design flow through the artifacts Requirements, Algorithm, High-Level Model, DP+Ctrl Code,
Opt. RTL Code, Implementation, and Hardware, with Analyze/Modify loops between artifacts. Part of
the flow is marked "dp/ctrl specific".]
Figure 2.1: Generic Design Flow
Storage
Purpose: hold data for future use
Data is not modified while stored
Examples: register files, FIFO queues
Control
Purpose: modify internal state based on inputs, compute outputs from state and inputs
Mostly individual signals, few data (vectors)
Examples: bus arbiters, memory-controllers
All three classes of circuits (datapath, control, and storage) follow the same generic design flow
(Figure 2.1) and use dataflow diagrams, hardware block diagrams, and state machines. The
differences in the design flows appear in the relative amount of effort spent on each type of
description and the order in which the different descriptions are used. The differences are most
pronounced in the transition from the high-level model to the model that separates the datapath and
control circuitry.
2.3.3.2 Datapath-Centric Design Flow
[Figure: High-Level Model, Dataflow, Block Diagram and State Machine, and DP+Ctrl RTL Code, with
Analyze/Modify loops]
Figure 2.2: Datapath-Centric Design Flow
2.3.3.3 Control-Centric Design Flow
[Figure: High-Level Model, State Machine, Dataflow Diagram, Block Diagram, and DP+Ctrl RTL Code,
with Analyze/Modify loops]
Figure 2.3: Control-Centric Design Flow
2.3.3.4 Storage-Centric Design Flow
In E&CE 427, we won't be discussing storage-centric design. Storage-centric design differs from
datapath- and control-centric design in that storage-centric design focusses on building many repli-
cated copies of small cells.
Storage-centric designs include a wide range of circuits, from simple memory arrays to compli-
cated circuits such as register files, translation lookaside buffers, and caches. The complicated
circuits can contain large and very intricate state machines, which would benefit from some of the
techniques for control-centric circuits.
2.4 Algorithms and High-Level Models
For designs with significant control flow, algorithms can be described in software languages, flow-
charts, abstract state machines, algorithmic state machines, etc.
For designs with trivial control flow (e.g. every parcel of input data undergoes the same computa-
tion), data-dependency graphs (section 2.4.2) are a good way to describe the algorithm.
For designs with a small amount of control flow (e.g. a microprocessor, where a single decision is
made based upon the opcode) a set of data-dependency graphs is often a good choice.
Software executes in series; hardware executes in parallel.
When creating an algorithmic description of your hardware design, think about how you can repre-
sent parallelism in the algorithmic notation that you are using, and how you can exploit parallelism
to improve the performance of your design.
2.4.1 Flow Charts and State Machines
Flow charts and various flavours of state machines are covered well in many courses. Generally
everything that you've learned about these forms of description is also applicable in hardware
design.
In addition, you can exploit parallelism in state machine design to create communicating finite state
machines. A single complex state machine can be factored into multiple simple state machines that
operate in parallel and communicate with each other.
2.4.2 Data-Dependency Graphs
In software, the expression: (((((a + b) + c) + d) + e) + f) takes the same amount
of time to execute as: ( a + b ) + ( c + d ) + ( e + f ) .
But, remember: hardware runs in parallel. In algorithmic descriptions, parentheses can guide
parallel vs serial execution.
Data-dependency graphs capture algorithms of datapath-centric designs.
Datapath-centric designs have few, if any, control decisions: every parcel of input data undergoes
the same computation.
                         Serial                   Parallel
Expression               (((((a+b)+c)+d)+e)+f)    (a+b)+(c+d)+(e+f)
Adders on longest path   5 (slower)               3 (faster)
Adders used              5 (equal area)           5 (equal area)
[Figure: data-dependency graphs for the two expressions, each with inputs a, b, c, d, e, f and five
adders: a chain for the serial form and a tree for the parallel form]
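The two groupings can be written directly in VHDL. This is a sketch only: the entity name add6 and the 8-bit width are assumptions made for the example, and a synthesis tool is free to restructure either expression, so the parentheses are a guide rather than a guarantee.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity comparing the serial and parallel groupings.
entity add6 is
  port (
    a, b, c, d, e, f     : in  unsigned(7 downto 0);
    z_serial, z_parallel : out unsigned(7 downto 0)
  );
end entity;

architecture main of add6 is
begin
  -- Chain: each adder waits for the previous one (5 adders on longest path)
  z_serial   <= ((((a + b) + c) + d) + e) + f;

  -- Tree: a+b, c+d, and e+f are independent (3 adders on longest path)
  z_parallel <= ((a + b) + (c + d)) + (e + f);
end architecture;
```

Both outputs carry the same value (modulo 2^8, since the result width equals the operand width) and both use five adders; only the depth of the longest path differs.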
2.4.3 High-Level Models
There are many different types of high-level models, depending upon the purpose of the model
and the characteristics of the design that the model describes. Some models may capture power
consumption, others performance, others data functionality.
High-level models are used to estimate the most important design metrics very early in the design
cycle. If power consumption is more important than performance, then you might write high-level
models that can predict the power consumption of different design choices, but which have
no information about the number of clock cycles that a computation takes, or which predict the
latency inaccurately. Conversely, if performance is important, you might write clock-cycle accurate
high-level models that do not contain any information about power consumption.
Conventionally, performance has been the primary design metric. Hence, high-level models that
predict performance are more prevalent and better understood than other types of high-level
models. There are many research and entrepreneurial opportunities for people who can develop
tools and/or languages for high-level models for estimating power, area, maximum clock speed,
etc.
In E&CE 427 we will limit ourselves to the well-understood area of high-level models for
performance prediction.
As with all topics in E&CE 427, there are tradeoffs between these different styles of writing state
machines. Most books teach only the explicit-current+next style. This style is the style closest to
the hardware, which means that it is more amenable to optimization through human intervention,
rather than relying on a synthesis tool for optimization. The advantage of the implicit style is
that it is concise and readable for control flows consisting of nested loops and branches (e.g.
the type of control flow that appears in software). For control flows that have less structure, it
can be difficult to write an implicit state machine. Very few books or synthesis manuals describe
multiple-wait statement processes, but they are relatively well supported among synthesis tools.
Because implicit state machines are written with loops, if-then-elses, cases, etc., it is difficult to
write some state machines with complicated control flows in an implicit style. The following
example illustrates the point.
[Figure: state diagram with states s0/0, s1/1, s2/0, and s3/0 and four transitions labelled a, !a, !a,
and a]
Note: The terminology of explicit and implicit is somewhat standard,
in that some descriptions of processes with multiple wait statements describe
the processes as having implicit state machines.
There is no standard terminology to distinguish between the two explicit styles:
explicit-current+next and explicit-current.
2.5.2 Implementing a Simple Moore Machine
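[Figure: state diagram for the simple Moore machine, with states s0/0, s1/1, s2/0, and s3/0 and
transitions labelled a and !a]
entity simple is
  port (
    a, clk : in std_logic;
    z : out std_logic
  );
end simple;
2.5.2.1 Implicit Moore State Machine
architecture moore_implicit of simple is
begin
  process
  begin
    z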
architecture moore_explicit_v1 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state
architecture moore_explicit_v3 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      state
Mealy machines have a combinational path from inputs to outputs, which often violates good
coding guidelines for hardware. Thus, Moore machines are much more common. You should
know how to write a Mealy machine if needed, but most of the state machines that you design will
be Moore machines.
This is the same entity as for the simple Moore state machine. The behaviour of the Mealy machine
is the same as the Moore machine, except for the timing relationship between the output (z) and
the input (a).
[Figure: state diagram for the Mealy machine, with states s0, s1, s2, and s3 and transition labels
a/1, !a/0, /0, and /0]
entity simple is
  port (
    a, clk : in std_logic;
    z : out std_logic
  );
end simple;
Note: An implicit Mealy state machine is nonsensical.
In an implicit state machine, we do not have a state signal. But, as the example below illustrates,
to create a Mealy state machine we must have a state signal.
An implicit style is a nonsensical choice for Mealy state machines. Because the output is depen-
dent upon the input in the current clock cycle, the output cannot be a flop. For the output to be
combinational and dependent upon both the current state and the current input, we must create a
state signal that we can read in the assignment to the output. Creating a state signal obviates the
advantages of using an implicit style of state machine.
architecture implicit_mealy of simple is
type state_ty is (s0, s1, s2, s3);
signal state : state_ty;
begin
process
begin
state
architecture mealy_explicit of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state
All circuits should have a reset signal that puts the circuit back into a good initial state. However,
not all flip flops within the circuit need to be reset. In a circuit that has a datapath and a state
machine, the state machine will probably need to be reset, but the datapath may not need to be reset.
There are standard ways to add a reset signal to both explicit and implicit state machines.
It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or
your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.
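For an explicit state machine, testing reset on every clock cycle amounts to wrapping the case statement in an if-then-else. The sketch below makes several assumptions: the entity simple_rst is a hypothetical variant of simple with an added synchronous reset input, and the transitions (s0 to s1 on a, s0 to s2 on !a, s1 and s2 to s3, s3 back to s0) are one reading of the simple Moore machine's state diagram.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: `simple` with an added synchronous reset input.
entity simple_rst is
  port (
    a, clk, reset : in  std_logic;
    z             : out std_logic
  );
end entity;

architecture moore_explicit of simple_rst is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty := s0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= s0;                 -- reset is tested on every clock edge
      else
        case state is
          when s0 =>
            if a = '1' then
              state <= s1;           -- assumed transition on a
            else
              state <= s2;           -- assumed transition on !a
            end if;
          when s1 | s2 => state <= s3;
          when s3      => state <= s0;
        end case;
      end if;
    end if;
  end process;

  -- Moore output: depends on the state only
  z <= '1' when state = s1 else '0';
end architecture;
```

Because the reset test sits outside the case statement, it takes priority over every state transition and cannot be missed, whatever state the machine is in.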
Reset with Implicit State Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
With an implicit state machine, we need to insert a loop in the process and test for reset after each
wait statement.
Here is the implicit Moore machine from section 2.5.2.1 with reset code added in bold.
architecture moore_implicit of simple is
begin
process
begin
init : loop -- outermost loop
z
Tradeoffs in Encoding Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Gray is good for low-power applications where consecutive data values typically differ by 1 (e.g.
no random jumps).
One-hot usually has less combinational logic and runs faster than binary for machines with up
to a dozen or so states. With more than a dozen states, the extra flip-flops required by one-hot
encoding become too expensive.
Custom is great if you have lots of time and are incredibly intelligent, or have deep insight into
the guts of your design.
Note: Don't-care values. When we don't care what the value of a signal is, we
assign the signal '-', which is "don't care" in VHDL. This should allow the
synthesis tool to use whatever value is most helpful in simplifying the Boolean
equations for the signal (e.g. Karnaugh maps). In the past, some groups in
E&CE 427 have used '-' quite successfully to decrease the area of their design.
However, a few groups found that using '-' increased the size of their design,
when they were expecting it to decrease the size. So, if you are tweaking your
design to squeeze out the last few unneeded FPGA cells, pay close attention as
to whether using '-' hurts or helps.
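A small sketch of a don't-care assignment follows. The entity dc_example and the assumption that sel = "11" never occurs are hypothetical, made up for this illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: a 2-bit recoder where sel = "11" is assumed unused.
entity dc_example is
  port (
    sel : in  std_logic_vector(1 downto 0);
    y   : out std_logic_vector(1 downto 0)
  );
end entity;

architecture main of dc_example is
begin
  with sel select
    y <= "01" when "00",
         "10" when "01",
         "11" when "10",
         "--" when others;  -- don't care: the tool may pick any value
end architecture;
```

In simulation the others branch drives '-' onto y, so any logic that reads y in that case will see the don't-care value rather than 0 or 1. As noted above, compare the synthesized area with and without the '-' before relying on it.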
2.6 Dataflow Diagrams
2.6.1 Dataflow Diagrams Overview
Dataflow diagrams are data-dependency graphs where the computation is divided into clock cycles.
Purpose:
Provide a disciplined approach for designing datapath-centric circuits
Guide the design from algorithm, through high-level models, and finally to register transfer
level code for the datapath and control circuitry.
Estimate area and performance
Make tradeoffs between different design options
Background:
Based on techniques from high-level synthesis tools
Some similarity between high-level synthesis and software compilation
Each dataflow diagram corresponds to a basic block in software compiler terminology.
[Figure: chain of five adders computing z from inputs a, b, c, d, e, f through intermediate signals
x1, x2, x3, and x4]
Data-dependency graph for z = a + b + c + d + e + f
[Figure: the same chain of five adders (inputs a, b, c, d, e, f; intermediate signals x1 to x4;
output z), divided into clock cycles]
Dataflow diagram for z = a + b + c + d + e + f
[Figure: dataflow diagram for the adder chain (inputs a, b, c, d, e, f; intermediate signals x1 to x4;
output z)]
Horizontal lines mark clock cycle boundaries.
The use of memory arrays in dataflow diagrams is described in section 2.7.4.
2.6.2 Dataflow Diagrams, Hardware, and Behaviour
Primary Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: Dataflow Diagram, Hardware, and Behaviour panels for input i and signal x; the waveform
shows clk, i, and x]
Register Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: Dataflow Diagram, Hardware, and Behaviour panels for input i and signal x; the waveform
shows clk, i, and x]
Register Signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: Dataflow Diagram, Hardware, and Behaviour panels for x computed by an adder from i1 and
i2; the waveform shows clk, i1, i2, and x]
Combinational-Component Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: Dataflow Diagram, Hardware, and Behaviour panels for x computed by an adder from i1 and
i2; the waveform shows clk, i1, i2, and x]
2.6.3 Area Estimation
Maximum number of blocks in a clock cycle is the total number of that component that are needed
Maximum number of signals that cross a cycle boundary is the total number of registers that are
needed
Maximum number of unconnected signal tails in a clock cycle is the total number of inputs that
are needed
Maximum number of unconnected signal heads in a clock cycle is the total number of outputs
that are needed
The information above is only for estimating the number of components that are needed. In fact,
these estimates give lower bounds. There might be constraints on your design that will force you
to use more components (e.g., you might need to read all of your inputs at the same time).
Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath
components, might force you to make tradeoffs that increase the number of datapath components
to decrease the overall area of the circuit.
Of particular relevance to FPGAs:
With some FPGA chips, a 2:1 multiplexer has the same area as an adder.
With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell
per bit.
In FPGAs, registers are usually free, in that the area consumed by a circuit is limited by the
amount of combinational logic, not the number of flip-flops.
In comparison, with ASICs and custom VLSI, 2:1 multiplexers are much smaller than adders, and
registers are quite expensive in area.
2.6.4 Dataflow Diagram Execution
Execution with Registers on Both Inputs and Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram for the adder chain (inputs a to f, signals x1 to x5, output z) with clock
cycles numbered 0 to 6, and a waveform of clk, a, x1, x2, x3, x4, x5, and z over clock cycles 0 to 6]
Execution Without Output Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
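[Figure: dataflow diagram for the adder chain (inputs a to f, signals x1 to x5, output z) with clock
cycles numbered 0 to 5, and a waveform of clk, a, x1, x2, x3, x4, x5, and z over clock cycles 0 to 6]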
2.6.5 Performance Estimation
Performance Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Performance ∝ 1 / TimeExec
TimeExec = Latency × ClockPeriod
Latency = Number of clock cycles from inputs to outputs
There is much more information on performance in chapter 4, which is devoted to performance.
Performance of Dataflow Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Latency: count horizontal lines in diagram
Min clock period (Max clock speed) limited by longest path in a clock cycle
2.6.6 Design Analysis
[Figure: dataflow diagram for z = a + b + c + d + e + f with one add per clock cycle]
num inputs         6
num outputs        1
num registers      6
num adders         1
min clock period   delay through flop and one adder
latency            6 clock cycles
2.6.7 Area / Performance Tradeoffs
[Figure: two dataflow diagrams for the same sum, side by side: one add per clock cycle (clock
cycles 0 to 6) and two adds per clock cycle (clock cycles 0 to 4)]
Note: In the Two-add design, half of the last clock cycle is wasted.
Two Adds per Clock Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram with two adds per clock cycle (clock cycles 0 to 4), and a waveform of
clk, a, x1, x2, x3, x4, x5, and z over clock cycles 0 to 6]
Design Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: the two dataflow diagrams side by side: one add per clock cycle (clock cycles 0 to 6) and
two adds per clock cycle (clock cycles 0 to 4)]
               One add per clock cycle   Two adds per clock cycle
inputs         6                         6
outputs        1                         1
registers      6                         6
adders         1                         2
clock period   flop + 1 add              flop + 2 add
latency        6                         4
Question: Under what circumstances would each design option be fastest?
Answer:
time = latency * clock period
Compare execution times for both options (Tf = flop delay, Ta = adder delay):
T1 = 6 (Tf + Ta)
T2 = 4 (Tf + 2 Ta)
One-add will be faster when T1 < T2:
6 (Tf + Ta) < 4 (Tf + 2 Ta)
6 Tf + 6 Ta < 4 Tf + 8 Ta
2 Tf < 2 Ta
Tf < Ta
Sanity check: If add is slower than flop, then we want to minimize the number of
adds. One-add has fewer adds, so one-add will be faster when add is slower
than flop.
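The comparison can be checked numerically with VHDL's built-in time type. The delay values below (1 ns for a flop, 4 ns for an adder) are assumptions picked purely for illustration, and the testbench name perf_compare_tb is hypothetical.

```vhdl
-- Hypothetical self-checking testbench; delay values are assumed.
entity perf_compare_tb is
end entity;

architecture sim of perf_compare_tb is
  constant TF : time := 1 ns;                -- assumed flip-flop delay
  constant TA : time := 4 ns;                -- assumed adder delay
  constant T1 : time := 6 * (TF + TA);       -- one add per cycle, latency 6
  constant T2 : time := 4 * (TF + 2 * TA);   -- two adds per cycle, latency 4
begin
  process
  begin
    assert T1 = 30 ns report "T1 mismatch" severity failure;
    assert T2 = 36 ns report "T2 mismatch" severity failure;
    -- One-add is faster exactly when the flop is faster than the adder
    assert (T1 < T2) = (TF < TA) report "rule violated" severity failure;
    wait;
  end process;
end architecture;
```

With these numbers the adder is slower than the flop, so the one-add design finishes first (30 ns versus 36 ns). Swapping the two delay values reverses the outcome, and the last assertion still holds.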
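2.7 Memory Arrays and RTL Design
2.7.1 Memory Operations
Read of Memory with Registered Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: Dataflow Diagram, Hardware, and Behaviour panels. A mem(rd) node takes memory M and
address a and produces d; the hardware is a memory M with WE, A, DI, and DO ports and a clock;
the waveform shows clk, a, we, and do, with M(a) appearing on do.]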
Write to Memory with Registered Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
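[Figure: Dataflow Diagram, Hardware, and Behaviour panels. A mem(wr) node takes memory M,
address a, and data di and produces the updated M; the hardware is a memory M with WE, A, DI,
and DO ports and a clock; the waveform shows clk, a, we, di, M(a), and do.]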
Dual-Port Memory with Registered Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
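[Figure: dataflow diagram, hardware, and waveform for a dual-port memory M. A mem(wr) node uses
a0 and di0 and a mem(rd) node uses a1 to produce do1; the hardware has ports WE, A0, DI0, DO0,
A1, and DO1; the waveform shows clk, we, a0, di0, M(a), do0, a1, and do1.]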
Sequence of Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram and waveform for a sequence of operations on a dual-port memory M: a
write (mem(wr) with a0 and di0) together with reads (mem(rd)) that produce do0 and do1; the
waveform shows clk, we, a0, di0, do0, a1, do1, and M(a)]
2.7.2 Memory Arrays in VHDL
2.7.2.1 Using a Two-Dimensional Array for Memory
A memory array can be written in VHDL as a two-dimensional array:
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array (natural range <>) of data;
  signal mem : data_vector(31 downto 0);
These two-dimensional arrays can be useful in high-level models and in specifications. However,
it is possible to write code using a two-dimensional array that cannot be synthesized. Also, some
synthesis tools (including Synopsys Design Compiler and FPGA Compiler) will synthesize
two-dimensional arrays very inefficiently.
The example below illustrates: lack of interface protocol, combinational write, multiple write
ports, multiple read ports.
architecture main of mem_not_hw is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array (natural range <>) of data;
  signal mem : data_vector(31 downto 0);
begin
  y
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array (natural range <>) of data;
end;
entity mem is
  port (
    clk : in std_logic;
    we  : in std_logic;             -- write enable
    a   : in unsigned(4 downto 0);  -- address
    di  : in data;                  -- data_in
    do  : out data                  -- data_out
  );
end mem;
architecture main of mem is
  signal mem : data_vector(31 downto 0);
begin
  do
needs, you can construct your own component from smaller ones.
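[Figure: two N×W memories (each with WE, A, DI, and DO ports) side by side, sharing WriteEn, Addr,
and Clk. One takes DataIn[W-1..0] and drives DataOut[W-1..0]; the other takes DataIn[2W-1..W] and
drives DataOut[2W-1..W].]
Figure 2.4: An N×2W memory from N×W components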
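[Figure: two N×W memories (each with WE, A, DI, and DO ports) sharing WriteEn, Addr[logN-1..0],
DataIn, and Clk. A 2:1 multiplexer controlled by Addr[logN] selects which memory drives DataOut.]
Figure 2.5: A 2N×W memory from N×W components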
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity ram16x4s is
  port (
    clk, we  : in std_logic;
    data_in  : in std_logic_vector(3 downto 0);
    addr     : in unsigned(3 downto 0);
    data_out : out std_logic_vector(3 downto 0)
  );
end ram16x4s;
architecture main of ram16x4s is
  component ram16x1s
    port (
      d              : in std_logic;  -- data in
      a3, a2, a1, a0 : in std_logic;  -- address
      we             : in std_logic;  -- write enable
      wclk           : in std_logic;  -- write clock
      o              : out std_logic  -- data out
    );
  end component;
begin
  mem_gen:
  for i in 0 to 3 generate
    ram : ram16x1s
      port map (
        we   => we,
        wclk => clk,
        ----------------------------------------------
        -- d and o are dependent on i
        a3 => addr(3), a2 => addr(2),
        a1 => addr(1), a0 => addr(0),
        d  => data_in(i),
        o  => data_out(i)
        ----------------------------------------------
      );
  end generate;
end main;
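The same idea extends to making a deeper memory, in the style of Figure 2.5. The following is a sketch under stated assumptions: the entity name ram32x4s is hypothetical, the write-enable gating on the top address bit is an assumption not visible in the figure, and the output multiplexer uses the current address directly, which is only correct if the underlying memory has a combinational (unregistered) read path.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 32x4 memory built from two ram16x4s components.
entity ram32x4s is
  port (
    clk, we  : in  std_logic;
    data_in  : in  std_logic_vector(3 downto 0);
    addr     : in  unsigned(4 downto 0);
    data_out : out std_logic_vector(3 downto 0)
  );
end entity;

architecture main of ram32x4s is
  component ram16x4s
    port (
      clk, we  : in  std_logic;
      data_in  : in  std_logic_vector(3 downto 0);
      addr     : in  unsigned(3 downto 0);
      data_out : out std_logic_vector(3 downto 0)
    );
  end component;
  signal we_lo, we_hi : std_logic;
  signal do_lo, do_hi : std_logic_vector(3 downto 0);
begin
  -- Assumption: only the half selected by addr(4) is written
  we_lo <= we and not addr(4);
  we_hi <= we and addr(4);

  lo : ram16x4s
    port map (clk => clk, we => we_lo, data_in => data_in,
              addr => addr(3 downto 0), data_out => do_lo);
  hi : ram16x4s
    port map (clk => clk, we => we_hi, data_in => data_in,
              addr => addr(3 downto 0), data_out => do_hi);

  -- Assumption: combinational read, so the current addr(4) picks the output
  data_out <= do_hi when addr(4) = '1' else do_lo;
end architecture;
```

If the component registered its address internally, the multiplexer select would need to be delayed by one clock cycle to stay aligned with the data.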
2.7.2.6 Dual-Ported Memory
Dual-ported memory is similar to single-ported memory, except that it allows two simultaneous
reads, or a simultaneous read and write.
When doing a simultaneous read and write to the same address, the read will usually not see the
data currently being written.
Question: Why do dual-ported memories usually not support writes on both ports?
Answer:
What should your memory do if you write different values to the same
address in the same clock cycle?
2.7.3 Data Dependencies
Definition of Three Types of Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
There are three types of data dependencies. The names come from pipeline terminology in
computer architecture.
Read after Write (True dependency):    M[i] := ...    then    ... := M[i]
Write after Write (Load dependency):   M[i] := ...    then    M[i] := ...
Write after Read (Anti dependency):    ... := M[i]    then    M[i] := ...
Instructions in a program can be reordered, so long as the data dependencies are preserved.
Purpose of Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
W0:  R3 := ...
W1:  R3 := ...            (producer)
R1:  ... := ... R3 ...    (consumer)
W2:  R3 := ...
WAW ordering prevents W0 from happening after W1.
WAR ordering prevents W2 from happening before R1.
RAW ordering prevents R1 from happening before W1.
Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose
in ensuring that producer-consumer relationships are preserved.
Ordering of Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Initial memory contents: M[3] = 30, M[2] = 20, M[1] = 10, M[0] = 0
Initial Program with Dependencies:
  M[2] := 21
  M[3] := 31
  A := M[2]
  B := M[0]
  M[3] := 32
  M[0] := 01
  C := M[3]
[Figure: the initial program annotated with its dependencies, and two reorderings of it labelled
"Valid Modification" and "Valid (or Bad?) Modification"]
Answer:
Bad modification: M[3] := 32 must happen before C := M[3].
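The effect of reordering can be checked by running the program with variables. In the sketch below, the testbench name dep_order_tb and the particular reorderings are hypothetical and are not the ones drawn in the figure; the first reordering keeps every RAW, WAW, and WAR dependency, while the second moves C := M[3] ahead of M[3] := 32.

```vhdl
-- Hypothetical self-checking testbench for the program above.
entity dep_order_tb is
end entity;

architecture sim of dep_order_tb is
  type mem_ty is array (0 to 3) of integer;
begin
  process
    variable m       : mem_ty := (0 => 0, 1 => 10, 2 => 20, 3 => 30);
    variable a, b, c : integer;
  begin
    -- Original order
    m(2) := 21;
    m(3) := 31;
    a := m(2);    -- RAW with m(2) := 21
    b := m(0);
    m(3) := 32;   -- WAW with m(3) := 31
    m(0) := 1;    -- WAR with b := m(0)
    c := m(3);    -- RAW with m(3) := 32
    assert a = 21 and b = 0 and c = 32 report "original" severity failure;
    assert m = mem_ty'(1, 10, 21, 32) report "original mem" severity failure;

    -- A reordering that preserves every dependency: same results
    m := (0, 10, 20, 30);
    m(3) := 31;
    b := m(0);
    m(2) := 21;
    m(0) := 1;
    a := m(2);
    m(3) := 32;
    c := m(3);
    assert a = 21 and b = 0 and c = 32 report "reordered" severity failure;
    assert m = mem_ty'(1, 10, 21, 32) report "reordered mem" severity failure;

    -- A reordering that breaks RAW: c is read before m(3) := 32
    m := (0, 10, 20, 30);
    m(2) := 21;
    m(3) := 31;
    a := m(2);
    b := m(0);
    c := m(3);    -- reads the stale value 31
    m(3) := 32;
    m(0) := 1;
    assert c = 31 report "expected stale read" severity failure;
    wait;
  end process;
end architecture;
```

The last block shows why the modification is bad: the memory ends up with the same contents, but C receives 31 instead of 32.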
150 CHAPTER 2. RTL DESIGN WITH VHDL
2.7.4 Memory Arrays and Dataflow Diagrams
Legend for Dataflow Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
name          name           name           name(rd)      name(wr)
Input port    Output port    State signal   Array read    Array write
Basic Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Dataflow diagram, memory read: mem(rd) consumes mem and addr; it produces data and, through an anti-dependency arrow, mem]
[Dataflow diagram, memory write: mem(wr) consumes mem, data, and addr; it produces mem]
data := mem[addr];        mem[addr] := data;
Memory Read               Memory Write
Dataflow diagrams show the dependencies between operations. The basic memory operations are
similar, in that each arrow represents a data dependency.
There are a few aspects of the basic memory operations that are potentially surprising:
The anti-dependency arrow producing mem on a read.
Reads and writes are dependent upon the entire previous value of the memory array.
The write operation appears to produce an entire memory array, rather than just updating an individual element of an existing array.
Normally, we think of a memory array as stationary. To do a read, an address is given to the array
and the corresponding data is produced. In dataflow diagrams, it may be somewhat surprising to
see the read and write operations consuming and producing memory arrays.
Our goal is to support memory operations in dataflow diagrams. We want to model memory operations
similarly to datapath operations. When we do a read, the data that is produced is dependent
upon the contents of the memory array and the address. For write operations, the apparent
dependency on, and production of, an entire memory array is because we do not know which address
in the array will be read from or written to. The anti-dependency for memory reads is related to
Write-after-Read dependencies, as discussed in Section 2.7.3. There are optimizations that can be
performed when we know the address (Section 2.7.4).
Dataflow Diagrams and Data Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Algo: mem[wr_addr] := data_in;
      data_out := mem[rd_addr];
[Dataflow diagram: nodes mem(wr) and mem(rd); signals data_in, wr_addr, rd_addr, data_out, mem]
Read after Write
Algo: mem[wr_addr] := data_in;
      data_out := mem[rd_addr];
[Dataflow diagram: nodes mem(wr) and mem(rd); signals data_in, wr_addr, rd_addr, data_out, mem]
Optimization when rd_addr ≠ wr_addr
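A small Python model may help make the "consume an array, produce an array" view concrete. It is a sketch under stated assumptions, not the notes' own code: mem_wr, mem_rd, write_then_read, and read_beside_write are hypothetical names, and the memory is modelled as an immutable tuple indexed by address. The model shows that taking the read from the array as it was before the write gives the same result only when the two addresses differ.

```python
from typing import Tuple

Mem = Tuple[int, ...]  # an immutable snapshot of the whole array

def mem_wr(mem: Mem, addr: int, data: int) -> Mem:
    """Write: consumes a whole array and produces a whole new array."""
    return mem[:addr] + (data,) + mem[addr + 1:]

def mem_rd(mem: Mem, addr: int) -> Tuple[int, Mem]:
    """Read: produces the data and passes the array on unchanged."""
    return mem[addr], mem

def write_then_read(mem: Mem, wr_addr: int, data_in: int, rd_addr: int):
    """Read after write, in program order."""
    mem = mem_wr(mem, wr_addr, data_in)
    data_out, mem = mem_rd(mem, rd_addr)
    return data_out, mem

def read_beside_write(mem: Mem, wr_addr: int, data_in: int, rd_addr: int):
    """Hypothetical reordering: the read uses the array from before the write."""
    data_out, _ = mem_rd(mem, rd_addr)
    return data_out, mem_wr(mem, wr_addr, data_in)

initial: Mem = (0, 10, 20, 30)  # index is the address: M[0]=0 ... M[3]=30

# Different addresses: both orders agree, so the read need not wait for the write.
same_result = write_then_read(initial, 2, 21, 0) == read_beside_write(initial, 2, 21, 0)
# Same address: the reordered read returns the stale value.
stale_result = write_then_read(initial, 2, 21, 2) != read_beside_write(initial, 2, 21, 2)
```

Because mem_wr returns a new tuple rather than updating in place, every operation's dependence on "the previous array" is explicit in the code, which is the property the dataflow diagrams rely on.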
Algo: mem[wr1_addr] := data1;
      mem[wr2_addr] := data2;
[Dataflow diagram: two mem(wr) nodes; signals data1, wr1_addr, data2, wr2_addr, mem]
Write after Write
Algo: mem[wr1_addr] := data1;
      mem[wr2_addr] := data2;
[Dataflow diagram: two mem(wr) nodes; signals data1, wr1_addr, data2, wr2_addr, mem]
Scheduling option when wr1_addr ≠ wr2_addr
Algo: rd_data := mem[rd_addr];
      mem[wr_addr] := wr_data;
[Dataflow diagram: nodes mem(rd) and mem(wr); signals rd_addr, rd_data, wr_addr, wr_data, mem]
Write after Read
Algo: rd_data := mem[rd_addr];
      mem[wr_addr] := wr_data;
[Dataflow diagram: nodes mem(rd) and mem(wr); signals rd_addr, rd_data, wr_addr, wr_data, mem]
Optimization when rd_addr ≠ wr_addr
2.7.5 Example: Memory Array and Dataflow Diagram
1  M[2] := 21
2  M[3] := 31
3  A := M[2]
4  B := M[0]
5  M[3] := 32
6  M[0] := 01
7  C := M[3]
[Initial dataflow diagram: one M(wr) or M(rd) node per operation, numbered 1 to 7]
Figure 2.6: Memory array example code and initial dataflow diagram
The dependency and anti-dependency arrows in the dataflow diagram in Figure 2.6 are based solely
upon whether an operation is a read or a write. The arrows do not take into account the address
that is read from or written to.
In Figure 2.7, we have used knowledge about which addresses we are accessing to remove unneeded
dependencies. These are the real dependencies and match those shown in the code fragment for
Figure 2.6. In Figure 2.8 we have placed an ordering on the read operations and an ordering on the
write operations. The ordering is derived by obeying data dependencies and then rearranging the
operations to perform as many operations in parallel as possible.
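The derivation of such an ordering can be sketched as a greedy scheduler in Python. This is an illustrative sketch, not the procedure used to draw the figures: the names Op, depends, and schedule are hypothetical, and the resource limit is an assumption (a dual-port memory that performs at most two operations per clock cycle, of which at most one is a write). An operation becomes ready once every earlier operation it depends on has been placed in an earlier cycle.

```python
from typing import Dict, List, NamedTuple

class Op(NamedTuple):
    kind: str   # "wr" or "rd"
    addr: int
    label: str

PROGRAM = [
    Op("wr", 2, "M[2] := 21"), Op("wr", 3, "M[3] := 31"),
    Op("rd", 2, "A := M[2]"),  Op("rd", 0, "B := M[0]"),
    Op("wr", 3, "M[3] := 32"), Op("wr", 0, "M[0] := 01"),
    Op("rd", 3, "C := M[3]"),
]

def depends(earlier: Op, later: Op) -> bool:
    """RAW, WAW, or WAR: same address and at least one of the two is a write."""
    return earlier.addr == later.addr and "wr" in (earlier.kind, later.kind)

def schedule(ops: List[Op]) -> List[List[str]]:
    """Greedy schedule: per cycle at most two operations, at most one write."""
    cycle_of: Dict[int, int] = {}          # operation index -> clock cycle
    cycles: List[List[str]] = []
    remaining = list(range(len(ops)))
    while remaining:
        chosen: List[int] = []
        writes = 0
        for i in remaining:
            # Ready only if all earlier dependencies sit in earlier cycles.
            ready = all(j in cycle_of for j in range(i) if depends(ops[j], ops[i]))
            if not ready:
                continue
            if len(chosen) == 2:
                break                      # both memory ports are used
            if ops[i].kind == "wr" and writes == 1:
                continue                   # only one write per cycle
            chosen.append(i)
            writes += ops[i].kind == "wr"
        for i in chosen:
            cycle_of[i] = len(cycles)
        cycles.append([ops[i].label for i in chosen])
        remaining = [i for i in remaining if i not in chosen]
    return cycles

CYCLES = schedule(PROGRAM)
```

For the seven-operation example this sketch needs four clock cycles; the third cycle holds only M[3] := 32, because the one remaining write must wait its turn and C := M[3] cannot start until that write is done.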
[Dataflow diagram: the M(wr) and M(rd) nodes of Figure 2.6, keeping only the dependencies between operations on the same address]
Figure 2.7: Memory array with minimal dependencies
[Dataflow diagram: Figure 2.7 with ordering numbers on the read operations and on the write operations]
Figure 2.8: Memory array with orderings
[Dataflow diagram: the M(wr) and M(rd) nodes with their final ordering numbers]
Figure 2.9: Final version of Figure 2.6
Put as many parallel operations into the same clock cycle as allowed by resources (one write + one
read, two reads, or one write for dual-port RAM). Preserve dependencies by putting dependent
operations in separate clock cycles.
2.8 Input / Output Protocols
An important aspect of hardware design is choosing an input/output protocol that is easy to
implement and suits both your circuit and your environment. Here are a few simple and common
protocols.
[Timing diagram: signals rdy, data, ack]
Figure 2.10: Four phase handshaking protocol
Used when timing of communication between producer and consumer is unpredictable. The dis-
advantage is that it is cumbersome to implement and slow to execute.
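The four phases can be sketched as a small Python simulation. The exact waveform of Figure 2.10 is not reproduced here, so the sequence below is the conventional four-phase (return-to-zero) handshake and should be read as an assumption: the producer raises rdy with valid data, the consumer captures the data and raises ack, the producer lowers rdy, and the consumer lowers ack. The function name simulate is hypothetical.

```python
from typing import List, Tuple

def simulate(values: List[int]) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Transfer each value with one four-phase handshake.

    Returns the values captured by the consumer and a trace of (rdy, ack)
    after each signal change. Exactly one signal changes per step.
    """
    rdy = ack = 0
    data = None
    pending = list(values)
    received: List[int] = []
    trace: List[Tuple[int, int]] = []
    while pending or rdy or ack:
        if not rdy and not ack:
            data = pending.pop(0)      # phase 1: producer drives data, raises rdy
            rdy = 1
        elif rdy and not ack:
            received.append(data)      # phase 2: consumer captures data, raises ack
            ack = 1
        elif rdy and ack:
            rdy = 0                    # phase 3: producer sees ack, lowers rdy
        else:
            ack = 0                    # phase 4: consumer sees rdy low, lowers ack
        trace.append((rdy, ack))
    return received, trace

RECEIVED, TRACE = simulate([5, 7])
```

Each value costs four signal transitions, and neither side advances until it has seen the other side's response. That is why the protocol tolerates unpredictable timing, and also why it is slow.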
[Timing diagram: signals clk, data, valid]
Figure 2.11: Valid-bit protocol
A low overhead (both in area and performance) protocol. Consumer must always be able to accept
incoming data. Often used in pipelined circuits. More complicated versions of the protocol can
handle pipeline stalls.
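A cycle-by-cycle Python model gives one way to picture the valid-bit protocol in a pipelined circuit. It is a sketch under assumptions that the notes do not state: valid is active high, the pipeline has two register stages, and the computation (add one) is a placeholder. The point is only that the valid bit travels alongside the data, so the consumer knows which cycles carry a result.

```python
from typing import Callable, List, Optional, Tuple

Sample = Tuple[int, Optional[int]]   # (valid, data) as seen at one clock edge

def pipeline(inputs: List[Sample], stages: int = 2,
             f: Callable[[int], int] = lambda x: x + 1) -> List[Sample]:
    """Return the (valid, data) pair visible at the output in each clock cycle."""
    regs: List[Sample] = [(0, None)] * stages    # pipeline registers, initially empty
    outputs: List[Sample] = []
    for valid, data in inputs:
        outputs.append(regs[-1])                 # output of the last stage this cycle
        first = (1, f(data)) if valid else (0, None)
        regs = [first] + regs[:-1]               # clock edge: shift everything forward
    return outputs

INPUTS: List[Sample] = [(1, 10), (0, None), (1, 20), (0, None), (0, None)]
OUTPUTS = pipeline(INPUTS)
ACCEPTED = [data for valid, data in OUTPUTS if valid]   # consumer keeps valid cycles only
```

There is no signal from the consumer back to the producer in this model, which is what makes the protocol cheap and also why the consumer must always be ready.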
[Timing diagram: signals clk, data_in, start, done, data_out]
Figure 2.12: Start/Done protocol
A low overhead (both in area and performance) protocol. Useful when a circuit works on one piece
of data at a time and the time to compute the result is unpredictable.
2.9 Design Example: Massey
We'll go through the following artifacts:
1. requirements
2. algorithm
3. dataflow diagram
4. high-level models
5. hardware block diagram
6. RTL code for datapath
7. state machine
8. RTL code for control
Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1. Scheduling (allocate operations to clock cycles)
2. I/O allocation
3. First high-level model
4. Register allocation
5. Datapath allocation
6. Connect datapath components, insert muxes where needed
7. Design implicit state machine
8. Optimize
9. Design explicit-current state machine
10. Optimize
2.9.1 Requirements
Functional requirements:
Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f
Use registers on both inputs and outputs
Performance requirements:
Maximum clock period: unlimited
Maximum latency: four
Cost requirements:
Maximum of two adders
Small miscellaneous hardware (e.g. muxes) is unlimited
Maximum of three inputs and one output
Design effort is unlimited
Note: In reality multiplexers are not free. In FPGAs, a 2:1 mux is more
expensive than a full-adder. A 2:1 mux has three inputs while an adder has only
two inputs (the carry-in and carry-out signals usually use the special
vertical connections on the FPGA cell). In FPGAs, sharing an adder between two
signals can be more expensive than having two adders. In a generic-gate
technology, a multiplexer contains three two-input gates, while a full-adder
contains fourteen two-input gates.
2.9.2 Algorithm
We'll use parentheses to group operations so as to maximize our opportunities to perform the work
in parallel:
z = (a + b) + (c + d) + (e + f)
This results in the following data-dependency graph:
[Data-dependency graph: inputs a, b, c, d, e, f feeding five + nodes]
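The effect of the grouping can be checked with a short Python sketch. The nested-tuple encoding and the names depth, adders, evaluate, GROUPED, and SERIAL are assumptions made for illustration. The sketch compares the grouped expression with a strictly left-to-right chain: both use five additions, but the longest chain of dependent additions is shorter in the grouped form, which is what leaves room to work in parallel.

```python
from typing import Dict, Tuple, Union

# A leaf is an input name; a pair is the sum of its two sub-expressions.
Expr = Union[str, Tuple["Expr", "Expr"]]

GROUPED: Expr = ((("a", "b"), ("c", "d")), ("e", "f"))    # (a+b) + (c+d) + (e+f)
SERIAL: Expr = ((((("a", "b"), "c"), "d"), "e"), "f")     # ((((a+b)+c)+d)+e)+f

def depth(e: Expr) -> int:
    """Length of the longest chain of dependent additions."""
    return 0 if isinstance(e, str) else 1 + max(depth(e[0]), depth(e[1]))

def adders(e: Expr) -> int:
    """Total number of additions in the expression."""
    return 0 if isinstance(e, str) else 1 + adders(e[0]) + adders(e[1])

def evaluate(e: Expr, env: Dict[str, int]) -> int:
    """Compute the sum; the grouping does not change the result."""
    return env[e] if isinstance(e, str) else evaluate(e[0], env) + evaluate(e[1], env)

ENV = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}    # arbitrary sample inputs
```

With these definitions, depth(GROUPED) is 3 and depth(SERIAL) is 5, while adders is 5 for both and evaluate returns the same sum.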
2.9.3 Initial Dataflow Diagram
[Dataflow diagram: inputs a, b, c, d, e, f; + nodes; output z]
This dataflow diagram violates the requirement to use at most three inputs.
Scheduling to Optimize Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Original parallel        Parallel after scheduling
a b c d e f              a b c d
2.9.4 Dataflow Diagram Scheduling
We can potentially optimize the inputs, outputs, area, and performance of a dataflow diagram by
rescheduling the operations, that is allocating the operations to different clock cycles.
Parallel algorithms have higher performance and greater scheduling flexibility than serial algo-
rithms
Ser