Contents Timing optimization Area optimization Additional readings
Budapest University of Technology and Economics
RTL Optimization Techniques
Péter Horváth
Department of Electron Devices
August 7, 2014
Contents
timing optimization concepts and design techniques
  throughput, latency, local datapath delay
  loop unrolling, removing pipeline registers, register balancing
area optimization concepts and design techniques
  resource requirement metrics in standard cell ASIC and FPGA
  control-based logic reuse, priority encoders, considering technology primitives
additional readings
Timing optimization
Computation performance concepts
There are three important concepts related to computation performance.
throughput: The amount of data processed per clock cycle (bits per cycle; multiplied by the clock frequency, this gives bits per second).
latency: The time elapsed between a data input and the corresponding processed data output (clock cycles).
local datapath delay: The delay of the logic between storage elements (nanoseconds). It determines the maximum clock frequency.
Timing optimization techniques
High throughput: loop unrolling (pipeline)
During high-throughput optimization, the time required to process a single data item is irrelevant; the time elapsed between two input reads is minimized.
Data n+1 is read while data n is still being processed.
architecture iterative of pow3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (start = '1') then
        count ...
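The listing above breaks off in the source. As a minimal sketch of what an iterative x^3 unit could look like, assuming 32-bit unsigned operands, products truncated to the low 32 bits, and an input x that is held stable while the computation runs: the port list, the `finished` flag and the `pow_reg` signal are illustrative assumptions, not taken from the original slide.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Assumed port list; only clk, start, x and pow appear in the source.
entity pow3 is
  port (
    clk      : in  std_logic;
    start    : in  std_logic;
    x        : in  unsigned(31 downto 0);
    pow      : out unsigned(31 downto 0);
    finished : out std_logic               -- hypothetical status flag
  );
end entity pow3;

architecture iterative_sketch of pow3 is
  signal count   : unsigned(1 downto 0)  := (others => '0');
  signal pow_reg : unsigned(31 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if start = '1' then
        -- Cycle 1: load the operand; two multiplications remain.
        pow_reg <= x;
        count   <= to_unsigned(2, count'length);
      elsif count /= 0 then
        -- Cycles 2 and 3: reuse the single multiplier.
        -- resize keeps the low 32 bits of the 64-bit product.
        pow_reg <= resize(pow_reg * x, pow_reg'length);
        count   <= count - 1;
      end if;
    end if;
  end process;

  pow      <= pow_reg;
  finished <= '1' when count = 0 else '0';
end architecture iterative_sketch;
```

In this sketch a new input can be accepted only every third cycle, which is why the iterative form has the lower throughput.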
[Figure: iterative implementation; register pow; signals clk, start, x[31:0], pow[31:0]]
throughput: 8/3 ≈ 2.7 bits/cycle; latency: 3 cycles
[Figure: pipelined implementation; registers x1, x2, pow1, pow; signals clk, x[31:0], pow[31:0]]
throughput: 8/1 = 8 bits/cycle; latency: 3 cycles
Low latency: removing pipeline registers
The objective of low-latency optimization is to pass the data from the input to the output with minimal internal processing delay.
A low-latency design uses parallelism and removes pipeline registers.
architecture pipelined of pow3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- stage 1
      x1 ...
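The pipelined listing is cut off in the source. A possible three-stage version, sketched under the assumption of 32-bit unsigned data with products truncated to the low 32 bits, might look as follows. The register names x1, x2 and pow1 follow the figure labels; the port list is an assumption.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Assumed port list for the pipelined variant.
entity pow3 is
  port (
    clk : in  std_logic;
    x   : in  unsigned(31 downto 0);
    pow : out unsigned(31 downto 0)
  );
end entity pow3;

architecture pipelined_sketch of pow3 is
  signal x1, x2, pow1 : unsigned(31 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- stage 1: capture the operand
      x1   <= x;
      -- stage 2: square it and carry the operand along
      x2   <= x1;
      pow1 <= resize(x1 * x1, pow1'length);
      -- stage 3: multiply the square by the delayed operand
      pow  <= resize(pow1 * x2, pow'length);
    end if;
  end process;
end architecture pipelined_sketch;
```

A new x can enter every clock cycle, while each result still appears three cycles after its input.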
[Figure: pipelined implementation; registers x1, x2, pow1, pow; signals clk, x[31:0], pow[31:0]]
latency: 3 cycles
[Figure: low-latency implementation; only the output register pow remains; signals clk, x[31:0], pow[31:0]]
latency: 1 cycle
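A minimal sketch of the low-latency variant, assuming the same 32-bit unsigned interface as before and a single remaining output register as suggested by the second figure. The intermediate signal `x_sq` is a hypothetical name.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Assumed port list for the low-latency variant.
entity pow3 is
  port (
    clk : in  std_logic;
    x   : in  unsigned(31 downto 0);
    pow : out unsigned(31 downto 0)
  );
end entity pow3;

architecture low_latency_sketch of pow3 is
  signal x_sq : unsigned(31 downto 0);   -- hypothetical combinational signal
begin
  -- The former pipeline stages are now purely combinational.
  x_sq <= resize(x * x, x_sq'length);

  process (clk)
  begin
    if rising_edge(clk) then
      -- Only the output is registered: latency is one cycle.
      pow <= resize(x_sq * x, pow'length);
    end if;
  end process;
end architecture low_latency_sketch;
```

The trade-off is that two multipliers now sit in series between the input and the output register, so the local datapath delay grows and the maximum clock frequency is likely to drop.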
Minimizing logic delay: register layers
The logic between two sequential elements is called a local datapath.
The delay of the slowest local datapath determines the maximum clock frequency.
The local datapath delay can be reduced by adding register layers.
architecture single_cycle of fir is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (valid = '1') then
        x1 ...
[Figure: single-cycle FIR filter; registers x1, x2, y; coefficients A[31:0], B[31:0], C[31:0]; input x[31:0]; output y[31:0]]
local datapaths: 1 adder and 1 multiplier
[Figure: FIR filter with an additional register layer; registers x1, x2, prod1, prod2, prod3, y; coefficients A[31:0], B[31:0], C[31:0]; input x[31:0]; output y[31:0]]
local datapaths: 1 adder or 1 multiplier
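The second structure can be sketched roughly as below. This is an illustration only: it assumes 32-bit unsigned samples and coefficients, products truncated to 32 bits, and coefficient ports named a, b and c. Register names follow the figure labels.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Assumed port list; coefficients are modelled as input ports.
entity fir is
  port (
    clk     : in  std_logic;
    valid   : in  std_logic;
    a, b, c : in  unsigned(31 downto 0);
    x       : in  unsigned(31 downto 0);
    y       : out unsigned(31 downto 0)
  );
end entity fir;

architecture extra_register_layer_sketch of fir is
  signal x1, x2              : unsigned(31 downto 0) := (others => '0');
  signal prod1, prod2, prod3 : unsigned(31 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if valid = '1' then
        -- delay line for the input samples
        x1 <= x;
        x2 <= x1;
        -- register layer after the multipliers
        prod1 <= resize(a * x,  prod1'length);
        prod2 <= resize(b * x1, prod2'length);
        prod3 <= resize(c * x2, prod3'length);
        -- the summation now starts from registered products
        y <= prod1 + prod2 + prod3;
      end if;
    end if;
  end process;
end architecture extra_register_layer_sketch;
```

No register-to-register path in this sketch contains both a multiplier and the summation, at the price of one extra cycle of latency.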
Minimizing logic delay: register balancing
During register balancing, the logic between registers is redistributed in order to minimize the worst-case delay between any pair of registers.
architecture not_balanced of add3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      reg_a ...
[Figure: add3 without register balancing; input registers (reg_a, reg_b, ...) followed by both adders and the sum register; output sum[31:0]]
local datapaths: 2 adders
[Figure: add3 with register balancing; inputs in_a[31:0], in_b[31:0], in_c[31:0]; registers reg_ab_sum, reg_c, sum; output sum[31:0]]
local datapaths: 1 adder
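The balanced structure in the second figure could be written along these lines. This is a sketch assuming 32-bit unsigned operands with wrap-around addition; the port list is an assumption, and the register names follow the figure.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Assumed port list for the three-input adder.
entity add3 is
  port (
    clk              : in  std_logic;
    in_a, in_b, in_c : in  unsigned(31 downto 0);
    sum              : out unsigned(31 downto 0)
  );
end entity add3;

architecture balanced_sketch of add3 is
  signal reg_ab_sum, reg_c : unsigned(31 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- first register layer: one adder moved in front of it
      reg_ab_sum <= in_a + in_b;
      reg_c      <= in_c;
      -- second register layer: the remaining adder
      sum        <= reg_ab_sum + reg_c;
    end if;
  end process;
end architecture balanced_sketch;
```

The register count and latency stay the same as in the unbalanced form, but each register-to-register path now contains a single adder instead of two.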
Area optimization
Area concepts
The resource requirement is the number of basic functional primitives required to implement the described functionality.
The basic functional primitives in standard cell ASICs are the standard cells, which can be simple logic gates and flip-flops, but also more complex arithmetic-logic functions or memories.
The basic logic element (BLE) of an FPGA consists of a logic function (the number of inputs depends on the vendor and the device family), a flip-flop and a multiplexer. There are special-purpose resources as well, such as memory blocks and signal processing elements (multipliers).
Area optimization techniques
Minimizing area: control-based logic reuse
Control-based logic reuse can be considered the opposite of loop unrolling. A pipeline requires internal data storage resources and additional logic to implement parallel operation. These resources can be reused at the cost of reduced throughput.
[Figure: pipelined summation of in1, in2, in3, in4 using several adders, pipeline registers plr1 and plr2, and the accumulator register acc; signals clk, reset, ce, zero]
[Figure: control-based logic reuse; a multiplexer (sel: 0, 1, 2, 3) selects among in1, in2, in3, in4 and a single adder feeds the accumulator register acc; the FSM generates sel_input, ce_acc and zero; signals clk, reset]
Control-based logic reuse requires an FSM to generate control signals.
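As a rough sketch of the reuse idea, the following model sums four inputs with one adder over four clock cycles. It assumes 32-bit unsigned inputs that are held stable during the four cycles; the entity name `sum4`, the `done` flag and the two-bit counter acting as a very small FSM are illustrative assumptions rather than the structure on the slide.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: sums in1..in4 with a single shared adder.
entity sum4 is
  port (
    clk, reset         : in  std_logic;
    in1, in2, in3, in4 : in  unsigned(31 downto 0);
    acc                : out unsigned(31 downto 0);
    done               : out std_logic   -- high for one cycle per finished sum
  );
end entity sum4;

architecture reuse_sketch of sum4 is
  signal sel_input : unsigned(1 downto 0)  := "00";  -- FSM state / mux select
  signal acc_reg   : unsigned(31 downto 0) := (others => '0');
  signal operand   : unsigned(31 downto 0);
begin
  -- input multiplexer controlled by the FSM state
  with sel_input select operand <=
    in1 when "00",
    in2 when "01",
    in3 when "10",
    in4 when others;

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        sel_input <= "00";
        acc_reg   <= (others => '0');
        done      <= '0';
      else
        if sel_input = "00" then
          acc_reg <= operand;            -- start a new sum
        else
          acc_reg <= acc_reg + operand;  -- the one shared adder
        end if;
        if sel_input = "11" then
          done <= '1';                   -- fourth operand is being added
        else
          done <= '0';
        end if;
        sel_input <= sel_input + 1;      -- wraps from 3 back to 0
      end if;
    end if;
  end process;

  acc <= acc_reg;
end architecture reuse_sketch;
```

Compared with a fully parallel adder tree, this uses one adder and one register but delivers a result only every fourth cycle.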
Minimizing area: priority encoders
The resource requirement can be reduced if mutual exclusion is exploited. The elsif statement should be used only if a priority encoder is required, that is, when the conditions are not mutually exclusive.
architecture priority of logic is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (ctrl(0) = '1') then
        output(0) ...
[Figure: registers output_a, output_b, output_c, output_d, each loading input[31:0] through a two-input multiplexer; the select of each multiplexer depends on several bits of ctrl[3:0]]
without exploiting mutual exclusion
[Figure: registers output_a, output_b, output_c, output_d, each loading input[31:0] through a two-input multiplexer; each select is driven by a single bit of ctrl[3:0]]
with exploiting mutual exclusion
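A sketch of the coding style that exploits mutual exclusion is shown below. It assumes that at most one bit of ctrl is set at a time; if that does not hold, its behaviour differs from an elsif chain. The port list is an assumption and the signal names follow the figure labels.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Assumed port list; names follow the figure labels.
entity logic is
  port (
    clk      : in  std_logic;
    ctrl     : in  std_logic_vector(3 downto 0);
    input    : in  std_logic_vector(31 downto 0);
    output_a : out std_logic_vector(31 downto 0);
    output_b : out std_logic_vector(31 downto 0);
    output_c : out std_logic_vector(31 downto 0);
    output_d : out std_logic_vector(31 downto 0)
  );
end entity logic;

architecture no_priority_sketch of logic is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Independent if statements: each register enable depends on
      -- exactly one ctrl bit, so no priority logic is described.
      if ctrl(0) = '1' then
        output_a <= input;
      end if;
      if ctrl(1) = '1' then
        output_b <= input;
      end if;
      if ctrl(2) = '1' then
        output_c <= input;
      end if;
      if ctrl(3) = '1' then
        output_d <= input;
      end if;
    end if;
  end process;
end architecture no_priority_sketch;
```

With an elsif chain, the enable of output_d would depend on all four ctrl bits, which is the extra logic visible in the first figure.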
Minimizing area: considering technology primitives
With an appropriate HDL coding style, more efficient logic synthesis can be achieved. Synthesis tool vendors usually provide coding guidelines to improve the resource requirement or the timing parameters of the design. The proposed coding style takes the unique characteristics of the technology primitives into consideration.
utilizing block RAM modules in FPGAs: Block RAM modules have no reset for the stored contents, and their outputs are synchronous to a clock signal. Only HDL models with these properties can be implemented in block RAMs.
utilizing high-quality DSP units: The DSP slices in FPGAs have synchronous outputs. This restriction has to be taken into account when writing the HDL model.
architecture FFS of RAM is
begin
  process (clk)
  begin
    if (reset = '1') then
      content <= (others => (others=>'0'));
    elsif (rising_edge(clk)) then
      if (write = '1') then
        content(address) ...
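For contrast with the flip-flop based model above, a memory description without a content reset and with a synchronous read might look like the sketch below. Many synthesis tools can map this style to block RAM, but whether inference actually happens depends on the tool and the device, so the vendor's coding guidelines remain the reference. The generics, the port list and the `data_in`/`data_out` names are assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Assumed generics and port list for a single-port memory.
entity ram is
  generic (
    addr_width : natural := 8;
    data_width : natural := 32
  );
  port (
    clk      : in  std_logic;
    write    : in  std_logic;
    address  : in  unsigned(addr_width - 1 downto 0);
    data_in  : in  std_logic_vector(data_width - 1 downto 0);
    data_out : out std_logic_vector(data_width - 1 downto 0)
  );
end entity ram;

architecture bram_sketch of ram is
  type mem_t is array (0 to 2**addr_width - 1)
    of std_logic_vector(data_width - 1 downto 0);
  signal content : mem_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- no reset branch: the memory contents are never cleared
      if write = '1' then
        content(to_integer(address)) <= data_in;
      end if;
      -- synchronous (registered) read; returns the old word
      -- when the same address is written in this cycle
      data_out <= content(to_integer(address));
    end if;
  end process;
end architecture bram_sketch;
```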
Additional readings
Steve Kilts: Advanced FPGA Design: Architecture, Implementation, and Optimization
David Money Harris, Sarah L. Harris: Digital Design and Computer Architecture
Peter J. Ashenden: Digital Design: An Embedded Systems Approach Using VHDL
M. Morris Mano, Charles R. Kime: Logic and Computer Design Fundamentals
Pong P. Chu: RTL Hardware Design Using VHDL
Peter Wilson: Design Recipes for FPGAs