07 RTL Optimization Techniques


Budapest University of Technology and Economics

RTL Optimization Techniques

Péter Horváth

Department of Electron Devices

August 7, 2014



Contents

timing optimization concepts and design techniques
  throughput, latency, local datapath delay
  loop unrolling, removing pipeline registers, register balancing

area optimization concepts and design techniques
  resource requirement metrics in standard cell ASIC and FPGA
  control-based logic reuse, priority encoders, considering technology primitives

additional readings



Timing optimization



Computation performance concepts

There are three important concepts related to computation performance.

throughput: The amount of data processed per clock cycle (bits per cycle; multiplying by the clock frequency gives bits per second).
latency: The time elapsed between data input and processed data output (clock cycles).
local datapath delay: The delay of the logic between storage elements (nanoseconds). It determines the maximum clock frequency.
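The three quantities are linked through the clock frequency. The relations below are a simplified sketch: the symbols are introduced here for illustration, and flip-flop setup time and clock skew are ignored.

```latex
f_{\max} \approx \frac{1}{t_{d,\max}}, \qquad
T_{\text{bits/s}} = T_{\text{bits/cycle}} \cdot f_{clk}, \qquad
L_{\text{s}} = \frac{L_{\text{cycles}}}{f_{clk}}
```

For example, assuming a hypothetical worst-case local datapath delay of 5 ns, the clock can run at roughly 200 MHz. A design that processes 8 bits per cycle with a latency of 3 cycles then delivers about 1.6 Gbit/s with a latency of 15 ns.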



Timing optimization techniques

High throughput: loop unrolling (pipeline)

During high-throughput optimization the time required to process a single data item is irrelevant; instead, the time elapsed between two consecutive input reads is minimized.
Data n+1 is read while data n is still being processed.

architecture iterative of pow3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (start = '1') then
        count ...
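A complete iterative design of this kind could look like the following minimal sketch. It is not the original listing: the port list, the `finished` flag, the internal signals `x_reg` and `acc`, and the counter encoding are assumptions made for illustration. One multiplier is reused over several cycles, so a new input can only be accepted once every three cycles.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: port names and widths are assumptions.
entity pow3 is
  port (
    clk      : in  std_logic;
    start    : in  std_logic;
    x        : in  unsigned(31 downto 0);
    pow      : out unsigned(31 downto 0);
    finished : out std_logic
  );
end entity pow3;

architecture iterative_sketch of pow3 is
  signal x_reg : unsigned(31 downto 0);
  signal acc   : unsigned(31 downto 0);
  signal count : unsigned(1 downto 0) := "00";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if start = '1' then
        x_reg <= x;        -- keep the operand for the next iterations
        acc   <= x;        -- acc = x^1
        count <= "10";     -- two multiplications remain
      elsif count /= 0 then
        -- the single multiplier is reused in every iteration;
        -- the 64-bit product is truncated to the lower 32 bits
        acc   <= resize(acc * x_reg, 32);
        count <= count - 1;
      end if;
    end if;
  end process;

  pow      <= acc;
  finished <= '1' when count = 0 else '0';
end architecture iterative_sketch;
```

The result is valid three clock edges after `start` is sampled, and `start` must not be asserted again before `finished` returns to '1'.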



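[Figure: block diagram of the iterative implementation. Signals: clk, start, x[31:0], pow[31:0]; 32-bit data paths. throughput: 8/3 = 2.7 bits/cycle; latency: 3 cycles]

[Figure: block diagram of the pipelined (unrolled) implementation. Registers x1, x2, pow1 and the output register pow, all clocked by clk; input x[31:0], output pow[31:0]; 32-bit data paths. throughput: 8/1 = 8 bits/cycle; latency: 3 cycles]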
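The unrolled version can be sketched as follows. The register names `x1`, `x2` and `pow1` follow the block diagram; the port list and the truncation of the products to 32 bits are assumptions. Each iteration of the loop gets its own multiplier and register stage, so a new input is accepted on every clock edge.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: port names and widths are assumptions.
entity pow3 is
  port (
    clk : in  std_logic;
    x   : in  unsigned(31 downto 0);
    pow : out unsigned(31 downto 0)
  );
end entity pow3;

architecture pipelined_sketch of pow3 is
  signal x1, x2 : unsigned(31 downto 0);
  signal pow1   : unsigned(31 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- stage 1: capture the input
      x1   <= x;
      -- stage 2: first multiplication, operand forwarded alongside
      x2   <= x1;
      pow1 <= resize(x1 * x1, 32);
      -- stage 3: second multiplication
      pow  <= resize(pow1 * x2, 32);
    end if;
  end process;
end architecture pipelined_sketch;
```

The latency is still three cycles, but three different data items are in flight at the same time, at the cost of a second multiplier and the extra registers.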




Low latency: removing pipeline registers

The objective of low-latency optimization is to pass the data from the input to the output with minimal internal processing delay.
A low-latency design uses parallelism and removes pipeline registers.

architecture pipelined of pow3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- stage 1
      x1 ...



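[Figure: block diagram of the pipelined implementation. Registers x1, x2, pow1 and the output register pow, all clocked by clk; input x[31:0], output pow[31:0]; 32-bit data paths. latency: 3 cycles]

[Figure: block diagram of the low-latency implementation. The intermediate pipeline registers are removed and only the output register pow remains; input x[31:0], output pow[31:0]; 32-bit data paths. latency: 1 cycle]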
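A low-latency variant with only the output register left could be sketched like this. The port list is an assumption, and the products are truncated to 32 bits as a simplification.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: port names and widths are assumptions.
entity pow3 is
  port (
    clk : in  std_logic;
    x   : in  unsigned(31 downto 0);
    pow : out unsigned(31 downto 0)
  );
end entity pow3;

architecture low_latency_sketch of pow3 is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- both multiplications are chained combinationally;
      -- only the result is registered
      pow <= resize(resize(x * x, 32) * x, 32);
    end if;
  end process;
end architecture low_latency_sketch;
```

The result is available after a single clock edge, but the local datapath now contains two multipliers in series, which lowers the achievable clock frequency.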




Minimizing logic delay: register layers

The logic between two sequential elements is called a local datapath.
The delay of the slowest local datapath determines the maximum clock frequency.
The local datapath delay can be reduced by inserting additional register layers.

architecture single_cycle of fir is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (valid = '1') then
        x1 ...



[Figure: block diagram of the single-cycle FIR filter. Registers x1, x2 and y clocked by clk; coefficients A[31:0], B[31:0], C[31:0]; input x[31:0], output y[31:0]; 32-bit data paths. local datapaths: 1 adder and 1 multiplier]

[Figure: block diagram of the FIR filter with an additional register layer. Registers x1, x2, prod1, prod2, prod3 and y clocked by clk; coefficients A[31:0], B[31:0], C[31:0]; input x[31:0], output y[31:0]; 32-bit data paths. local datapaths: 1 adder or 1 multiplier]
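The version with the additional register layer could be sketched as follows. The register names follow the block diagram; the port list, the use of coefficient input ports `a`, `b`, `c`, and the truncation of the products to 32 bits are assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: port names and widths are assumptions.
entity fir is
  port (
    clk     : in  std_logic;
    valid   : in  std_logic;
    a, b, c : in  unsigned(31 downto 0);  -- filter coefficients
    x       : in  unsigned(31 downto 0);
    y       : out unsigned(31 downto 0)
  );
end entity fir;

architecture registered_products_sketch of fir is
  signal x1, x2              : unsigned(31 downto 0);
  signal prod1, prod2, prod3 : unsigned(31 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if valid = '1' then
        -- delay line
        x1 <= x;
        x2 <= x1;
        -- extra register layer: the products are stored first ...
        prod1 <= resize(a * x, 32);
        prod2 <= resize(b * x1, 32);
        prod3 <= resize(c * x2, 32);
        -- ... and summed in the following cycle
        y <= prod1 + prod2 + prod3;
      end if;
    end if;
  end process;
end architecture registered_products_sketch;
```

No register-to-register path contains both a multiplier and the summation any more, in exchange for one additional cycle of latency and three extra 32-bit registers.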




Minimizing logic delay: register balancing

During register balancing the logic between registers is redistributed in order to minimize the worst-case delay between any pair of registers.

architecture not_balanced of add3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      reg_a ...



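[Figure: block diagram of the unbalanced adder. The inputs (in_a[31:0], in_b[31:0], ...) are stored in input registers (reg_a, reg_b, ...), followed by two chained adders and the output register sum; output sum[31:0]; 32-bit data paths. local datapaths: 2 adders]

[Figure: block diagram of the balanced adder. Inputs in_a[31:0], in_b[31:0], in_c[31:0]; the first adder feeds register reg_ab_sum, in_c is stored in reg_c, and the second adder feeds the output register sum; output sum[31:0]; 32-bit data paths. local datapaths: 1 adder]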
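The balanced structure could be written as the following sketch. The signal names follow the block diagram; the port list is an assumption.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: port names and widths are assumptions.
entity add3 is
  port (
    clk              : in  std_logic;
    in_a, in_b, in_c : in  unsigned(31 downto 0);
    sum              : out unsigned(31 downto 0)
  );
end entity add3;

architecture balanced_sketch of add3 is
  signal reg_ab_sum : unsigned(31 downto 0);
  signal reg_c      : unsigned(31 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- the first adder is moved in front of the first register layer
      reg_ab_sum <= in_a + in_b;
      reg_c      <= in_c;
      -- only one adder remains between the two register layers
      sum        <= reg_ab_sum + reg_c;
    end if;
  end process;
end architecture balanced_sketch;
```

The latency is unchanged at two cycles and one register fewer is needed, while the worst-case path between registers is reduced from two adders to one.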



Area optimization



Area concepts

The resource requirement is the number of basic functional primitives needed to implement the described functionality.
The basic functional primitives of standard cell ASICs are the standard cells, which can be simple logic gates and flip-flops but also more complex arithmetic-logic functions or memories.
The basic logic element (BLE) of an FPGA consists of a logic function (the number of inputs depends on the vendor and the device family), a flip-flop and a multiplexer. There are special-purpose resources as well, such as memory blocks and signal processing elements (multipliers).



Area optimization techniques

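Minimizing area: control-based logic reuse

Control-based logic reuse can be regarded as the opposite of loop unrolling. A pipeline requires internal data storage resources and additional logic to implement parallel operation. With logic reuse the same resources are shared between the operations, at the cost of reduced throughput.

[Figure: block diagram of the pipelined accumulator. Inputs in1, in2, in3, in4; three adders; pipeline registers plr1 and plr2 and the accumulator register acc, each with clock enable (ce), clk and reset; control signal zero; 32-bit data paths]

[Figure: block diagram of the accumulator with logic reuse. Inputs in1, in2, in3, in4 selected by a multiplexer (sel: 0 to 3); one adder; accumulator register acc with clock enable, clk and reset; an FSM (clk, reset) generates ce_acc, sel_input and zero; 32-bit data paths]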

Control-based logic reuse requires an FSM to generate control signals.
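A minimal sketch of such a shared-adder accumulator is shown below. The entity name `sum4`, the `done` flag and the use of a two-bit counter as the FSM are assumptions for illustration, and the inputs are assumed to stay stable for four clock cycles.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: name, ports and widths are assumptions.
entity sum4 is
  port (
    clk, reset         : in  std_logic;
    in1, in2, in3, in4 : in  unsigned(31 downto 0);
    acc                : out unsigned(31 downto 0);
    done               : out std_logic
  );
end entity sum4;

architecture reuse_sketch of sum4 is
  signal sel     : unsigned(1 downto 0);   -- FSM state and multiplexer select
  signal operand : unsigned(31 downto 0);
  signal acc_reg : unsigned(31 downto 0);
begin
  -- input multiplexer controlled by the FSM
  with sel select operand <=
    in1 when "00",
    in2 when "01",
    in3 when "10",
    in4 when others;

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        sel     <= "00";
        acc_reg <= (others => '0');
        done    <= '0';
      else
        if sel = 0 then
          acc_reg <= operand;            -- restart the accumulation
        else
          acc_reg <= acc_reg + operand;  -- the single adder is reused
        end if;
        sel <= sel + 1;                  -- wraps around from 3 to 0
        if sel = 3 then
          done <= '1';                   -- acc holds in1 + in2 + in3 + in4
        else
          done <= '0';
        end if;
      end if;
    end if;
  end process;

  acc <= acc_reg;
end architecture reuse_sketch;
```

One adder and one register replace the three adders and the pipeline registers, but a new sum is produced only once every four cycles.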




Minimizing area: priority encoders

The resource requirement can be reduced if mutual exclusion is exploited. The elsif statement should be used only if a priority encoder is required and the conditions are not mutually exclusive.

architecture priority of logic is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (ctrl(0) = '1') then
        output(0) ...



[Figure: block diagram without exploiting mutual exclusion. Four output registers output_a, output_b, output_c, output_d (clk), each loaded from input[31:0] through a 2-to-1 multiplexer; the select signals are derived from several bits of ctrl[3:0] (priority logic); 32-bit data paths]

[Figure: block diagram with exploiting mutual exclusion. Four output registers output_a, output_b, output_c, output_d (clk), each loaded from input through a 2-to-1 multiplexer; each select signal is a single bit of ctrl[3:0]; 32-bit data paths]
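When the control bits are known to be mutually exclusive, the description can use independent if statements instead of an elsif chain. The sketch below assumes the port names from the block diagram; the port list itself is an assumption.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: port names and widths are assumptions.
entity logic is
  port (
    clk      : in  std_logic;
    ctrl     : in  std_logic_vector(3 downto 0);
    input    : in  std_logic_vector(31 downto 0);
    output_a : out std_logic_vector(31 downto 0);
    output_b : out std_logic_vector(31 downto 0);
    output_c : out std_logic_vector(31 downto 0);
    output_d : out std_logic_vector(31 downto 0)
  );
end entity logic;

architecture no_priority_sketch of logic is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- independent conditions: each register enable depends on
      -- exactly one control bit, so no priority logic is inferred
      if ctrl(0) = '1' then
        output_a <= input;
      end if;
      if ctrl(1) = '1' then
        output_b <= input;
      end if;
      if ctrl(2) = '1' then
        output_c <= input;
      end if;
      if ctrl(3) = '1' then
        output_d <= input;
      end if;
    end if;
  end process;
end architecture no_priority_sketch;
```

This is only equivalent to the elsif version if at most one bit of `ctrl` is set at any time; otherwise several registers are loaded in the same cycle.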




Minimizing area: considering technology primitives

With an appropriate HDL coding style a more efficient logic synthesis can be achieved. The synthesis tool vendors usually provide coding guidelines that improve the resource requirement or the timing parameters of the design. The proposed coding style takes the unique characteristics of the technology primitives into consideration.

utilizing block RAM modules in FPGAs: Block RAM modules do not have any reset inputs and their outputs are synchronous to a clock signal. Only HDL models with these properties can be implemented in block RAMs.
utilizing high quality DSP units: The DSP slices in FPGAs have synchronous outputs. This restriction has to be taken into account when writing the HDL model.




architecture FFS of RAM is
begin
  process (clk, reset)
  begin
    if (reset = '1') then
      content <= (others => (others=>'0'));
    elsif (rising_edge(clk)) then
      if (write = '1') then
        content(address) ...
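A description without reset and with a registered read port is the form that synthesis tools can typically map to block RAM. The sketch below is an illustration under assumptions: the port list, the memory depth of 256 words and the 32-bit word width are not taken from the original listing, and the exact inference rules depend on the tool and the device family.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: port names, depth and width are assumptions.
entity ram is
  port (
    clk      : in  std_logic;
    write    : in  std_logic;
    address  : in  unsigned(7 downto 0);
    data_in  : in  std_logic_vector(31 downto 0);
    data_out : out std_logic_vector(31 downto 0)
  );
end entity ram;

architecture bram_sketch of ram is
  type ram_t is array (0 to 255) of std_logic_vector(31 downto 0);
  signal content : ram_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- no reset branch: the memory array is never cleared by logic
      if write = '1' then
        content(to_integer(address)) <= data_in;
      end if;
      -- synchronous (registered) read
      data_out <= content(to_integer(address));
    end if;
  end process;
end architecture bram_sketch;
```

With the reset branch present, as in the FFS architecture, every memory bit must be resettable, so the array is built from individual flip-flops instead.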


Additional readings

Steve Kilts: Advanced FPGA Design: Architecture, Implementation, and Optimization
David Money Harris, Sarah L. Harris: Digital Design and Computer Architecture
Peter J. Ashenden: Digital Design: An Embedded Systems Approach Using VHDL
M. Morris Mano, Charles R. Kime: Logic and Computer Design Fundamentals
Pong P. Chu: RTL Hardware Design Using VHDL
Peter Wilson: Design Recipes for FPGAs

