RTL Optimization Techniques

Péter Horváth
Department of Electron Devices
Budapest University of Technology and Economics
August 7, 2014



    Contents

    timing optimization concepts and design techniques
      throughput, latency, local datapath delay
      loop unrolling, removing pipeline registers, register balancing

    area optimization concepts and design techniques
      resource requirement metrics in standard cell ASIC and FPGA
      control-based logic reuse, priority encoders, considering technology primitives

    additional readings


    Timing optimization


    Computation performance concepts

    There are three important concepts related to computation performance:

    throughput: the amount of data processed per clock cycle (bits per cycle; multiplied by the clock frequency, this gives bits per second).
    latency: the time elapsed between data input and processed data output (clock cycles).
    local datapath delay: the delay of the logic between storage elements (nanoseconds). It determines the maximum clock frequency.
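The three metrics can be related by a rough worked formula. This is only a sketch: the symbols f_clk and t_logic,max and the 100 MHz clock are illustrative assumptions, and the frequency bound ignores register setup and clock-to-output times.

```latex
\[
  \text{throughput}_{\text{bits/s}}
    = \text{throughput}_{\text{bits/cycle}} \cdot f_{\text{clk}},
  \qquad
  f_{\text{clk,max}} \approx \frac{1}{t_{\text{logic,max}}}
\]
\[
  \text{Example: } \frac{8\ \text{bits}}{3\ \text{cycles}} \cdot 100\ \text{MHz}
    \approx 267\ \text{Mbit/s}
\]
```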


    Timing optimization techniques

    High throughput: loop unrolling (pipeline)

    In high-throughput optimization, the time required to process a single data item is irrelevant; the goal is to minimize the time elapsed between two consecutive input reads. Data n+1 is read while data n is still being processed.

    architecture iterative of pow3 is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          if (start = '1') then
            count


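    [Figure: iterative pow3 datapath. Labels: x[31:0], start, 2:1 multiplexer, multiplier, register pow (clk), pow[31:0]; 32-bit buses.]
    throughput: 8/3 ≈ 2.7 bits/cycle; latency: 3 cycles

    [Figure: pipelined pow3 datapath. Labels: x[31:0], registers x1, x2, pow1 and pow (clk), two multipliers, pow[31:0]; 32-bit buses.]
    throughput: 8/1 = 8 bits/cycle; latency: 3 cycles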
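The listings for this slide are truncated in the source, so the following is a minimal sketch of the two pow3 variants rather than the author's code. The entity names, the generic W, the finished output, and the internal signals count and acc are assumptions; the pipeline register names x1, x2 and pow1 follow the figure labels. Results are truncated to W bits.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Iterative variant: one multiplier reused over three clock cycles.
entity pow3_iterative is
  generic (W : positive := 8);               -- assumed data width
  port (
    clk      : in  std_logic;
    start    : in  std_logic;
    x        : in  unsigned(W-1 downto 0);   -- must stay stable while busy
    pow      : out unsigned(W-1 downto 0);
    finished : out std_logic                 -- hypothetical status output
  );
end entity pow3_iterative;

architecture sketch of pow3_iterative is
  signal count : unsigned(1 downto 0)   := (others => '0');
  signal acc   : unsigned(W-1 downto 0) := (others => '0');
begin
  pow      <= acc;
  finished <= '1' when count = 0 else '0';

  process (clk)
  begin
    if rising_edge(clk) then
      if start = '1' then
        acc   <= x;                      -- cycle 1: load x
        count <= to_unsigned(2, 2);      -- two multiplications remain
      elsif count /= 0 then
        acc   <= resize(acc * x, W);     -- cycles 2 and 3: multiply by x
        count <= count - 1;
      end if;
    end if;
  end process;
end architecture sketch;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Unrolled (pipelined) variant: a new x can be accepted every cycle.
entity pow3_pipelined is
  generic (W : positive := 8);               -- assumed data width
  port (
    clk : in  std_logic;
    x   : in  unsigned(W-1 downto 0);
    pow : out unsigned(W-1 downto 0)
  );
end entity pow3_pipelined;

architecture sketch of pow3_pipelined is
  signal x1, x2, pow1 : unsigned(W-1 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      x1   <= x;                         -- stage 1: capture the input
      pow1 <= resize(x1 * x1, W);        -- stage 2: x^2
      x2   <= x1;                        --          forward x alongside
      pow  <= resize(pow1 * x2, W);      -- stage 3: x^3
    end if;
  end process;
end architecture sketch;
```

Both variants deliver a result three clock edges after the input is presented. The iterative one accepts a new input only every third cycle, while the pipelined one accepts one per cycle at the price of a second multiplier and extra registers.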


    Low latency: removing pipeline registers

    The objective of low-latency optimization is to pass the data from the input to the output with minimal internal processing delay. A low-latency design uses parallelism and removes pipeline registers.

    architecture pipelined of pow3 is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          -- stage 1
          x1


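    [Figure: pipelined pow3 datapath. Labels: x[31:0], registers x1, x2, pow1 and pow (clk), two multipliers, pow[31:0]; 32-bit buses.]
    latency: 3 cycles

    [Figure: low-latency pow3 datapath. Labels: x[31:0], two multipliers, single output register pow (clk), pow[31:0]; 32-bit buses.]
    latency: 1 cycle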
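A minimal sketch of the low-latency pow3 described above, assuming the pipeline registers are removed so that only the output register remains. The entity name and the generic W are assumptions; the result is truncated to W bits.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity pow3_low_latency is
  generic (W : positive := 8);               -- assumed data width
  port (
    clk : in  std_logic;
    x   : in  unsigned(W-1 downto 0);
    pow : out unsigned(W-1 downto 0)
  );
end entity pow3_low_latency;

architecture sketch of pow3_low_latency is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Both multiplications are chained combinationally, so the
      -- result is registered one clock edge after x is applied.
      pow <= resize(resize(x * x, W) * x, W);
    end if;
  end process;
end architecture sketch;
```

The trade-off is that the local datapath now contains two multipliers in series, so the achievable clock frequency is typically lower than in the pipelined form.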


    Minimizing logic delay: register layers

    The logic between two sequential elements is called a local datapath. The delay of the slowest local datapath determines the maximum clock frequency. The local datapath delay can be reduced by inserting additional register layers, at the cost of added latency.

    architecture single_cycle of fir is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          if (valid = '1') then
            x1


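    [Figure: single-cycle FIR datapath. Labels: x[31:0], registers x1, x2 and y (clk), three multipliers with coefficients A[31:0], B[31:0], C[31:0], adder, y[31:0]; 32-bit buses.]
    local datapaths: 1 adder and 1 multiplier

    [Figure: FIR datapath with an additional register layer. Labels: x[31:0], registers x1, x2, prod1, prod2, prod3 and y (clk), three multipliers with coefficients A[31:0], B[31:0], C[31:0], adder, y[31:0]; 32-bit buses.]
    local datapaths: 1 adder or 1 multiplier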
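The FIR listing is truncated in the source, so this is a minimal sketch of the variant with the additional register layer, not the original code. The port list, the generic W and the architecture name are assumptions; the register names x1, x2, prod1, prod2 and prod3 follow the figure labels. Products and sums are truncated to W bits.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir is
  generic (W : positive := 32);              -- assumed data width
  port (
    clk     : in  std_logic;
    valid   : in  std_logic;
    a, b, c : in  unsigned(W-1 downto 0);    -- filter coefficients
    x       : in  unsigned(W-1 downto 0);
    y       : out unsigned(W-1 downto 0)
  );
end entity fir;

architecture registered_products_sketch of fir is
  signal x1, x2              : unsigned(W-1 downto 0);
  signal prod1, prod2, prod3 : unsigned(W-1 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if valid = '1' then
        -- delay line
        x1 <= x;
        x2 <= x1;
        -- register layer: each path ends after one multiplier
        prod1 <= resize(a * x,  W);
        prod2 <= resize(b * x1, W);
        prod3 <= resize(c * x2, W);
        -- final path contains only the adder
        y <= prod1 + prod2 + prod3;
      end if;
    end if;
  end process;
end architecture registered_products_sketch;
```

Compared with a single-cycle form that computes a*x + b*x1 + c*x2 in one assignment, the output appears one valid cycle later, but no register-to-register path contains both a multiplier and the adder.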


    Minimizing logic delay: register balancing

    During register balancing, the logic between registers is redistributed in order to minimize the worst-case delay between any pair of registers.

    architecture not_balanced of add3 is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          reg_a


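    [Figure: unbalanced add3 datapath. Labels: in_a[31:0], in_b[31:0], input registers (reg_a, reg_b, as far as the labels are recoverable), two adders in series, register sum (clk), sum[31:0]; 32-bit buses.]
    local datapaths: 2 adders

    [Figure: balanced add3 datapath. Labels: in_a[31:0], in_b[31:0], in_c[31:0], adder feeding register reg_ab_sum, register reg_c, adder feeding register sum (clk), sum[31:0]; 32-bit buses.]
    local datapaths: 1 adder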
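The add3 listing is truncated in the source. The sketch below shows one plausible form of both variants on a single entity. The port list, the generic W, the register reg_c in the unbalanced version and the "_sketch" architecture names are assumptions; the other names follow the figure labels.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add3 is
  generic (W : positive := 32);              -- assumed data width
  port (
    clk              : in  std_logic;
    in_a, in_b, in_c : in  unsigned(W-1 downto 0);
    sum              : out unsigned(W-1 downto 0)
  );
end entity add3;

-- Unbalanced: no logic before the first register layer,
-- two adders in series before the second.
architecture not_balanced_sketch of add3 is
  signal reg_a, reg_b, reg_c : unsigned(W-1 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      reg_a <= in_a;
      reg_b <= in_b;
      reg_c <= in_c;
      sum   <= reg_a + reg_b + reg_c;
    end if;
  end process;
end architecture not_balanced_sketch;

-- Balanced: one adder is moved in front of the first register layer,
-- so every register-to-register path contains a single adder.
architecture balanced_sketch of add3 is
  signal reg_ab_sum, reg_c : unsigned(W-1 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      reg_ab_sum <= in_a + in_b;
      reg_c      <= in_c;
      sum        <= reg_ab_sum + reg_c;
    end if;
  end process;
end architecture balanced_sketch;
```

Both architectures have the same two-cycle latency. The balanced one also needs one register fewer.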


    Area optimization


    Area concepts

    The resource requirement is the number of basic functional primitives needed to implement the described functionality.

    The basic functional primitives in standard cell ASICs are the standard cells, which can be simple logic gates and flip-flops, but also more complex arithmetic-logic functions or memories.

    The basic logic element (BLE) of an FPGA consists of a logic function (the number of inputs depends on the vendor and the device family), a flip-flop and a multiplexer. There are special-purpose resources as well, such as memory blocks and signal processing elements (multipliers).


    Area optimization techniques

    Minimizing area: control-based logic reuse

    Control-based logic reuse can be considered the opposite of loop unrolling. A pipeline requires internal data storage resources and additional logic to implement parallel operation. These resources can be reused at the cost of reduced throughput.

    [Figure: parallel summation datapath. Labels: in1, in2, in3, in4, three adders, registers plr1, plr2 and acc (ce, clk, reset), zero; 32-bit buses.]

    [Figure: summation datapath with logic reuse. Labels: in1, in2, in3, in4, 4:1 multiplexer (sel 0 to 3), FSM (clk, reset) generating sel_input and ce_acc, one adder, register acc (ce, clk, reset), zero; 32-bit buses.]

    Control-based logic reuse requires an FSM to generate control signals.
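A minimal sketch of the reuse idea: four inputs are summed with a single adder over four clock cycles, steered by a two-bit counter acting as a very small FSM. The entity name, the done strobe, the synchronous reset and the internal signal names are assumptions, not taken from the slides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sum4_reuse is
  generic (W : positive := 32);              -- assumed data width
  port (
    clk, reset         : in  std_logic;      -- reset assumed synchronous
    in1, in2, in3, in4 : in  unsigned(W-1 downto 0);
    acc                : out unsigned(W-1 downto 0);
    done               : out std_logic       -- hypothetical: high when acc holds the full sum
  );
end entity sum4_reuse;

architecture fsm_sketch of sum4_reuse is
  signal sel     : unsigned(1 downto 0);     -- FSM state, doubles as the mux select
  signal operand : unsigned(W-1 downto 0);
  signal acc_r   : unsigned(W-1 downto 0);
begin
  -- 4:1 input multiplexer in front of the single adder
  with sel select
    operand <= in1 when "00",
               in2 when "01",
               in3 when "10",
               in4 when others;

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        sel   <= (others => '0');
        acc_r <= (others => '0');
        done  <= '0';
      else
        if sel = 0 then
          acc_r <= operand;                  -- first term: restart the sum
        else
          acc_r <= acc_r + operand;          -- remaining terms: accumulate
        end if;
        if sel = 3 then
          done <= '1';                       -- last term is being added
        else
          done <= '0';
        end if;
        sel <= sel + 1;                      -- wraps from 3 back to 0
      end if;
    end if;
  end process;

  acc <= acc_r;
end architecture fsm_sketch;
```

The inputs have to be held stable for four cycles, so the design produces one sum every four cycles instead of one per cycle, in exchange for using one adder instead of three.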


    Minimizing area: priority encoders

    The resource requirement can be reduced if mutual exclusion is exploited. Use the elsif statement only when a priority encoder is actually required, that is, when the conditions are not mutually exclusive.

    architecture priority of logic is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          if (ctrl(0) = '1') then
            output(0)


    [Figure: registers output_a, output_b, output_c and output_d (clk), each loaded from input[31:0] through a 2:1 multiplexer; the select logic of each register combines its own ctrl bit with all higher-priority ctrl bits (ctrl[0] alone, then [1] with [0], then [2] with [1] and [0], then [3] with [2], [1] and [0]); 32-bit data buses, 4-bit ctrl.]
    without exploiting mutual exclusion

    [Figure: the same four registers, with each select driven directly by a single ctrl bit (ctrl[0], [1], [2], [3]); 32-bit data buses, 4-bit ctrl.]
    with exploiting mutual exclusion
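The listing for this slide is truncated in the source. The sketch below contrasts the two coding styles using the signal names from the figure; the port widths and the "_sketch" architecture names are assumptions. The two architectures behave identically only if at most one bit of ctrl is set at a time.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity logic is
  port (
    clk      : in  std_logic;
    ctrl     : in  std_logic_vector(3 downto 0);
    input    : in  std_logic_vector(31 downto 0);
    output_a : out std_logic_vector(31 downto 0);
    output_b : out std_logic_vector(31 downto 0);
    output_c : out std_logic_vector(31 downto 0);
    output_d : out std_logic_vector(31 downto 0)
  );
end entity logic;

-- elsif chain: each enable depends on all higher-priority ctrl bits.
architecture priority_sketch of logic is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if ctrl(0) = '1' then
        output_a <= input;
      elsif ctrl(1) = '1' then
        output_b <= input;
      elsif ctrl(2) = '1' then
        output_c <= input;
      elsif ctrl(3) = '1' then
        output_d <= input;
      end if;
    end if;
  end process;
end architecture priority_sketch;

-- Independent if statements: each enable depends on one ctrl bit only.
architecture parallel_sketch of logic is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if ctrl(0) = '1' then
        output_a <= input;
      end if;
      if ctrl(1) = '1' then
        output_b <= input;
      end if;
      if ctrl(2) = '1' then
        output_c <= input;
      end if;
      if ctrl(3) = '1' then
        output_d <= input;
      end if;
    end if;
  end process;
end architecture parallel_sketch;
```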


    Minimizing area: considering technology primitives

    With an appropriate HDL coding style, more efficient logic synthesis can be achieved. Synthesis tool vendors usually provide coding guidelines to improve the resource requirement or the timing parameters of the design. The proposed coding style takes the unique characteristics of the technology primitives into consideration.

    utilizing block RAM modules in FPGAs: Block RAM modules do not have any reset inputs and their outputs are synchronous to a clock signal. Only HDL models with these properties can be implemented in block RAMs.
    utilizing high quality DSP units: The DSP slices in FPGAs have synchronous outputs. This restriction has to be taken into account when writing the HDL model.
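As an illustration of the DSP point, here is a minimal sketch of a multiplier with registered inputs and a registered output. All names (mult_reg, W, a_r, b_r, p) are hypothetical, and whether a given synthesis tool maps this onto a DSP slice depends on the vendor, the device family and the operand widths.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mult_reg is
  generic (W : positive := 18);              -- assumed operand width
  port (
    clk  : in  std_logic;
    a, b : in  signed(W-1 downto 0);
    p    : out signed(2*W-1 downto 0)
  );
end entity mult_reg;

architecture dsp_sketch of mult_reg is
  signal a_r, b_r : signed(W-1 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      a_r <= a;                              -- input registers
      b_r <= b;
      p   <= a_r * b_r;                      -- synchronous (registered) product
    end if;
  end process;
end architecture dsp_sketch;
```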


    architecture FFS of RAM is
    begin
      process (clk, reset)
      begin
        if (reset = '1') then
          content <= (others => (others=>'0'));
        elsif (rising_edge(clk)) then
          if (write = '1') then
            content(address)
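In contrast to the flip-flop based model above, a memory description that many synthesis tools can map to block RAM has no reset on the storage array and reads synchronously. The following is a minimal sketch under those assumptions: the entity and generic names, the port list beyond write, address and content, and the read-before-write behaviour are illustrative choices, and actual inference depends on the tool and device.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ram_sync is
  generic (
    ADDR_W : positive := 8;                  -- assumed address width
    DATA_W : positive := 32                  -- assumed data width
  );
  port (
    clk      : in  std_logic;
    write    : in  std_logic;
    address  : in  unsigned(ADDR_W-1 downto 0);
    data_in  : in  std_logic_vector(DATA_W-1 downto 0);
    data_out : out std_logic_vector(DATA_W-1 downto 0)
  );
end entity ram_sync;

architecture bram_sketch of ram_sync is
  type mem_t is array (0 to 2**ADDR_W - 1)
    of std_logic_vector(DATA_W-1 downto 0);
  signal content : mem_t;                    -- no reset on the array
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if write = '1' then
        content(to_integer(address)) <= data_in;
      end if;
      -- synchronous read: returns the old value on a simultaneous write
      data_out <= content(to_integer(address));
    end if;
  end process;
end architecture bram_sketch;
```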


    Additional readings

    Steve Kilts: Advanced FPGA Design: Architecture, Implementation, and Optimization
    David Money Harris, Sarah L. Harris: Digital Design and Computer Architecture
    Peter J. Ashenden: Digital Design: An Embedded Systems Approach Using VHDL
    M. Morris Mano, Charles R. Kime: Logic and Computer Design Fundamentals
    Pong P. Chu: RTL Hardware Design Using VHDL
    Peter Wilson: Design Recipes for FPGAs
