44
Architecture and Synthesis for Power-Efficient FPGAs Jason Cong University of California, Los Angeles [email protected] Partially supported by NSF Grants CCR-0096383, and CCR-0306682, and Altera under the California MICRO program UCLA UCLA Joint work with Deming Chen, Lei He, Fei Li, Yan Lin

Architecture and Synthesis for Power-Efficient FPGAscong/slides/edp04_lp_fpga_webversion.pdfArchitecture and Synthesis for Power-Efficient FPGAs Jason Cong University of California,

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

  • Architecture and Synthesis for Power-Efficient FPGAs

    Jason CongUniversity of California, Los Angeles

    [email protected]

    Partially supported by NSF Grants CCR-0096383, and CCR-0306682, and Altera under the California MICRO program

    UCLAUCLA

    Joint work with Deming Chen, Lei He, Fei Li, Yan Lin

  • Outline

    IntroductionUnderstanding Power Consumption in FPGAsArchitecture Evaluation and Power OptimizationLow Power SynthesisConclusions

  • Why? FPGA is Known to be Power Inefficient!

    FPGA consumes 50-100X more powerWhy do we care about power optimization for FPGAs ?!

    Source:[Zuchowski, et al, ICCAD02]

  • FPGA Advantages

    Short TAT (total turnaround time)No or very low NRE

  • ASICs Become Increasingly Expensive

    Traditional ASIC designs are facing rapid increase of NRE and mask-set costs at 90nm and below

    Source: EETimes

    7.512

    40

    60

    $0.0

    $0.5

    $1.0

    $1.5

    $2.0

    $2.5

    250nm 180nm 130nm 100nm

    Tot

    al C

    ost f

    or M

    ask

    Set (

    $M)

    0

    $10

    $20

    $30

    $40

    $50

    $60

    Cos

    t/Mas

    k ($

    K)

    Process (um) 2.0 … 0.8 0.6 0.35 0.25 0.18 0.13 0.10

    Single Mask cost ($K) 1.5 1.5 2.5 4.5 7.5 12 40 60

    # of Masks 12 12 12 16 20 26 30 34

    Mask Set cost ($K) 18 18 30 72 150 312 1,000 2,000

  • Our Research

    Power EfficientFPGAs

    Circuit Design

    Fabric Design

    System Design

    Synthesis Tools

  • Outline

    IntroductionUnderstanding Power Consumption in FPGAsArchitecture Evaluation and Power OptimizationLow Power SynthesisConclusions

  • FPGA Architecture

    Programmable IO

    KLUTInputs D FF

    Clock

    Out

    BLE# 1

    BLE# N

    NOutputs

    I Inputs

    Clock

    I

    N

    Programmable Logic

    Programmable Routing

  • BC-Netlist

    BC-NetlistGenerator

    Power Simulator

    Power

    BLIF

    Logic Optimization(SIS)

    Tech-Mapping (RASP)

    Timing-Driven Packing (TV-Pack)

    Placement & Routing (VPR)

    SLIF

    DelayArea

    Arch Spec

    BLIF

    Logic Optimization(SIS)

    Tech-Mapping (RASP)

    Timing-Driven Packing (TV-Pack)

    Placement & Routing (VPR)

    SLIF

    DelayArea

    Evaluation Framework – fpgaEva-LP

    fpgaEva-LP [Li, et al, FPGA’03]

  • BC-Netlist Generator

    Mapped Netlist Layout

    Buffer Extraction

    Netlist Generation for Logic Clusters

    Capacitance Extraction

    Delay Calculation

    BC-Netlist

    Back-annotation

  • Mixed-level Power Model – Overview

    Dynamic powerSwitching power Short-circuit power

    Related to signal transitions

    Functional switchGlitch

    Dynamic

    Interconnect & clock

    Macro-modelMacro-modelStatic

    Switch-level model

    Macro-model

    Logic Blockcomponents

    power sources

    Static PowerSub-threshold leakage Gate leakageReverse biased leakage

    Depending on the input vector

  • Cycle-Accurate Power Simulator

    Mixed-level Power Model

    Post-layout extracted delay & capacitance

    Random Vector Generation

    BC-Netlist

    Cycle Accurate Power Simulation with Glitch Analysis

    All cycles finished?

    No

    Power Values

    Yes∑ ∑∈ ∈

    +=activei idlej

    sacycle nEnEE

    )()(

  • Logic Block Power19%

    Interconnect Power59%

    Clock Power22%

    Power Breakdown

    Interconnect power is dominant

    Cluster Size = 12, LUT Size = 4

    Clock Power15%

    Interconnect Power45%

    Logic Block Power40%

    Cluster Size = 12, LUT Size = 6

  • Power Breakdown (cont’d)

    Leakage Power42%

    Dynamic Power58%

    Dynamic Power48%

    Leakage Power52%

    Leakage power becomes increasingly important (100nm)

    Cluster Size = 12, LUT Size = 4 Cluster Size = 12, LUT Size = 6

  • Outline

    IntroductionUnderstanding Power Consumption in FPGAsArchitecture Evaluation and Power Optimization

    Architecture Parameter SelectionDual-Vdd/Dual-Vt FPGA Architecture

    Low Power Synthesis with Dual-VddConclusion

  • Total Power along LUT and Cluster Size Changes

    1

    1.1

    1.2

    1.3

    1.4

    1.5

    1.6

    1.7

    1.8

    1.9

    2

    3 4 5 6 7

    LUT Size

    Tota

    l FPG

    A P

    ower

    (nor

    mal

    ized

    ge

    omet

    ric m

    ean)

    Cluster Size = 4Cluster Size = 6Cluster Size = 8Cluster Size = 10Cluster Size = 12

    Routing architecture: segmented wire with length of 4, and 50% tri-state buffers in routing switches

  • Routing Architecture Evaluation

  • Architecture of Low-power and High-performance

    0.78651.02680.88651.0502

    Cluster size 12,LUT size 4,

    Wire segment length 4,100% buffered routing

    switches

    High-performance

    (Et3)

    1.00800.89090.99040.9653

    Cluster size 10, LUT size 4,

    wire segment length 4,25% buffered routing switches

    Low-power(E3t)

    Et3E3tDelay (t)

    Energy (E)

    Best FPGA architectureApplications

    Arch. Parameter selection leads to 10% power/delay trade-offUniform FPGA fabrics provide limited power-performance tradeoffNeed to explore heterogeneous FPGA fabrics, e.g. dual-Vt and dual-Vdd fabrics

  • Outline

    IntroductionUnderstanding Power Consumption in FPGAsArchitecture Evaluation and Power Optimization

    Architecture Parameter SelectionDual-Vdd/Dual-Vt FPGA Architecture [Li, et al, FPGA’04]

    Low Power Synthesis with Dual-VddConclusion

  • Dual-Vdd LUT DesignDual-Vdd technique makes use of the timing slack to reduce power

    VddH devices on critical path performanceVddL devices on non-critical paths powerAssume uniform Vdd for one LUT

    Threshold voltage Vt should be adjusted carefullyfor different Vdd levels

    To compensate delay increaseTo avoid excessive leakage power increase

  • Vdd/Vt-Scaling for LUTsThree scaling schemes

    Constant-Vt scalingFixed-Vdd/Vt-ratio scalingConstant-leakage scaling

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    1.3v 1.0v 0.9v 0.8vVdd (V)

    Del

    ay (n

    s)

    constant Vtfixed-Vdd/Vt-ratioconstant leakage

    0

    1

    2

    3

    4

    5

    6

    7

    8

    9

    10

    1.3v 1.0v 0.9v 0.8v

    Vdd (V)

    Leak

    age

    Pow

    er (

    uW)

    constant Vtfixed-Vdd/Vt-ratioconstant leakage

    Constant-leakage scaling obtains a good tradeoffuseful for both single-Vddscaling and dual-Vdd design

  • Dual-Vt LUT DesignLUT is divided into two parts

    Part I: configuration cells high VtPart II: MUX tree and input buffers normal Vt (decided by constant-leakage Vdd-scaling)

    Configuration SRAM cellsContent remains unchanged after configurationRead/write delay is not related to FPGA performance

    Use high Vt ~40% of VddMaintain signal integrityReduce SRAM leakage by 15Xand LUT leakage by 2.4XIncrease configuration time by 13%

  • Pre-Defined Dual-Vt FabricPower saving

    11.6% for combinational circuits14.6% for sequential circuits

    12.4%0.180spla9.4%0.0927seq

    power savingpower (watt)

    11.6%Avg.

    14.7%0.256pdc9.4%0.0753misex311.6%0.059ex5p17.3%0.179ex101010.7%0.234des12.3%0.0536apex49.3%0.108apex28.5%0.0798alu4

    arch-SVDT (Dual Vt)

    arch-SVST (Single Vt)Circuit

    Table1 Combinational circuits

    14.0%0.0351tseng10.2%0.261s38484

    power savingpower (watt)

    14.6%Avg.

    11.7%0.307s3841713.4%0.0736s29819.2%0.190frisc16.3%0.140elliptic14.5%0.134dsip19.7%0.0391diffeq14.8%0.632clma12.3%0.148bigkey

    arch-SVDT (Dual Vt)

    arch-SVST(Single Vt)circuit

    Table2 Sequential circuits

  • Dual-Vdd FPGA FabricGranularity: logic block (i.e., cluster of LUTs)

    Smaller granularity => intuitively more power savingBut a larger implementation overhead

    Layout pattern: pre-defined dual-Vdd patternRow-based or interleaved patternRatio of VddL/VddH blocks is 2:1 (benchmark profiling)

    Interconnect uses uniform VddH

    L-block: VddL

    H-block: VddH

  • Simple Design Flow for Dual-Vdd FabricBased on traditional design flow, but with new steps

    Step I: LUT mapping (FlowMap) + P & Rassuming uniform VddH (using VPR)

    Step II: Dual-Vdd assignment based on sensitivity

    Setp III: Timing driven P & R considering pre-defined dual-Vdd pattern (modified VPR)

  • Comparison Between Vdd-Scaling and Dual-Vdd

    For high clock frequency, dual Vdd achieves ~6% total power saving (~18% logic power saving)For low clock frequency, single-Vdd scaling is betterStill a large gap between ideal dual-Vdd and real case

    Ideal dual-Vdd is the result without layout pattern constraint

    circuit: alu4

    0.03

    0.04

    0.05

    0.06

    0.07

    0.08

    0.09

    65 75 85 95 105 115 125

    Max. Clock Frequency (MHz)

    Power (watt)

    arch-SVDT (Vdd Scaling)

    arch-DVDT(ideal case)

    arch-DVDT(pre-defined Vdd)

    1.3v

    1.0v

    0.9v

    1.3v/0.8v

    1.0v/0.8v0.9v/0.8v

    1.5v

    1.5/1.0v

    1.3/1.0v

    1.0/0.9v

    1.5v/1.0v

    1.3/0.9v

  • Vdd-Programmable Logic BlockPower switches for Vdd selection and power gatingOne-bit control is needed for Vdd selection, but two-bit control power gating

  • Experimental Results with Vdd-Programmable Blocks

    Power v.s. performanceCircuit: alu4

    0.03

    0.04

    0.05

    0.06

    0.07

    0.08

    0.09

    65 75 85 95 105 115 125

    clock frequency (MHz)

    total power (watt)

    arch-SV (Vdd scaling)arch-DV (configurable Vdd)arch-DV (ideal case)arch-DV (pre-defined Vdd)

    1.3

    v

    1.0v

    1.5v/1.0v

    1.3v/0.8v

    1.0v/0.8v

    1.5v/1.0v

    1.3v/0.9v

    1.0v/0.8v

    1.5v/0.8v

    1.3v/0.8v

    1.0v/0.9v

    1.5v

    0.9v/0.8v

    1.0v/0.8v

    1.3v/0.8v

    1.5v/1.0v

  • Outline

    IntroductionUnderstanding Power Consumption in FPGAsArchitecture Evaluation and Power OptimizationLow Power SynthesisConclusions

  • Low Power Synthesis for Dual Vdd FPGAs

    FPGA architecture with dual-Vdds adds new layout constraints for synthesis toolsNovel synthesis tools are required to support the architecture

    Technology mapping [Chen, et al, FPGA’04]Circuit clustering [Chen, et al, ISLPED’04]

  • Technology Mapping for Low-Power FPGAs with Dual Vdds

    ac

    d

    yxz

    b

    w

    e

    fg

    Cut Enumeration:

    Topological Order from PIs to POs.

    Delay 1, Power 1

    Delay 2, Power 2

    Optimal Delay = 1

    Power = 1.5

    Optimal Delay = 2

    Power = 2.5

    Delay 2, Power 3.2

    Delay 2, Power 3.5

    Delay 2, Power 2.5

    Optimal Delay = 1

    Power = 1

    Optimal Delay = 1

    Power = 1

    Represent 1 case: single high Vdd case

  • Dual-Vdd Cases

    Consider:Converter delay & powerVddL LUT delay & powerVddH LUT delay & power

    a c

    d

    yxz

    b

    w

    eTargetLUT

    Cases Input LUT Target LUT Converter1 VddL VddL No2 VddL VddH Yes3 VddH VddL No4 VddH VddH No

    Input LUT

    Four extra cases for dual-Vddconsideration

    Produce these four cases for each cut and nodeMore tradeoff solution points Smaller power requires larger delaySmaller delay requires larger power

  • Low Vdd LUTHigh Vdd LUT

    Mapping Solution Generation

    From POs to PIsCritical path driven by VddHLUTNon-critical paths can be driven by VddL LUT, guided by low power

    ac

    d

    yxz

    b

    w

    e

    fg

  • Two Types of Required Times

    VddL VddH

    33.2

    R

    x y

    If R is using VddH:

    converter

    Req’d times

    Mapped LUTs

    1.7 = 2.0 - 0.3

    Critical path

    If R is using VddL:

    Critical path

    1.8 2.0 Req’d times propagated back

    Req’d time of R is 1.7

    Req’d time of R is 1.8

    To be mapped

    Each node maintains two req’d times:

    Propagated separately

    Interact with each other

  • Experimental Results

    - 2.10%- 1.29%0.56%- 4.04%

    Real powerEst'ed powerTotal edgesMapping area

    SVmap (Single high Vdd) compared to Emap [Lamoureux, ICCAD03]

    Mapping area considerably betterEstimated power very close to the real power reported after P&R

    - 9.44%- 10.72%- 11.63%v1.3 - v1.0v1.3 - v0.9v1.3 - v0.8v1.3

    DVmapSVmap

    DVmap (dual Vdd) compared to SVmap

    v1.3 as VddH and v0.8 as VddL is the best combination

  • Circuit Clustering with Dual VddsGiven:

    A mapped FPGA designAn FPGA architecture with Dual-

    Vdd configurable logic blocksGoal:

    Cluster the LUTs into logic blocksAssign voltages to the logic blocks

    such that the design hasOptimal delay Minimum power

    Constraints:Logic Block Inputs ≤ KLogic Block Size ≤ MLogic Block Outputs ≤ MLUT delay = dL or dHInter-block edge delay = D

    Input = 5Size = 3Output = 2

    LUT

    LUT LUT

    LUTLUT

    LUTLUT

  • Cluster Enumeration – An Example

    m n o p q

    r s

    t To get a cluster of size 6 on LUT tGet 1 node on r, 4 on s, then

    merge with t …., and

    Get 2 nodes on r, 3 on s …

    Common nodesPIs to POs

    Dynamic Programming

    Get 3 on r …

  • Solution Generation

    m n o p q

    r s

    t

    Cluster s1

    Cluster s2

    Solution propagation similar as [Vaishnav, ICCAD’99]

    Delay, power and voltage (form solution points) propagate through the clusters and nodes iteratively

    Try to get solutions for Cluster t1

    Get solutions for sGet solutions for r

  • Solution Curve on Node rGood solutions: Any two delay-power-vdd points

    (D1, P1, V1) and (D2, P2, V2)if D1 > D2, then P1 < P2if D1 < D2, then P1 > P2

    02468

    1012

    0 1 2 3 4 5 6 7

    H LH

    Lpower

    Delay

    H105L9.85.4H8.246.1L86.2

    VddPowerDelay

    Good delay-power-vdd points The corresponding solution curve

  • Solution Propagation

    02468

    1012

    0 1 2 3 4 5 6 7

    H LH Lp

    ower

    Delay

    02468

    1012

    0 1 2 3 4 5 6 7

    H LH Lp

    ower

    Delay

    Delay-power-vdd curve for r Delay-power-vdd curve for s

    Consider:Converter VddL LUTVddH LUTEdge delay

    All the good solutions are generated

    All the inferior solutions are pruned away0

    2

    4

    6

    8

    10

    12

    14

    16

    18

    0 1 2 3 4 5 6 7 8 9

    Delay

    power

    LL

    LLH

    H

    Delay-power-vdd curve for cluster t1

  • Two Theorems

    The algorithm gets the minimum number of solution points, W, optimally for each node

    W is upper bounded by Lwhere L = level(v) for node v

    The algorithm is delay and power optimal for trees and delay optimal for directed acyclic graphs (DAGs) with dual-Vdd FPGAs

    22 )1( +L

  • Experimental Results Summary

    - 18.4%- 19.5%- 20.3%v1.3 - v1.0v1.3 - v0.9v1.3 - v0.8v1.3

    Dual VddSingle Vdd

    Dual-Vdd Clustering results compared to the Single high-Vdd Clustering results

    v1.3 as high Vdd and v0.8 as low Vdd is the best combination among the three

  • Outline

    IntroductionUnderstanding Power Consumption in FPGAsArchitecture Evaluation and Power OptimizationLow Power SynthesisConclusions

  • ConclusionsFPGA power consumption

    Majority on programmable interconnectsLeakage is significant

    FPGA architecture optimization for powerArchitecture parameter tuning has a limited impactUsing high Vt for configuration SRAM cells is helpfulUsing programmable dual Vdd for logic blocks is helpful

    Power-efficient FPGA architectures introduce interesting CAD problems

    Dual-Vdd mappingDual-Vdd clustering

    Up to 20% power saving reported using these algorithms