CS250, UC Berkeley Fall ‘20 Lecture 09 CS250 VLSI Systems Design Fall 2020 John Wawrzynek with Arya Reais-Parsi


  • CS250 VLSI Systems Design, Fall 2020, Lecture 09

    John Wawrzynek with Arya Reais-Parsi

  • Project Update

    ‣ Six teams formed
    ‣ Assignments made based on your preferences
    ‣ All working towards one FPGA design:
      ‣ Hybrid array with fine-grained logic blocks, wide multiply/accumulate blocks, block RAMs
      ‣ Tool support, complete layout

    [Figure: interaction/coordination graph linking the Fabric, Interconnect, Configuration, CLB, MAC, and SRAM teams]

  • Project Teams

    ‣ Fabric: Jinyue Zhu, Philip, Tan, (Arya)
      ‣ High-level fabric architecture
      ‣ Clocks, power, metal layer assignments
      ‣ FPGA tool flow (Yosys/NextPnR or VTR)
      ‣ Test circuits/benchmarks
      ‣ Chip-level simulation and layout integration
    ‣ MAC: Ryan Lund, Anson
      ‣ Hard block design and implementation
      ‣ Multiply/add, ALU functions?
      ‣ Configurable data-path width

  • Project Teams

    ‣ SRAM: Rohan, Adhiraj
      ‣ Dense RAM block design and implementation
      ‣ Configurable width/depth
      ‣ “OpenRAM” (UCSC), first option
    ‣ CLB: Kareem, Ryan Thornton
      ‣ Consider several design alternatives (s44 is a good bet)
      ‣ Include carry logic
      ‣ Consider both standard cell and custom layout

  • Project Teams

    ‣ Interconnect: Yukio, Nate
      ‣ Two designs:
        ‣ Traditional “island style” with connection boxes and switch boxes. Perhaps “Wilton” switch box design.
        ‣ Novel (column oriented)
      ‣ Share layout pieces (programmable interconnection points)
    ‣ Configuration: Josh, Aled
      ‣ Programming interface, internal structure
      ‣ Granularity/mechanisms for partial reconfiguration

  • Project Team Presentations

    ‣ In class, Oct 1 (next Thursday)
    ‣ Target 10 minutes (with discussion)
    ‣ Slides with illustrations (PowerPoint, …)
    ‣ One presentation each from: config, CLB, SRAM, MAC teams
    ‣ Two interconnection presentations
    ‣ Three fabric team presentations:
      ‣ tool support
      ‣ high-level fabric architecture
      ‣ simulation, testing, integration plan
    ‣ Following week, private group meetings with Arya & John to get feedback and brainstorm ideas

  • Project Team Presentations

    ‣ The point is to get the discussion going on the function and implementation of your piece.
    ‣ You are responsible for a “straw-man”/draft proposal
    ‣ Okay to leave some issues open for now
    ‣ Outline:
      ‣ Only one person needs to speak, but introduce team members
      ‣ Describe your proposed function/features and structure (block diagram/circuit) of your piece
      ‣ Describe how you plan to refine the definitions of function/structure and to optimize the design
      ‣ Say something about implementation strategy
      ‣ Say something about what information you will need from other teams and what other teams will need from you

  • Project Teams

    ‣ If you have questions about how you ended up on your team, mail me or set up an appointment
    ‣ If you have questions about your team’s role and responsibility, ask now, or mail us later
    ‣ If you don’t have email contacts for your other team members, ask now, or mail us later
    ‣ To prepare for the presentations next week, it is not necessary right now to reach out to other groups, but feel free to do so

    [Figure: interaction/coordination graph linking the Fabric, Interconnect, Configuration, CLB, MAC, and SRAM teams]

  • Circuits Topics: Basic (review?)

    What you need to know as a VLSI Systems designer.

    ‣ Processing/devices: planar, FinFETs, GDR
    ‣ Device models: switch, RC, Vth
    ‣ Logic circuits: gates, muxes, transmission gates, FFs
    ‣ Circuit Delay: gate delay, wire delay, FET sizing
    ‣ Circuit Power: formulation/factors
    ‣ System Delay: factors, optimization
    ‣ System Power: factors, optimization

  • Logic Circuit

    ‣ Logic gates in transistors
    ‣ Transmission Gates
    ‣ Tri-state Buffers
    ‣ Multiplexor Circuits
    ‣ Latch/Flip-flop circuits
    ‣ SRAM circuits

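  • Latches and Flip-flops

    Positive level-sensitive latch: [Figure: latch symbol]

    Latch transistor level: [Figure: transistor-level latch schematic]

    Latch implementation: [Figure: latch built from transmission gates controlled by clk and clk’]

    Positive edge-triggered flip-flop built from two level-sensitive latches: [Figure: two cascaded latches clocked by clk’ and clk]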
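To make the two-latch construction concrete, here is a minimal behavioral sketch in Python. It assumes ideal, zero-delay latches. The class names `Latch` and `PosEdgeFlipFlop` are illustrative and not from the lecture; the model says nothing about the transistor-level circuit.

```python
class Latch:
    """Level-sensitive latch: transparent while enable is true, holds otherwise."""

    def __init__(self):
        self.q = 0

    def update(self, d, enable):
        if enable:
            self.q = d
        return self.q


class PosEdgeFlipFlop:
    """Two cascaded latches: the first is transparent while clk is low,
    the second while clk is high, so Q only changes on a rising edge."""

    def __init__(self):
        self.first = Latch()
        self.second = Latch()

    def update(self, d, clk):
        m = self.first.update(d, enable=not clk)
        return self.second.update(m, enable=bool(clk))


def simulate(stimulus):
    """Apply a list of (d, clk) pairs and return the Q value after each step."""
    ff = PosEdgeFlipFlop()
    return [ff.update(d, clk) for d, clk in stimulus]


if __name__ == "__main__":
    # d rises while clk is low, is captured on the rising edge, and a later
    # change of d while clk is high is ignored until the next rising edge.
    print(simulate([(1, 0), (1, 1), (0, 1), (0, 0), (0, 1)]))
```

The second latch isolates Q from D while the first latch is transparent, which is why the pair behaves as an edge-triggered element.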

  • SRAM Cell Array Details

    Most common is 6-transistor (6T) cell array.

    [Figure: array of 6T cells; each row shares a word line, and each column shares a pair of complementary bit lines]

    ‣ The word line selects this cell, and all others in its row.
    ‣ For a write operation, the column bit lines are driven differentially (0 on one, 1 on the other). The driven values overwrite the cell state.
    ‣ For a read operation, the column bit lines are equalized (set to the same voltage), then released. The cell pulls down one bit line or the other.

  • Generic Memory Block

    ‣ Word lines used to select a row for reading or writing
    ‣ Bit lines carry data to/from periphery
    ‣ Keep the core aspect ratio close to 1 to help balance delay on word line versus bit line
    ‣ Address bits are divided between the two decoders
    ‣ Row decoder used to select word line
    ‣ Column decoder used to select one or more columns for input/output of data

    Storage cell could be either static or dynamic
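The division of address bits between the row and column decoders can be sketched as below. This is only an illustration: sending the upper bits to the row decoder and the lower bits to the column decoder is an assumed convention, and `split_address` and `one_hot` are hypothetical helper names.

```python
def split_address(addr, row_bits, col_bits):
    """Split an address into (row, column) indices.

    Assumption: the upper row_bits feed the row decoder and the lower
    col_bits feed the column decoder.
    """
    if not 0 <= addr < (1 << (row_bits + col_bits)):
        raise ValueError("address out of range")
    row = addr >> col_bits
    col = addr & ((1 << col_bits) - 1)
    return row, col


def one_hot(index, width):
    """Decoder output: exactly one of `width` select lines is asserted."""
    return [1 if i == index else 0 for i in range(width)]


if __name__ == "__main__":
    # A 10-bit address split 5/5 gives a 32 x 32 core.
    row, col = split_address(0b1011000011, row_bits=5, col_bits=5)
    print(row, col)
    print(one_hot(row, 32))
```

Splitting the bits evenly is one way to keep the core close to square, in line with the aspect-ratio point above.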

  • Circuit Delay

    ‣ RC based gate delay
    ‣ Wire Delay
    ‣ Transistor Sizing

  • Transistors as Conductors

    ‣ Improved transistor model: [Figure: nFET and pFET models]

    • We refer to transistor "strength" as the amount of current that flows for a given Vds and Vgs.

    • The strength is linearly proportional to the ratio of W/L.

  • Gate Delay is the Result of Cascading

    • Cascaded gates: [Figure: chain of cascaded gates]

    [Figure: “transfer curve” for inverter]

  • Gate Delay Summary

    [Figure: propagation delay tp plotted against fanout f, with one line each for the inverter, 2-NAND, and 2-NOR]

    The y-intercepts for NAND and NOR are both twice that of the inverter. The NAND line has a gradient 4/3 that of the inverter (steeper); for NOR it is 5/3 (steepest).

    What about gates with more than 2-inputs?

    Look at 4-input NAND: [Figure: intercept and slope for the 4-input NAND]
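The three lines in the summary can be written as a small linear model. The sketch below normalizes both the intercept and the slope to the inverter's own values (an assumption made so the numbers are unitless) and covers only the three gates the slide quantifies. `GATE_LINES` and `gate_delay` are illustrative names.

```python
from fractions import Fraction

# (intercept, slope) of each tp-versus-fanout line, relative to the inverter.
GATE_LINES = {
    "inverter": (Fraction(1), Fraction(1)),
    "nand2": (Fraction(2), Fraction(4, 3)),
    "nor2": (Fraction(2), Fraction(5, 3)),
}


def gate_delay(gate, fanout):
    """Normalized delay: intercept + slope * fanout."""
    intercept, slope = GATE_LINES[gate]
    return intercept + slope * fanout


if __name__ == "__main__":
    for name in GATE_LINES:
        print(name, [float(gate_delay(name, f)) for f in (1, 2, 3, 4)])
```

At any fanout the ordering is inverter, then NAND, then NOR, which matches the "steeper" and "steepest" remarks above.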

  • Delay in Flip-flops

    • Setup time results from delay through first latch.

    • Clock to Q delay results from delay through second latch.

    [Figure: two-latch flip-flop schematics with clk and clk’ controls, highlighting the path through each latch]

  • Wire Delay

    ‣ Even in those cases where the transmission line effect is negligible:
      ‣ Wires possess distributed resistance and capacitance
      ‣ Time constant associated with distributed RC is proportional to the square of the length

    • For short wires on ICs, resistance is insignificant (relative to effective R of transistors), but C is important.
      – Typically around half of C of gate load is in the wires.

    • For long wires on ICs:
      – busses, clock lines, global control signals, etc.
      – Resistance is significant, therefore distributed RC effect dominates.
      – Signals are typically “rebuffered” to reduce delay:

    [Figure: long wire tapped at nodes v1, v2, v3, v4, with the voltage at each node plotted against time]

  • Gate Driving long wire and other gates

    $$t_p = 0.69 R_{dr} C_{int} + 0.69 R_{dr} C_w + 0.38 R_w C_w + 0.69 R_{dr} C_{fan} + 0.69 R_w C_{fan}$$

    $$t_p = 0.69 R_{dr} (C_{int} + C_{fan}) + 0.69 (R_{dr} c_w + r_w C_{fan}) L + 0.38 r_w c_w L^2$$

    where $R_w = r_w L$ and $C_w = c_w L$.
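As a numerical check on the expression above, the following sketch evaluates t_p as a function of wire length. The function and parameter names are illustrative, and any consistent unit system can be used.

```python
def driver_wire_delay(r_dr, c_int, c_fan, r_w_per_len, c_w_per_len, length):
    """Propagation delay of a gate driving a wire of the given length plus fanout.

    Uses R_w = r_w * L and C_w = c_w * L, then groups the terms into a
    constant part, a part linear in L, and a part quadratic in L.
    """
    constant = 0.69 * r_dr * (c_int + c_fan)
    linear = 0.69 * (r_dr * c_w_per_len + r_w_per_len * c_fan) * length
    quadratic = 0.38 * r_w_per_len * c_w_per_len * length ** 2
    return constant + linear + quadratic


if __name__ == "__main__":
    # With an ideal driver and no fanout load only the distributed-RC term
    # remains, so doubling the length quadruples the delay.
    d1 = driver_wire_delay(0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
    d2 = driver_wire_delay(0.0, 0.0, 0.0, 1.0, 1.0, 2.0)
    print(d1, d2, d2 / d1)
```

The quadratic term is what rebuffering attacks: splitting a wire into shorter segments reduces the sum of the squared segment lengths.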

  • Driving Large Loads

    ‣ Large fanout nets: clocks, resets, memory bit lines, off-chip
    ‣ Relatively small driver results in long rise time (and thus large gate delay)
    ‣ Strategy: staged buffers [Figure: chain of progressively larger buffers]
    ‣ How to optimally scale drivers?
    ‣ Optimal trade-off between delay per stage and total number of stages?
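One way to explore the stage-count trade-off is a brute-force search over the number of stages. The sketch below rests on a simplified delay model that is an assumption, not taken from the slide: every stage has the same fanout f = F^(1/N), and each stage costs f plus a fixed parasitic term, in normalized inverter delay units. `chain_delay` and `best_stage_count` are hypothetical names.

```python
def chain_delay(total_fanout, n_stages, parasitic=1.0):
    """Normalized delay of an n-stage buffer chain with equal per-stage fanout."""
    stage_fanout = total_fanout ** (1.0 / n_stages)
    return n_stages * (stage_fanout + parasitic)


def best_stage_count(total_fanout, parasitic=1.0, max_stages=20):
    """Number of stages that minimizes chain_delay, found by direct search."""
    return min(
        range(1, max_stages + 1),
        key=lambda n: chain_delay(total_fanout, n, parasitic),
    )


if __name__ == "__main__":
    load_ratio = 64  # load capacitance divided by the first stage's input capacitance
    for n in range(1, 7):
        print(n, round(chain_delay(load_ratio, n), 2))
    print("best:", best_stage_count(load_ratio))
```

Under this model a single huge stage is slow because of its fanout, and a very long chain is slow because of the per-stage overhead, so the minimum falls at a moderate per-stage fanout.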

  • Circuit Power

    ‣ Switching Energy/Power
    ‣ Short Circuit current
    ‣ Leakage current


  • Switching Energy: Fundamental Physics

    Every logic transition dissipates energy.

    [Figure: “Gate Switching Behavior” for an inverter and a NAND gate, from EECS150 Spring 2003, Lec10-Timing, page 10]

    $$E_{0 \to 1} = \frac{1}{2} C V_{dd}^2$$

    $$E_{1 \to 0} = \frac{1}{2} C V_{dd}^2$$

    Strong result: Independent of technology.

    How can we limit switching energy?

    (1) Reduce # of clock transitions. But we have work to do...

    (2) Reduce Vdd. But lowering Vdd limits the clock speed...

    (3) Fewer circuits. But more transistors can do more work.

    (4) Reduce C per node. One reason why we scale processes.

  • Chip-Level “Dynamic” Power

    $$P_{sw} = \frac{1}{2} \alpha C V_{dd}^2 F$$

    ‣ α: “activity factor”, average percentage of capacitance switching per cycle (~ number of nodes to switch)
    ‣ C: total chip capacitance to be switched
    ‣ F: clock frequency
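A short numerical sketch of the switching-power formula follows. The example values (activity factor, capacitance, supply, and frequency) are made up for illustration only.

```python
def switching_power(alpha, c_total, vdd, freq):
    """P_sw = 1/2 * alpha * C * Vdd^2 * F, in watts for SI inputs."""
    return 0.5 * alpha * c_total * vdd ** 2 * freq


if __name__ == "__main__":
    # Hypothetical chip: 10% activity, 1 nF switched capacitance, 1 V, 1 GHz.
    base = switching_power(alpha=0.1, c_total=1e-9, vdd=1.0, freq=1e9)
    half_vdd = switching_power(alpha=0.1, c_total=1e-9, vdd=0.5, freq=1e9)
    print(base, half_vdd, half_vdd / base)
```

Because Vdd enters squared, halving the supply alone cuts switching power to one quarter, before any change in frequency is counted.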

  • System Delay

    ‣ Critical Path
    ‣ Optimization Techniques
    ‣ Clock distribution

  • In General...

    For correct operation:

    $$T \ge \tau_{clk \to Q} + \tau_{CL} + \tau_{setup}$$

    for all paths.

    ‣ How do we enumerate all paths?
      – Any circuit input or register output to any register input or circuit output?

    • Note:
      – “setup time” for outputs is a function of what it connects to.
      – “clk-to-q” for circuit inputs depends on where it comes from.
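The "for all paths" condition means the clock period is set by the slowest path. A minimal sketch, assuming the combinational delay of every register-to-register path is already known and that all registers share the same clk-to-Q and setup times; the path names and numbers are hypothetical:

```python
def critical_path(path_delays, t_clk_to_q, t_setup):
    """Return (path name, minimum clock period) over a dict of path delays.

    The period must satisfy T >= t_clk_to_q + t_CL + t_setup for every path,
    so the path with the largest combinational delay sets the bound.
    """
    name = max(path_delays, key=path_delays.get)
    return name, t_clk_to_q + path_delays[name] + t_setup


if __name__ == "__main__":
    delays_ns = {"regA->regB": 6.5, "regB->regC": 3.0, "regC->regA": 4.2}
    name, period = critical_path(delays_ns, t_clk_to_q=0.4, t_setup=0.6)
    print(name, period, "ns")
```

Real static timing tools work on the circuit graph rather than an explicit path list, since the number of paths can grow very quickly with circuit size.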

  • Components of Path Delay

    1. # of levels of logic
    2. Internal cell delay
    3. wire delay
    4. cell input capacitance
    5. cell fanout
    6. cell output drive strength

    How do we optimize? Tackle “critical path”:

    ‣ Synthesis tools approximate path delay and attempt to optimize by rearranging logic network and choosing appropriately sized cells.
    ‣ “Logical Effort” method for hand sizing of transistors.
    ‣ Place and route tools attempt to minimize wire delay on critical paths.

  • Trees for optimization

    Linear chain, T = O(N):

    ((((((x0 + x1 ) + x2 ) + x3 ) + x4 ) + x5 ) + x6 ) + x7

    Balanced tree, T = O(log N):

    (( x0 + x1 ) + ( x2 + x3 )) + (( x4 + x5 ) + ( x6 + x7 ))

    ❑ What property of “+” are we exploiting?
    ❑ Other associative operators? Boolean operations? Division? Min/Max?

    Same number of operations (N-1)
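The chain and tree groupings can be compared directly in software. The sketch below counts levels of operators as a stand-in for delay, assuming every operator has the same delay; `chain_reduce` and `tree_reduce` are illustrative names.

```python
from operator import add


def chain_reduce(op, xs):
    """Left-to-right reduction; returns (result, depth in operator levels)."""
    acc, depth = xs[0], 0
    for x in xs[1:]:
        acc = op(acc, x)
        depth += 1
    return acc, depth


def tree_reduce(op, xs):
    """Pairwise (balanced tree) reduction; returns (result, depth)."""
    level, depth = list(xs), 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through to the next level
        level, depth = nxt, depth + 1
    return level[0], depth


if __name__ == "__main__":
    xs = list(range(8))
    print(chain_reduce(add, xs))
    print(tree_reduce(add, xs))
    print(tree_reduce(max, xs))
```

Both versions apply the operator N-1 times and agree on the result only because the operator is associative. Regrouping a non-associative operator such as division would change the answer.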

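  • Pipelining

    ‣ General principle: [Figure: combinational logic (CL) block between two registers]

      Assume T = 8 ns and T_FF (setup + clk→q) = 1 ns, so F = 1/(9 ns) = 111 MHz

    ‣ Cut the CL block into pieces (stages) and separate with registers: [Figure: CL block split into two stages with a register between them]

      Assume T1 = T2 = 4 ns, so F = 1/(4 ns + 1 ns) = 200 MHz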
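The frequency arithmetic for the unpipelined and two-stage cases can be checked with a few lines. This sketch assumes the clock period is set by the slowest stage plus the flip-flop overhead (setup + clk→q); `max_frequency_mhz` is an illustrative name.

```python
def max_frequency_mhz(stage_delays_ns, t_ff_ns):
    """Maximum clock frequency in MHz for the given stage delays.

    The period is the slowest stage delay plus the flip-flop overhead.
    """
    period_ns = max(stage_delays_ns) + t_ff_ns
    return 1e3 / period_ns


if __name__ == "__main__":
    print(max_frequency_mhz([8.0], t_ff_ns=1.0))        # single 8 ns block
    print(max_frequency_mhz([4.0, 4.0], t_ff_ns=1.0))   # two balanced 4 ns stages
    print(max_frequency_mhz([6.0, 2.0], t_ff_ns=1.0))   # unbalanced split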

  • System Power

    ‣ Chip/block level Power
    ‣ Optimization for power and energy efficiency
    ‣ Power distribution

  • Energy and Power

    Energy is the ability to do work (W). Power is rate of expending energy.

    $$P = \frac{dW}{dt}$$

    Energy Efficiency: energy per operation

    ‣ Handheld and portable (battery operated):
      ❑ Energy Efficiency - limits battery life
      ❑ Power - limited by heat
    ‣ Infrastructure and servers (connected to power grid):
      ❑ Energy Efficiency - dictates operation cost
      ❑ Power - heat removal contributes to TCO

    Remember: reducing power is easy - just slow down. Improving energy efficiency is difficult.

    Heat is a byproduct of computation. Heat dissipated is proportional to the energy used per unit time, P.

  • Five low-power design techniques

    Power-down idle transistors

    Parallelism and pipelining

    Slow down non-critical paths

    Clock gating

    Thermal management

  • And so, we can transform this:

    Block processes stereo audio. 1/2 of clocks for “left”, 1/2 for “right”.

    $$P \sim F \times V_{dd}^2 = 1 \times 1^2$$

    Into this:

    Top block processes “left”, bottom “right”.

    CV² power only:

    $$P \sim \#\text{blks} \times F \times V_{dd}^2 = 2 \times \frac{1}{2} \times \frac{1}{4} = \frac{1}{4}$$

    Gate delay roughly linear with Vdd.

    This magic trick brought to you by Cory Hall ...

    [Figure: “Active Power Reduction”, with panels “Multiple Supply Voltages” (labels: Slow, Fast, Slow; Low Supply Voltage, High Supply Voltage) and “Replicated Designs”]

    Replicated Designs           Freq   Vdd   Throughput   Power   Area   Pwr Den
    One logic block at Vdd       1      1     1            1       1      1
    Two logic blocks at Vdd/2    0.5    0.5   1            0.25    2      0.125
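The replicated-design numbers follow from the proportionality P ~ #blks × F × Vdd². The sketch below reproduces them with all quantities normalized to the single-block design. Like the slide, it counts CV² power only and assumes area scales with the number of blocks; `relative_power` is an illustrative name.

```python
def relative_power(n_blocks, freq, vdd):
    """Relative switching power: number of blocks * frequency * Vdd^2."""
    return n_blocks * freq * vdd ** 2


if __name__ == "__main__":
    single = relative_power(n_blocks=1, freq=1.0, vdd=1.0)
    replicated = relative_power(n_blocks=2, freq=0.5, vdd=0.5)
    area = 2.0  # two copies of the block
    print("power:", single, "->", replicated)
    print("power density:", single / 1.0, "->", replicated / area)
```

Throughput is unchanged because two blocks at half the clock rate process the same number of samples per second as one block at full rate.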

  • Cell (PS3 Chip): 1 CPU + 8 “SPUs”

    [Figure: Cell die photo labeled with the PowerPC core, the 512 KB L2 cache, and the 8 Synergistic Processing Units (SPUs)]

  • Add “sleep” transistors to logic...

    [Figure: “Circuit Techniques Reduce Source Drain Leakage”, showing the leakage reduction for three techniques: Body Bias, 2 - 10X; Stack Effect, 5 - 10X; Sleep Transistor, 2 - 1000X]

    Example: Floating point unit logic.

    When running fixed-point instructions, put logic “to sleep”.

    +++ When “asleep”, leakage power is dramatically reduced.

    --- Presence of sleep transistors slows down the clock rate when the logic block is in use.

  • Fact: Most logic on a chip is “too fast”

    ° A product that

    From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

    netlist. Of these, 121713 were top-level chip global nets, and 21711 were processor-core-level global nets. Against this model 3.5 million setup checks were performed in late mode at points where clock signals met data signals in latches or dynamic circuits. The total number of timing checks of all types performed in each chip run was 9.8 million. Depending on the configuration of the timing run and the mix of actual versus estimated design data, the amount of real memory required was in the range of 12 GB to 14 GB, with run times of about 5 to 6 hours to the start of timing-report generation on an RS/6000* Model S80 configured with 64 GB of real memory. Approximately half of this time was taken up by reading in the netlist, timing rules, and extracted RC networks, as well as building and initializing the internal data structures for the timing model. The actual static timing analysis typically took 2.5–3 hours. Generation of the entire complement of reports and analysis required an additional 5 to 6 hours to complete. A total of 1.9 GB of timing reports and analysis were generated from each chip timing run. This data was broken down, analyzed, and organized by processor core and GPS, individual unit, and, in the case of timing contracts, by unit and macro. This was one component of the 24-hour-turnaround time achieved for the chip-integration design cycle. Figure 26 shows the results of iterating this process: A histogram of the final nominal path delays obtained from static timing for the POWER4 processor.

    The POWER4 design includes LBIST and ABIST (Logic/Array Built-In Self-Test) capability to enable full-frequency ac testing of the logic and arrays. Such testing on pre-final POWER4 chips revealed that several circuit macros ran slower than predicted from static timing. The speed of the critical paths in these macros was increased in the final design. Typical fast ac LBIST laboratory test results measured on POWER4 after these paths were improved are shown in Figure 27.

    Summary

    The 174-million-transistor >1.3-GHz POWER4 chip, containing two microprocessor cores and an on-chip memory subsystem, is a large, complex, high-frequency chip designed by a multi-site design team. The performance and schedule goals set at the beginning of the project were met successfully. This paper describes the circuit and physical design of POWER4, emphasizing aspects that were important to the project’s success in the areas of design methodology, clock distribution, circuits, power, integration, and timing.

    [Figure 25: POWER4 timing flow. This process was iterated daily during the physical design phase to close timing. Notes: executed 2–3 months prior to tape-out; fully extracted data from routed designs; hierarchical extraction; custom logic handled separately (Dracula, Harmony); extraction done for early and late modes.]

    [Figure 26: Histogram of the POWER4 processor path delays. Horizontal axis: timing slack (ps), from −40 to 280. Vertical axis: late-mode timing checks (thousands), from 0 to 200. Annotation: “The critical path”.]

    Most wires have hundreds of picoseconds to spare.

  • Use several supply voltages on a chip...

    [Figure: “Active Power Reduction”, with panels “Multiple Supply Voltages” (labels: Slow, Fast, Slow; Low Supply Voltage, High Supply Voltage) and “Replicated Designs”]

    Why use multi-Vdd? We can reduce dynamic power by using a lower Vdd for logic off the critical path.

    What if we can’t do a multi-Vdd design? In a multi-Vt process, we can reduce leakage power on the off-critical-path logic by using high-Vth transistors.

  • Clock Gating Reduces Clock Load

    “Up to 70% power savings at the block level, for applicable circuits” (Synopsys data sheet)

  • Keep chip cool to minimize leakage

    From Xilinx WP285 (v1.0), February 14, 2008, www.xilinx.com, p. 7:

    Optimizing Designs for Power Consumption through Changes to the FPGA Environment

    To optimize the power consumption in any design, certain things can be done independently of the design contained within the FPGA. Knowing one's environment, e.g., operating temperature and core voltage, is therefore important.

    Temperature Control

    Controlling temperature not only helps with reliability, as described in the “Thermal Considerations and Reliability” section, but it also reduces static power. For example, a reduction in junction temperature from 100°C to 85°C reduces static power by ~20%, as shown previously in Figure 1 and with greater detail in Figure 3. The static power of Virtex-4 and Virtex-5 FPGAs is already reasonable. However, reducing it by another 20% is valuable because in some designs, the static power of the FPGA represents a sizeable portion (30-40%) of the total power budget. A reduction in junction temperature can be achieved by increased airflow and larger heat sinks. The reduction in junction temperature also has the added benefit of increasing reliability as shown in the “Thermal Considerations and Reliability” section.

    Static power is a function of die temperature (TJ), and TJ is a function of how much power the device is consuming, the thermal properties of that device, and its package. Consequently, the FPGA’s ability to transfer the resultant heat to the surrounding environment, via the component packaging, is very important. Heat flows out of the die from the top of the FPGA and into the package balls and PCB, so it is important to understand the system model (PCB, FPGAs, heat sinks, airflow, and other components in a system). See Figure 4.

    Figure 3: ICCINTQ vs. Junction Temperature with Increase Relative to 25°C

    [Plot: ICCINTQ leakage current (normalized to 25°C) against junction temperature (°C), from −40 to 140]

    Junction Temperature (TJ °C)    Normalized Static Power or ICCINTQ (Typical)
    25                              1.00
    50                              1.46
    85                              2.50
    100                             3.14


    A recipe for thermal runaway
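The feedback loop behind the runaway remark (leakage rises with junction temperature, and junction temperature rises with power) can be sketched as a fixed-point iteration. The leakage factors are the typical values listed with Figure 3 (1.00, 1.46, 2.50, 3.14 at 25, 50, 85, 100 °C). Everything else is an assumption for illustration: the linear interpolation and extrapolation, the single thermal resistance `theta_ja`, the 150 °C cutoff, and all example numbers.

```python
# Typical normalized static power versus junction temperature (Figure 3 values).
LEAKAGE_TABLE = [(25.0, 1.00), (50.0, 1.46), (85.0, 2.50), (100.0, 3.14)]


def leakage_factor(tj):
    """Piecewise-linear interpolation of the table (assumed); values outside
    the table range extend the nearest segment."""
    for (t0, f0), (t1, f1) in zip(LEAKAGE_TABLE, LEAKAGE_TABLE[1:]):
        if tj <= t1:
            break
    return f0 + (f1 - f0) * (tj - t0) / (t1 - t0)


def settle_junction_temp(t_ambient, theta_ja, p_dynamic, p_static_25,
                         t_limit=150.0, max_iter=1000):
    """Iterate TJ = T_ambient + theta_ja * (P_dynamic + P_static(TJ)).

    Returns the settled junction temperature in deg C, or None if the
    temperature exceeds t_limit (treated here as thermal runaway).
    """
    tj = t_ambient
    for _ in range(max_iter):
        power = p_dynamic + p_static_25 * leakage_factor(tj)
        new_tj = t_ambient + theta_ja * power
        if new_tj > t_limit:
            return None
        if abs(new_tj - tj) < 1e-6:
            return new_tj
        tj = new_tj
    return tj


if __name__ == "__main__":
    # Hypothetical system with good cooling (2 deg C per watt): settles.
    print(settle_junction_temp(25.0, theta_ja=2.0, p_dynamic=5.0, p_static_25=2.0))
    # Same idea with poor cooling (20 deg C per watt): runs away.
    print(settle_junction_temp(25.0, theta_ja=20.0, p_dynamic=2.0, p_static_25=2.0))
```

The loop settles only when each extra degree adds less than one degree's worth of extra heating. Better heat sinking lowers `theta_ja`, which is how it keeps the system on the stable side.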

  • Circuits Topics: Advanced

    ‣ Clocks and clocking:
      ‣ clock drivers and distribution
      ‣ skew effects
      ‣ hold-time
      ‣ clock domains and synchronization
      ‣ Phase-locked Loops (PLL) / Delay-locked Loops (DLL)
      ‣ Globally Asynchronous Locally Synchronous (GALS) clocking
    ‣ Power supply and use
      ‣ Power distribution and decoupling capacitors
      ‣ Dynamic Voltage and Frequency Scaling (DVFS)
      ‣ voltage regulators
      ‣ device stacking, power gating, clock gating, multi-threshold
      ‣ Multi-voltage systems
      ‣ charge pumps
      ‣ latch-up / well plugs
    ‣ Input/Output
      ‣ Electrostatic Discharge (ESD) suppression / pad-drivers
      ‣ High-speed I/O, Serializer/Deserializer (SerDes)
      ‣ packaging

  • End of Lecture 9