CS250, UC Berkeley Fall ‘20 Lecture 04, Reconfigurable Architectures 2 CS250 VLSI Systems Design Fall 2020 John Wawrzynek with Arya Reais-Parsi


  • CS250 VLSI Systems Design

    Fall 2020

    John Wawrzynek with Arya Reais-Parsi

  • CS250, UC Berkeley Fall ‘20, Lecture 05, Reconfigurable Architecture 3

    Reconfigurable Fabric Architecture: Degrees of Freedom

    1. Logic blocks: capacity and internal structure of combinational logic circuits and state element(s); clustering and internal interconnect.
    2. Interconnection network architecture: circuit-switched, not packet-switched; topology of the network.
    3. Configuration architecture: how programming information is loaded and distributed; configuration “depth”.
    4. Hard blocks (RAM, ALUs, processor cores, …): function(s), count, and how they are integrated into the fabric.

  • Xilinx Virtex-5

    [Figure: Xilinx Virtex-5 die map. Colors represent different types of resources: logic, block RAM, DSP (ALUs), clocking, I/O, serial I/O + PCI. Slide source: Spring 2013 EECS150, Lec02-SDS-FPGAs.]

    A routing fabric runs throughout the chip to wire everything together.

  • Interconnection Topologies

    ‣ Traditional island style.
    ‣ From Flex Logix, a Clos network: “uses about half the area of the traditional interconnect and uses only 5-7 metal routing layers”

  • Fat-Tree Based Interconnect

    ‣ Use “Rent’s rule” for proper thickness.

    [Figure label: start-up]

    Lessons: 1) for efficiency, need to “flatten” lower levels of the tree; 2) the critical path might be long.
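
Rent's rule relates the number of external terminals T of a region to the number of blocks g inside it, T = t · g^p. As a rough sketch of how it could set the "thickness" of fat-tree links, the code below computes a link width for each tree level. The values t = 4 and p = 0.65, the binary tree, and the function names are illustrative assumptions, not figures from the lecture.

```python
import math


def rent_terminals(num_blocks: int, t: float = 4.0, p: float = 0.65) -> float:
    """Rent's rule: T = t * g**p terminals for a region holding g blocks."""
    return t * num_blocks ** p


def fat_tree_widths(levels: int, arity: int = 2, t: float = 4.0, p: float = 0.65) -> list[int]:
    """Upward link width of a subtree at each level (level 0 is one logic block)."""
    return [math.ceil(rent_terminals(arity ** level, t, p)) for level in range(levels + 1)]


if __name__ == "__main__":
    for level, width in enumerate(fat_tree_widths(6)):
        print(f"level {level}: {2 ** level:3d} blocks below, {width} wires up")
```

With p < 1 the links grow more slowly than the number of blocks beneath them, which is why a fat tree can be thinner near the root than a full crossbar would be.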

  • Embedded Hard Blocks

    ‣ Many important functions are not efficient when implemented in the reconfigurable fabric:
      ‣ multiplication, large memory, processor cores, …
    ‣ Dedicated blocks take relatively little area, so it is acceptable for them to go unused.


  • Virtex DSP48E Slice

    Efficient implementation of multiply, add, and bit-wise logical operations.

  • Block RAM Overview

    ❑ 36K bits of data total, can be configured as:
      ▪ 2 independent 18Kb RAMs, or one 36Kb RAM.
    ❑ Each 36Kb block RAM can be configured as:
      ▪ 64Kx1 (when cascaded with an adjacent 36Kb block RAM), 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, or 1Kx36 memory.
    ❑ Each 18Kb block RAM can be configured as:
      ▪ 16Kx1, 8Kx2, 4Kx4, 2Kx9, or 1Kx18 memory.
    ❑ Write and read are synchronous operations.
    ❑ The two ports are symmetrical and totally independent (can have different clocks), sharing only the stored data.
    ❑ Each port can be configured in one of the available widths, independent of the other port. The read port width can be different from the write port width for each port.
    ❑ The memory content can be initialized or cleared by the configuration bitstream.

  • UltraRAM Blocks

    An UltraRAM block is a dual-port synchronous 288Kb RAM with a fixed configuration of 4,096 deep and 72 bits wide.
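
The depth-by-width figures on the two memory slides can be checked with a little arithmetic. The sketch below tabulates the capacity each listed configuration actually addresses; the constant and function names are made up for illustration, and "K" is assumed to mean 1024. It shows that the x1, x2, and x4 modes reach only 32 Kb of a 36Kb block (16 Kb of an 18Kb block), while the x9, x18, and x36 modes reach all of it. Xilinx documents the extra bits as parity bits, a detail that is not on the slide.

```python
KBIT = 1024  # assumed meaning of "K" in the slide's depth figures

# (depth, width) pairs copied from the block RAM and UltraRAM slides.
BRAM36_CONFIGS = [(32 * KBIT, 1), (16 * KBIT, 2), (8 * KBIT, 4),
                  (4 * KBIT, 9), (2 * KBIT, 18), (1 * KBIT, 36)]
BRAM18_CONFIGS = [(16 * KBIT, 1), (8 * KBIT, 2), (4 * KBIT, 4),
                  (2 * KBIT, 9), (1 * KBIT, 18)]
ULTRARAM_CONFIG = (4096, 72)


def capacity_kbit(depth: int, width: int) -> float:
    """Addressable capacity of one configuration, in Kb."""
    return depth * width / KBIT


if __name__ == "__main__":
    for depth, width in BRAM36_CONFIGS:
        print(f"{depth // KBIT:2d}K x {width:2d} -> {capacity_kbit(depth, width):.0f} Kb")
    print("UltraRAM ->", capacity_kbit(*ULTRARAM_CONFIG), "Kb")
```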

  • First Commercial Hybrid FPGA

    ‣ Xilinx Virtex-II Pro
    ‣ January 2010
    ‣ 150nm process

  • State-of-the-Art Xilinx FPGAs

    Virtex UltraScale

  • Configuration Architecture

    ‣ How are the programming bits loaded and distributed?
    ‣ Configuration depth (number of stored on-chip configurations).
    ‣ The same interface can often provide read-back to save state or debug.
    ‣ Design challenge:
      ‣ Configurations are very large (100s of Mbits).
      ‣ Moving many bits over the chip interface requires time and energy.

    Many commercial FPGAs also have an internal reconfiguration controller that allows dynamic self-reconfiguration.
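
To get a feel for the "time" half of the design challenge, here is a back-of-the-envelope sketch of full-bitstream load time as a function of configuration port width and clock. The 200 Mbit bitstream, the 100 MHz clock, and the port widths are assumed example numbers, not data for any particular device.

```python
def config_load_time_s(bitstream_bits: float, port_width_bits: int, clock_hz: float) -> float:
    """Time to shift a bitstream through a port moving port_width_bits per cycle."""
    return bitstream_bits / (port_width_bits * clock_hz)


if __name__ == "__main__":
    bitstream = 200e6  # assumed: a configuration in the "100s of Mbits" range
    clock = 100e6      # assumed configuration clock
    for width in (1, 8, 32):
        ms = config_load_time_s(bitstream, width, clock) * 1e3
        print(f"{width:2d}-bit port: {ms:.1f} ms")
```

Even the wide port in this example takes tens of milliseconds, which is one reason partial reconfiguration and multiple on-chip contexts are attractive.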

  • Internal Reconfiguration

    ‣ Traditionally, long shift chains:
      ‣ slow, relatively energy efficient
    ‣ “Random access” structures have been tried:
      ‣ permits fine-grain partial reconfiguration

    [Figure labels: connections to logic blocks, programmable interconnection points, …]

  • Xilinx Configuration Layout

    ‣ A “frame” is the unit of reconfiguration.
    ‣ Frames are serially loaded into the chip.
    ‣ Permits “partial reconfiguration”.
    ‣ The XC6000 took this to the limit with word-level configuration granularity.

  • Multi-context FPGAs

    ‣ Rapid dynamic reconfiguration possible.
    ‣ What’s the execution and programming model?

    [Figure labels: “3-D FPGA”; “Garp: a MIPS processor with a reconfigurable coprocessor”, published 1997, Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.]
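
One way to picture a multi-context device is a lookup table that stores several truth tables on chip and selects among them with a context index, so "reconfiguring" means changing an index rather than reloading bits. The class below is a toy model of that idea only; the name `MultiContextLUT` and its interface are hypothetical and do not describe any specific product.

```python
class MultiContextLUT:
    """Toy k-input LUT holding several on-chip configurations (contexts)."""

    def __init__(self, num_inputs: int, truth_tables: list[list[int]]):
        size = 1 << num_inputs
        for table in truth_tables:
            if len(table) != size:
                raise ValueError("each truth table needs 2**num_inputs entries")
        self.num_inputs = num_inputs
        self.contexts = [list(table) for table in truth_tables]
        self.active = 0

    def switch(self, context: int) -> None:
        """Select another stored configuration; no bits are reloaded."""
        if not 0 <= context < len(self.contexts):
            raise IndexError("no such context")
        self.active = context

    def evaluate(self, *inputs: int) -> int:
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in inputs:  # first input is the most significant address bit
            index = (index << 1) | (bit & 1)
        return self.contexts[self.active][index]


if __name__ == "__main__":
    lut = MultiContextLUT(2, [[0, 0, 0, 1], [0, 1, 1, 0]])  # context 0: AND, context 1: XOR
    print(lut.evaluate(1, 1))  # AND
    lut.switch(1)
    print(lut.evaluate(1, 1))  # XOR
```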

  • Brings us to “reconfigurable computing”

    ■ What is it?
    ■ Standard definition: computing via a post-fabrication and spatially programmed connection of processing elements.
      ‣ ASIC implementations are excluded: not post-fabrication programmable.
      ‣ An FPGA implementation of a processor core to run a program is excluded: not a direct spatial mapping of the problem.
    ■ Does this include arrays of processors?
    ■ This definition restricts RC to mapping to “fine-grained” devices (such as FPGAs); however, many of the same principles apply to arrays of processors.

  • Spatial Computation

    ■ Example:

        grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project;

    ■ A hardware resource is allocated and dedicated for each operator (in this case, multiplier or adder) in the compute graph.

    [Figure: compute graph. Four multipliers (0.2 × mt1, 0.2 × mt2, 0.2 × mt3, 0.4 × proj) feed a tree of three adders that produces grade.]
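
The spatial mapping can be sketched in software by giving every operator in the graph its own named value, which mirrors one dedicated multiplier or adder per node. This is only an illustration of the structure, with hypothetical names; real hardware would evaluate all four products in the same cycle.

```python
def grade_spatial(mt1: float, mt2: float, mt3: float, proj: float) -> float:
    # Level 1: four dedicated multipliers, all independent of each other.
    p1 = 0.2 * mt1
    p2 = 0.2 * mt2
    p3 = 0.2 * mt3
    p4 = 0.4 * proj
    # Level 2: two dedicated adders.
    s1 = p1 + p2
    s2 = p3 + p4
    # Level 3: one final adder.
    return s1 + s2


# Resource cost of the fully spatial mapping: one unit per graph node.
OPERATORS = {"mul": 4, "add": 3}
GRAPH_DEPTH = 3  # operator levels on the longest input-to-output path

if __name__ == "__main__":
    print(grade_spatial(80, 90, 70, 100))
```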

  • Temporal Computation

    ■ A hardware resource is time-multiplexed to implement the actions of the operators in the compute graph.
    ■ Typical in a sequential processor/software solution, but also possible in reconfigurable logic.
    ■ In reconfigurable logic it might be necessary to serialize a computation because of:
      ■ limited chip resources
      ■ limited I/O bandwidth

        acc1 = mt1 + mt2;
        acc1 = acc1 + mt3;
        acc1 = 0.2 x acc1;
        acc2 = 0.4 x proj;
        grade = acc1 + acc2;

    [Figure: abstract computation graph (inputs mt1, mt2, mt3, proj; constants 0.2 and 0.4; output grade) next to its implementation, a controller driving a single ALU with registers acc1 and acc2.]

    Reconfigurable computing permits the full range of spatial, temporal, and mixed computing solutions to best match the implementation to task specifics and available hardware.
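
The five-step schedule on the slide can be mimicked by a tiny interpreter in which one shared ALU executes one operation per step under the direction of a controller (here, a list of micro-operations). The encoding of the steps and the register names `c02` and `c04` for the constants are assumptions made for this sketch.

```python
import operator

# One entry per controller step: (destination, operation, source A, source B).
PROGRAM = [
    ("acc1", "add", "mt1", "mt2"),
    ("acc1", "add", "acc1", "mt3"),
    ("acc1", "mul", "c02", "acc1"),
    ("acc2", "mul", "c04", "proj"),
    ("grade", "add", "acc1", "acc2"),
]


def run_temporal(inputs: dict[str, float]) -> tuple[float, int]:
    """Execute PROGRAM on a single time-multiplexed ALU; return (grade, steps)."""
    regs = dict(inputs, c02=0.2, c04=0.4)
    alu = {"add": operator.add, "mul": operator.mul}
    steps = 0
    for dest, op, a, b in PROGRAM:
        regs[dest] = alu[op](regs[a], regs[b])  # the one ALU is reused every step
        steps += 1
    return regs["grade"], steps


if __name__ == "__main__":
    print(run_temporal({"mt1": 80, "mt2": 90, "mt3": 70, "proj": 100}))
```

The serialized version needs one ALU and five steps, and factoring out the common 0.2 coefficient reduces the work to two multiplies and three adds.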

  • RC, Processors, & ASIC

  • RC Strategy

    1. Exploit cases where an operation can be bound and then reused a large number of times.
    2. Customize for operator type, width, and interconnect.
    3. Low-overhead exploitation of application parallelism.

  • Hybrid Approach

    ■ 90/10 rule:
      ■ 90 percent of the program runtime is consumed by 10 percent of the code (inner loops).
      ■ Only small portions of an application become the performance bottlenecks.
      ■ Usually, these portions of code are data-processing intensive with relatively fixed dataflow patterns (little control).
      ■ The other 90 percent of the code is not performance critical.

    ⇒ Hybrid processor-core reconfigurable-array
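
Amdahl's law gives a quick estimate of what accelerating only the hot 10 percent of the code can buy. The sketch below assumes the accelerated kernels account for 90 percent of the runtime, as in the 90/10 rule; the kernel speedup values are arbitrary examples.

```python
def overall_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when a runtime fraction is sped up."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)


if __name__ == "__main__":
    for s in (2, 10, 100):
        print(f"kernel speedup {s:3d}x -> overall {overall_speedup(0.9, s):.2f}x")
```

However fast the array runs the kernels, the overall gain is bounded by 10x in this setting, because the remaining 10 percent of the runtime still executes on the processor core.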

  • Garp: Hybrid Processor

    Function               Speedup
    strlen (len 16)        1.77
    strlen (len 1024)      14
    sort                   2.1
    image median filter    26.9
    DES (ECB mode)         19.6
    image dithering        16.3

    Speedups over a 4-way superscalar UltraSparc on the same process and comparable die size and memory system.

    “Garp: A MIPS Processor with a Reconfigurable Coprocessor”, in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM ‘97, April 16-18, 1997).

    • Pre-generated circuits for common program kernels are cached within the reconfigurable array and used to accelerate MIPS programs.
    • nSec configuration swap time.
    • Speedup is tied to a single execution thread.

  • GarpCC (T. Callahan)

    Kernels from wavelet image compression. Speedups relative to MIPS processor only.

    Kernel               raw    net
    forward_wavelet_1    2      1.9
    forward_wavelet_2    4.1    3.6
    init_image           6.4    6.4
    forward_wavelet_3    4.1    3.6
    forward_wavelet_4    5.2    4.1
    entropy_encode_1     4      4
    block_quantize       2.8    2.6
    RLE_encode           5.8    3.4
    entropy_encode_2     2.9    1.5

    Compilation time < 2x processor-only compilation time.

    “The Garp Architecture and C Compiler”, IEEE Computer, April 2000.

  • Advantages of RC over Processor Core

    ■ Conventional processors have several sources of inefficiency:
      ■ Heavy time-multiplexing of function units (ALUs).
      ■ Instruction issue overhead.
      ■ Memory hierarchy to deal with memory latency.
      ■ Operator mismatch.

    [Figure label: peak (raw) performance]

  • Advantages of RC

    ■ Relative to microprocessors, on average a higher percentage of peak (or raw) computational density is achieved with reconfigurable devices:
      ■ Fine-grain flexibility leads to exploitation of problem-specific parallelism at many levels.
      ■ Also, many different computation models (or patterns) can be supported. In general, it is possible to match problem characteristics to hardware through the use of problem-specific architectures and low-level circuit specialization.
      ■ Spatial mapping of computation, versus multiplexing of function units (as in processors), relieves pressure on memory capacity and bandwidth and promotes local communication patterns.

  • Advantages of RC

    ■ Modern FPGAs make good system-level components:
      ■ Relatively large number of I/Os (many parallel memory ports). High-BW communications.
      ■ Built-in microprocessors and other blocks.
    ■ Machines based on these components can easily scale peak performance by riding Moore’s curve (FPGAs are process drivers).
    ■ Low-level redundancy could permit defect tolerance and great cost savings.
    ■ Is there still room for research in novel devices for RC?

  • FPGAs are Reconfigurable

    Seemingly obvious point, but …

    1. Volume/cost graphs don’t accurately capture the potential real costs and other advantages.
    2. Commercial applications have not taken advantage of reconfigurability.
      • Xilinx/Altera (Intel) haven’t done much to help.
      • Methodologies/tools are nearly nonexistent.

    Reconfiguration uses:

    ‣ Field upgrades ⇒ product life extension, changing requirements.
    ‣ In-system board-level testing and field diagnostics.
    ‣ Tolerance to faults.
    ‣ Risk management in system development.
    ‣ Runtime reconfiguration ⇒ higher silicon efficiency.
      ‣ Time-multiplexed pre-designed circuits make maximum use of resources.
      ‣ Runtime specialized circuit generation.

  • Multi-modal Computing Tasks

    A premier application for reconfigurable devices is one with constrained size/weight that needs multiple functions at near-ASIC performance.

    ■ Mini/micro-UAVs
      ■ One piece of silicon for all of sensor processing, navigation, communications, planning, logging, etc.
      ■ At different times, different tasks take priority and consume a higher percentage of resources.
    ■ Other example: hand-held multi-function device with GPS, smart image capture/analysis, communications.

    Multiple ASICs are too expensive/big. A processor is too slow. Fine-grained reconfigurable devices have the flexibility to efficiently match task parallelism over a wide variety of tasks, deployed as needed and reconfigured as needed.

    [Figure label: Mars rover]

  • Sounds great, what’s the catch?

    ■ Lack of a programming model with convenient and effective tools.
    ■ Most successful computing applications using reconfigurable devices involve substantial “hand mapping”. Essentially circuit design.
    ■ Complex issue, but perhaps changing the fabric design can help.

  • Fine-grained Reconfigurable Fabrics

    Homogeneous fine-grained arrays are maximally flexible:

    a. Admit a wide variety of computational architecture models: arrays of processors, hybrid approaches, hard-wired dataflow, systolic processing, vector processing, etc.
    b. Admit a wide variety of parallelism modes: SIMD, MIMD, bit-level, etc. Resources can be deployed to lower latency when required for tight feedback loops (not possible with many parallel architectures that optimize for throughput).
    c. Support many compilation/resource management models: statically compiled, dynamically mapped.

    Safe bet as a future standard device.

  • Rapid Runtime Reconfiguration

    ■ Might permit even higher efficiency through hardware sharing (multiplexing) and on-the-fly circuit specialization.
    ■ Largely unexploited (unproven) to date.
    ■ A few research projects have explored this idea.
    ■ Need to be careful: multiplexing adds cost.
    ■ Remember the “Binding Time Principle”: the earlier the “instruction” is bound, the less area and delay required for the implementation.

  • Rapid Runtime Reconfiguration

    Why dynamic reconfiguration?

    1. Time-multiplexing resources allows more efficient use of silicon (in ways ASICs typically do not):
      a. Low duty cycle or “off critical path” computations time-share the fabric while the critical path stays mapped in.

    [Figure: total runtime versus amount of reconfigurable fabric, marking the size of maximum efficiency.]

  • Rapid Runtime Reconfiguration

      b. Coarse data-dependent control flow maps in only the useful dataflow.
      c. Allowable task footprint may change as other tasks come and go or faults occur.

    Fabric virtualization allows automatic migration up and down in device sizes and eases application development.

    [Figure label: if-then-else]

  • Rapid Runtime Reconfiguration

    2. Runtime circuit specialization:
      • Example: fixed-coefficient multipliers in an adaptive filter changing value at a low rate.
      • Aggressive constant propagation (based perhaps on runtime profiling) reduces circuit size and delay.
      • Could use “branch/value/range prediction” to map the most common case and fault in exceptional cases.
      • Can be template based (“fill in the blanks”), but better if we put PPR in the runtime loop!
      • Array HW-assisted place and route may make it possible.
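
As a small software analogy for the fixed-coefficient multiplier example, the sketch below "specializes" a multiplier for a known non-negative integer coefficient by propagating the constant into a fixed set of shifts and adds, one per set bit. In hardware this replaces a general multiplier with a few wired shifts and adders. The function name and the restriction to non-negative integers are assumptions of this sketch.

```python
def specialize_multiplier(coefficient: int):
    """Return (multiply, shifts) where multiply(x) == coefficient * x.

    Only the set bits of the constant contribute a shifted copy of x,
    so the generated "circuit" needs len(shifts) - 1 adders.
    """
    if coefficient < 0:
        raise ValueError("sketch handles non-negative coefficients only")
    shifts = [bit for bit in range(coefficient.bit_length()) if (coefficient >> bit) & 1]

    def multiply(x: int) -> int:
        return sum(x << s for s in shifts)

    return multiply, shifts


if __name__ == "__main__":
    times10, shifts = specialize_multiplier(10)  # 10 = 0b1010
    print(shifts, times10(7))
```

When the adaptive filter updates a coefficient, the specialization is simply regenerated, which is affordable only if coefficients change at a low rate relative to the data.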

  • SCORE: Virtualized Fabric Model

    High silicon efficiency:

    ♦ Only active parts of the data-flow consume resources.
    ♦ The high duty cycle critical path of the computation stays mapped, and the remaining resources are shared by lower duty cycle paths.
    ♦ Particularly effective for multi-tasking environments with time-varying task requirements.
    ♦ Fabric virtualization with demand paging:
      • Get the most out of available resources by automatically time-multiplexing.
      • Automatic migration up and down in device sizes.
      • Eases application development.

    [Figure label: if-else]

  • SCORE: Stream Computations Organized for Reconfigurable Execution

    ■ A computation model for reconfigurable systems
      ■ abstracts out physical hardware details
        ‣ especially size and number of resources
    ■ Goal
      ■ achieve device independence
      ■ approach density/efficiency of raw hardware
      ■ allow application performance to scale based on system resources (without human intervention)

  • SCORE Basics

    ■ Abstract computation is a dataflow graph
      ■ stream links between operators
      ■ dynamic dataflow rates
    ■ Compiler breaks up the computation into compute pages
      ■ unit of scheduling and virtualization
      ■ stream links between pages
    ■ Virtual compute pages are “demand-paged” into available hardware resources as needed.
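
The demand-paging idea can be illustrated with a toy model that counts how many page configurations must be loaded when a sequence of virtual compute pages is activated on hardware with a fixed number of physical pages. The least-recently-used replacement policy and the function name are assumptions chosen for simplicity; the slides do not say which policy SCORE's runtime uses.

```python
from collections import OrderedDict


def count_page_loads(page_requests: list[str], physical_pages: int) -> int:
    """Count configuration loads when virtual pages are demand-paged (LRU eviction)."""
    resident: OrderedDict[str, bool] = OrderedDict()
    loads = 0
    for page in page_requests:
        if page in resident:
            resident.move_to_end(page)        # already mapped: mark as recently used
            continue
        if len(resident) >= physical_pages:
            resident.popitem(last=False)      # evict the least recently used page
        resident[page] = True
        loads += 1
    return loads


if __name__ == "__main__":
    requests = ["A", "B", "C", "A", "B", "D", "A"]
    for size in (2, 3, 4):
        print(f"{size} physical pages -> {count_page_loads(requests, size)} loads")
```

The same virtual graph runs on any hardware size; a larger device simply needs fewer reconfigurations, which is the scaling behavior the compute model aims for.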

  • Virtual Hardware Model

    ■ Dataflow graph is arbitrarily large
    ■ Hardware has finite resources
      ■ resources vary among implementations
    ■ Dataflow graph is scheduled on the hardware
      ■ happens automatically (software)
      ■ physical resources are abstracted in the compute model
    ■ Graph composition and node size are data dependent

  • Architecture Model

    Hybrid processor: conventional RISC core, reconfigurable array, memory.

  • Serial Implementation

  • Spatial Implementation

  • Architecture Model (cont.)

    ■ The architecture model and the SCORE compute model permit scaling over a wide range of IC processes and die sizes.
    ■ A 0.13 µm process with a 16 mm × 16 mm payload suggests:
      ■ 256 compute/memory tiles
      ■ total of 32K logic cells, 0.5 Gbit memory
      ■ RISC core with 32 Kbit I/D cache, area equivalent to 8 tiles.
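
Dividing the chip totals by the tile count gives a per-tile budget. The arithmetic below assumes binary prefixes (K = 1024, G = 2^30) and assumes that the tile array plus the RISC core together make up the whole payload; neither assumption is stated on the slide.

```python
TILES = 256
LOGIC_CELLS = 32 * 1024        # "32K logic cells", assuming K = 1024
MEMORY_BITS = 2 ** 30 // 2     # "0.5 Gbit", assuming G = 2**30
RISC_TILE_EQUIVALENTS = 8      # RISC core plus caches, in tile areas

cells_per_tile = LOGIC_CELLS // TILES
memory_per_tile_mbit = MEMORY_BITS / TILES / 2 ** 20
# Assumption: payload area = 256 tiles + 8 tile-equivalents for the core.
risc_area_fraction = RISC_TILE_EQUIVALENTS / (TILES + RISC_TILE_EQUIVALENTS)

if __name__ == "__main__":
    print(f"{cells_per_tile} logic cells per tile")
    print(f"{memory_per_tile_mbit:.0f} Mbit of memory per tile")
    print(f"RISC core takes about {risc_area_fraction:.1%} of the payload")
```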

  • Configurable System on a Chip

    ■ This micro-architecture and chip is an example CSoC.
    ■ SCORE provides a general framework for SoC families:
      ■ interconnect/architecture fabric
      ■ software model
        ‣ compute model for application assembly/scaling
        ‣ OS/runtime
      ■ both for
        ‣ standard cores
        ‣ custom, application-specific components (hardcoded accelerators)

  • Key Idea: Interconnect Fabric

    ■ Standard/common interconnect fabric
    ■ Mix-and-match nodes on the fabric
      ■ provide different resource balance
      ■ match needs of particular applications
    ■ All use a common compute model
      ■ share software and infrastructure

  • Sample Hybrid CSoC - vision 2000

  • Xilinx Versal 2020

  • End of Lecture 5