CS250, UC Berkeley Fall ‘20 Lecture 04, Reconfigurable Architectures 2 CS250 VLSI Systems Design Fall 2020 John Wawrzynek with Arya Reais-Parsi


  • CS250, UC Berkeley Fall ‘20, Lecture 04, Reconfigurable Architectures 2

    CS250 VLSI Systems Design, Fall 2020

    John Wawrzynek with Arya Reais-Parsi

  • CS250, UC Berkeley Fall ‘20, Lecture 06, Reconfigurable Architecture 4

    Brings us to “reconfigurable computing”

    ■ What is it?
    ■ Standard definition: computing via a post-fabrication and spatially programmed connection of processing elements.
      ‣ ASIC implementations are excluded: not post-fabrication programmable.
      ‣ An FPGA implementation of a processor core to run a program is excluded: not a direct spatial mapping of the problem.

    ■ Does this include arrays of processors?
    ■ This definition restricts RC to mapping onto “fine-grained” devices (such as FPGAs); however, many of the same principles apply to arrays of processors.

  • Spatial Computation

    ■ Example:

    grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project;

    ■ A hardware resource is allocated and dedicated for each operator (in this case, multiplier or adder) in the compute graph.

    [Figure: compute graph in which four multipliers (0.2 × mt1, 0.2 × mt2, 0.2 × mt3, 0.4 × proj) feed a tree of three adders that produces grade.]

  • Temporal Computation

    ■ A hardware resource is time-multiplexed to implement the actions of the operators in the compute graph.

    ■ Typical in a sequential processor/software solution, but also possible in reconfigurable logic.

    ■ In reconfigurable logic it might be necessary to serialize a computation because of:
      ■ Limited chip resources
      ■ Limited I/O bandwidth

    acc1 = mt1 + mt2;
    acc1 = acc1 + mt3;
    acc1 = 0.2 × acc1;
    acc2 = 0.4 × proj;
    grade = acc1 + acc2;

    [Figure: abstract computation graph (mt1, mt2, and mt3 are summed and scaled by 0.2, proj is scaled by 0.4, and the two products are added to give grade) next to its implementation (a controller driving a single ALU with inputs mt1, mt2, mt3, proj and registers acc1, acc2).]

    Reconfigurable Computing permits the full range of spatial, temporal, and mixed computing solutions to best match implementation to task specifics and available hardware.
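The two mappings of the grade example can be contrasted in a short Python sketch. This is only an illustration of the idea, not a hardware model: the function names and the step counter are assumptions introduced here, and Python evaluates both versions sequentially. The spatial version mirrors the graph with one operator per node (four multipliers, three adders); the temporal version replays the five-step schedule from the slide on a single shared ALU.

```python
def grade_spatial(mt1, mt2, mt3, proj):
    """Spatial mapping: every operator in the compute graph has its own unit."""
    # Four dedicated multipliers (in hardware these would work side by side).
    p1, p2, p3, p4 = 0.2 * mt1, 0.2 * mt2, 0.2 * mt3, 0.4 * proj
    # A tree of three dedicated adders.
    return (p1 + p2) + (p3 + p4)


def grade_temporal(mt1, mt2, mt3, proj):
    """Temporal mapping: one ALU is reused, one operation per step."""
    steps = 0

    def alu(op, a, b):
        # The single shared function unit; each call models one time step.
        nonlocal steps
        steps += 1
        return a + b if op == "+" else a * b

    acc1 = alu("+", mt1, mt2)
    acc1 = alu("+", acc1, mt3)
    acc1 = alu("*", 0.2, acc1)
    acc2 = alu("*", 0.4, proj)
    grade = alu("+", acc1, acc2)
    return grade, steps
```

Both functions compute the same grade (up to floating-point rounding). The spatial form spends seven operators to finish in one pass through the graph, while the temporal form spends one ALU and five steps.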

  • RC, Processors, & ASIC

  • RC Strategy

    1. Exploit cases where an operation can be bound and then reused a large number of times.

    2. Customize for operator type, width, and interconnect.

    3. Low-overhead exploitation of application parallelism.

  • Hybrid Approach

    ■ 90/10 rule:
      ■ 90 percent of the program runtime is consumed by 10 percent of the code (inner loops).
      ■ Only small portions of an application become the performance bottlenecks.
      ■ Usually, these portions of code are data-processing intensive with relatively fixed dataflow patterns (little control).
      ■ The other 90 percent of the code is not performance critical.

    ⇒ Hybrid processor-core reconfigurable-array
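One way to see why the 90/10 rule motivates the hybrid approach is Amdahl's law, which is not on the slide and is brought in here as a supporting illustration. Assume the fraction f = 0.9 of runtime spent in the inner loops is moved to the reconfigurable array and sped up by a factor s, while the remaining 10 percent stays on the processor core unchanged. The value s = 10 is purely hypothetical.

```latex
S_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{s}}, \qquad f = 0.9

s = 10: \quad S_{\text{overall}} = \frac{1}{0.1 + 0.09} \approx 5.3

s \to \infty: \quad S_{\text{overall}} \to \frac{1}{1 - f} = 10
```

Under these assumptions, accelerating only the inner loops already captures most of the available gain, and the unaccelerated 10 percent caps the overall speedup at 10.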

  • Garp – Hybrid Processor

    Function               Speedup
    strlen (len 16)        1.77
    strlen (len 1024)      14
    sort                   2.1
    image median filter    26.9
    DES (ECB mode)         19.6
    image dithering        16.3

    Speedups over a 4-way superscalar UltraSparc on the same process and comparable die size and memory system.

    “Garp: A MIPS Processor with a Reconfigurable Coprocessor”, in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM ‘97, April 16-18, 1997)

    • Pre-generated circuits for common program kernels are cached within the reconfigurable array and used to accelerate MIPS programs.

    • nSec configuration swap time.

    • Speedup: tied to a single execution thread.

  • GarpCC (T. Callahan)

    Compilation time < 2x processor-only compilation time.

    Kernels from wavelet image compression. Speedups relative to MIPS processor only.

    Kernel               raw    net
    forward_wavelet_1    2      1.9
    forward_wavelet_2    4.1    3.6
    init_image           6.4    6.4
    forward_wavelet_3    4.1    3.6
    forward_wavelet_4    5.2    4.1
    entropy_encode_1     4      4
    block_quantize       2.8    2.6
    RLE_encode           5.8    3.4
    entropy_encode_2     2.9    1.5

    “The Garp Architecture and C Compiler”, IEEE Computer, April 2000.

  • Advantages of RC over Processor Core

    ■ Conventional processors have several sources of inefficiency:
      ■ Heavy time-multiplexing of function units (ALUs).
      ■ Instruction issue overhead.
      ■ Memory hierarchy to deal with memory latency.
      ■ Operator mismatch.

    [Figure: peak (raw) performance.]

  • Advantages of RC

    ■ Relative to microprocessors: on average, a higher percentage of peak (or raw) computational density is achieved with reconfigurable devices:

      ■ Fine-grain flexibility leads to exploitation of problem-specific parallelism at many levels.

      ■ Also, many different computation models (or patterns) can be supported. In general, it is possible to match problem characteristics to hardware through the use of problem-specific architectures and low-level circuit specialization.

      ■ Spatial mapping of computation versus multiplexing of function units (as in processors) relieves pressure on memory capacity and bandwidth, and promotes local communication patterns.

  • Fine-grained Reconfigurable Fabrics

    Homogeneous fine-grained arrays are maximally flexible:

    a. Admit a wide variety of computational architecture models: arrays of processors, hybrid approaches, hard-wired dataflow, systolic processing, vector processing, etc.

    b. Admit a wide variety of parallelism modes: SIMD, MIMD, bit-level, etc. Resources can be deployed to lower latency when required for tight feedback loops (not possible with many parallel architectures that optimize for throughput).

    c. Support many compilation/resource management models: statically compiled, dynamically mapped.

    Safe bet as a future standard device.

  • FPGAs are Reconfigurable

    Seemingly obvious point but …

    1. Volume/cost graphs don’t accurately capture the potential real costs and other advantages.

    2. Commercial applications have not taken advantage of reconfigurability.
       • Xilinx/Altera (Intel) haven’t done much to help.
       • Methodologies/tools nearly nonexistent.

    Reconfiguration uses:

    ‣ Field upgrades ⇒ product life extension, changing requirements.
    ‣ In-system board-level testing and field diagnostics.
    ‣ Tolerance to faults.
    ‣ Risk management in system development.
    ‣ Runtime reconfiguration ⇒ higher silicon efficiency.
      ‣ Time-multiplexed pre-designed circuits make maximum use of resources.
      ‣ Runtime specialized circuit generation.

  • Multi-modal Computing Tasks

    ■ Mini/Micro-UAVs
      ■ One piece of silicon for all of sensor processing, navigation, communications, planning, logging, etc.
      ■ At different times different tasks take priority and consume a higher percentage of resources.

    ■ Other example: hand-held multi-function device with GPS, smart image capture/analysis, communications.

    A premier application for reconfigurable devices is one with constrained size/weight that needs multiple functions at near-ASIC performance.

    Multiple ASICs are too expensive/big. A processor is too slow. Fine-grained reconfigurable devices have the flexibility to efficiently match task parallelism over a wide variety of tasks, deployed as needed and reconfigured as needed.

    [Figure: Mars rover.]

  • Sounds great, what’s the catch?

    ■ Lack of a programming model with convenient and effective tools.

    ■ Most successful computing applications using reconfigurable devices involve substantial “hand mapping”. Essentially circuit design.

    ■ Complex issue, but perhaps changing the fabric design can help.

  • Rapid Runtime Reconfiguration

    ■ Might permit even higher efficiency through hardware sharing (multiplexing) and on-the-fly circuit specialization.

    ■ Largely unexploited (unproven) to date.
    ■ A few research projects have explored this idea.
    ■ Need to be careful: multiplexing adds cost.
    ■ Remember the “Binding Time Principle”: the earlier the “instruction” is bound, the less area & delay required for the implementation.

  • Rapid Runtime Reconfiguration

    Why dynamic reconfiguration?

    1. Time-multiplexing resources allows more efficient use of silicon (in ways ASICs typically do not):

       a. Low-duty-cycle or “off critical path” computations time-share the fabric while the critical path stays mapped in.

    [Figure: plot with axis labels “amount of reconfigurable fabric” and “total runtime”, annotated “size of maximum efficiency”.]

  • Rapid Runtime Reconfiguration

       b. Coarse data-dependent control flow maps in only the useful dataflow.

          [Figure: if-then-else.]

       c. Allowable task footprint may change as other tasks come and go or faults occur.

    Fabric virtualization allows automatic migration up and down in device sizes and eases application development.

  • Rapid Runtime Reconfiguration

    2. Runtime Circuit Specialization:

    • Example: fixed-coefficient multipliers in an adaptive filter whose coefficients change value at a low rate.

    • Aggressive constant propagation (based perhaps on runtime profiling) reduces circuit size and delay.

    • Could use “branch/value/range prediction” to map the most common case and fault in exceptional cases.

    • Can be template based (“fill in the blanks”), but better if we put PPR in the runtime loop!

    • Array hardware-assisted place and route may make it possible.
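The fixed-coefficient multiplier example can be sketched in Python. This is a software analogy under stated assumptions, not the lecture's method: the helper names `shift_add_terms` and `specialized_multiplier` are hypothetical, coefficients are non-negative integers, and the "circuit" is modeled as a plain shift-and-add network. Propagating the constant leaves one shifted copy of the input per set bit in the coefficient, so the adder count drops from roughly one per coefficient bit to one fewer than the number of set bits.

```python
def shift_add_terms(coeff: int) -> list[int]:
    """Bit positions that are set in a non-negative constant coefficient."""
    if coeff < 0:
        raise ValueError("this sketch only handles non-negative coefficients")
    return [i for i in range(coeff.bit_length()) if (coeff >> i) & 1]


def specialized_multiplier(coeff: int):
    """Build a multiply-by-constant function and report its adder count."""
    shifts = shift_add_terms(coeff)

    def multiply(x: int) -> int:
        # One shifted copy of x per set bit; the copies are summed.
        return sum(x << s for s in shifts)

    adders = max(len(shifts) - 1, 0)
    return multiply, adders


# Hypothetical coefficient: 10 = 0b1010, so x*10 = (x << 1) + (x << 3).
mul10, adders10 = specialized_multiplier(10)
```

When the adaptive filter updates a coefficient, calling `specialized_multiplier` again stands in for regenerating the specialized circuit. That is affordable only if coefficients change at a low rate, as the slide notes.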

  • SCORE: Stream Computations Organized for Reconfigurable Execution

    ■ A computation model for reconfigurable systems
      ■ abstracts out physical hardware details
        ‣ especially size and number of resources

    ■ Goal
      ■ achieve device independence
      ■ approach density/efficiency of raw hardware
      ■ allow application performance to scale based on system resources (without human intervention)

  • SCORE – Virtualized Fabric Model

    [Figure: if-else.]

    High silicon efficiency:

    ♦ Only active parts of the dataflow consume resources.

    ♦ High-duty-cycle critical path of computation stays mapped and remaining resources are shared by lower-duty-cycle paths.

    ♦ Particularly effective for multi-tasking environments with time-varying task requirements.

    ♦ Fabric virtualization with demand paging:
      • Get the most out of available resources by automatically time-multiplexing.
      • Automatic migration up and down in device sizes.
      • Eases application development.

  • SCORE Basics

    ■ Abstract computation is a dataflow graph (Kahn process network)
      ■ stream links between operators
      ■ dynamic dataflow rates

    ■ Compiler breaks up computation into compute pages
      ■ unit of scheduling and virtualization
      ■ stream links between pages

    ■ Virtual compute pages are “demand-paged” into available hardware resources as needed.

  • Virtual Hardware Model

    ■ Dataflow graph is arbitrarily large
    ■ Hardware has finite resources
      ■ resources vary among implementations

    ■ Dataflow graph is scheduled on the hardware
      ■ Happens automatically (software)
      ■ physical resources are abstracted in compute model

    ■ Graph composition and node size are data dependent
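The idea of demand-paging virtual compute pages onto finite hardware can be sketched as a toy Python model. Everything here is an assumption made for illustration and not the SCORE runtime: the `ComputePage` class, the `run` scheduler, its downstream-first policy, and oldest-first eviction are all hypothetical. Pages are single-input operators linked by FIFO streams, and a page is loaded into a physical slot only when it has a token to process.

```python
from collections import deque


class ComputePage:
    """A virtual compute page: one operator with a FIFO input stream."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
        self.inbox = deque()   # input stream
        self.outbox = None     # downstream page's inbox, or a result list

    def fire(self):
        # Consume one token from the input stream and emit one result.
        self.outbox.append(self.fn(self.inbox.popleft()))


def run(pages, physical_slots):
    """Run until all streams drain; return how many page loads were needed."""
    if physical_slots < 1:
        raise ValueError("need at least one physical slot")
    resident = []              # pages currently configured, oldest first
    loads = 0
    while True:
        ready = [p for p in pages if p.inbox]
        if not ready:
            return loads
        page = ready[-1]       # downstream-first: drain the pipeline
        if page not in resident:
            if len(resident) == physical_slots:
                resident.pop(0)            # evict the oldest resident page
            resident.append(page)
            loads += 1                     # models one reconfiguration
        page.fire()


def build_pipeline(tokens):
    """Three chained pages: scale by 2, add 1, square."""
    results = []
    p0 = ComputePage("scale", lambda x: 2 * x)
    p1 = ComputePage("offset", lambda x: x + 1)
    p2 = ComputePage("square", lambda x: x * x)
    p0.outbox, p1.outbox, p2.outbox = p1.inbox, p2.inbox, results
    p0.inbox.extend(tokens)
    return [p0, p1, p2], results
```

In this toy model the same graph yields the same results whether it runs on one physical slot or three; only the number of page loads changes. That is the device-independence property the compute model aims for, with performance scaling with available resources.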

  • Architecture Model

    Hybrid processor: conventional RISC core, reconfigurable array, memory.

  • Serial Implementation

  • Spatial Implementation

  • Architecture Model (cont.)

    ■ Architecture model and SCORE compute model permit scaling over a wide range of IC processes and die sizes.

    ■ 0.13 µm process with 16 mm × 16 mm payload suggests:
      ■ 256 compute/memory tiles
      ■ total of 32K logic cells, 0.5 Gbit memory
      ■ RISC core with 32 Kbit I/D cache, area equivalent to 8 tiles.
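The per-tile budget implied by these totals can be worked out directly. This assumes, beyond what the slide states, that logic cells and memory are spread evenly across the 256 tiles and that K and G are binary prefixes (2^10 and 2^30).

```latex
\frac{32\,\text{K logic cells}}{256\ \text{tiles}} = \frac{32768}{256} = 128\ \text{logic cells per tile}

\frac{0.5\ \text{Gbit}}{256\ \text{tiles}} = \frac{2^{29}\ \text{bit}}{2^{8}} = 2^{21}\ \text{bit} = 2\ \text{Mbit per tile}

16\ \text{mm} \times 16\ \text{mm} = 256\ \text{mm}^2 \ \text{of payload area}
```

Under those assumptions each tile holds 128 logic cells and 2 Mbit of memory. With 256 tiles on a 256 mm² payload, a tile occupies on the order of 1 mm².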

  • Configurable System on a Chip

    ■ This micro-architecture and chip is an example CSoC
    ■ SCORE provides a general framework for SoC families
      ■ interconnect/architecture fabric
      ■ software model
        ‣ compute model for application assembly/scaling
        ‣ OS/runtime
      ■ both for
        ‣ standard cores
        ‣ custom, application-specific components (hardcoded accelerators)

  • Key Idea: Interconnect Fabric

    ■ Standard/common interconnect fabric
    ■ Mix-and-match nodes on fabric
      ■ provide different resource balance
      ■ match needs of particular applications
    ■ All use common compute model
      ■ share software and infrastructure

  • Sample Hybrid CSoC - vision 2000

  • Xilinx Versal 2020

  • Reconfigurable Array Design Research

    ‣ Architecture
      ‣ Configurable Logic Blocks
        ‣ Alternative to LUTs (less area, delay)

    Many of these topics have been studied, but remain open (or secret).

    As with most design decisions, it comes down to design space exploration and cost/benefit analysis. Ideally, finding the Pareto optimal frontier.

    In FPGA design, it is complicated by the fact that decisions are interrelated. For instance, CLB design strongly affects interconnection needs.
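Finding a Pareto optimal frontier during design space exploration can be sketched in a few lines of Python. The design points below are invented purely for illustration and are not measurements from any real fabric. The function name `pareto_frontier` and the choice of (area, delay) as the two costs are likewise assumptions. A point is kept if no other point is at least as good on both costs.

```python
def pareto_frontier(points):
    """Return the (area, delay) points not dominated by any other point.

    Lower is better for both coordinates.
    """
    frontier = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)


# Hypothetical (area, delay) results for five candidate CLB designs.
candidates = [(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0), (5.0, 2.5)]
best = pareto_frontier(candidates)
```

Here (3.0, 6.0) is dominated by (2.0, 5.0) and (5.0, 2.5) by (4.0, 2.0), so three candidates remain. Because CLB and interconnect choices interact, each candidate would in practice be a complete fabric configuration rather than a CLB considered in isolation.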

  • Reconfigurable Array Research

    ‣ Architecture
      ‣ Configurable Logic Blocks
        ‣ LUT size:

    E. Ahmed and J. Rose, "The effect of LUT and cluster size on deep-submicron FPGA performance and density," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 3, pp. 288-298, March 2004, doi: 10.1109/TVLSI.2004.824300.

    “Finally, our results show that a LUT size of 4 to 6 and cluster size of between 3-10 provides the best area-delay product for an FPGA”
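For readers who want a concrete picture of what "LUT size" means, here is a minimal Python model of a k-input lookup table. The `LUT` class and `program_lut` helper are hypothetical names for this sketch and ignore clustering, flip-flops, and routing. A k-input LUT stores one configuration bit per input combination, so its storage grows as 2^k, which is one reason larger LUTs trade area against logic depth.

```python
class LUT:
    """A k-input lookup table: 2**k configuration bits, one per input pattern."""

    def __init__(self, k, truth_table):
        if len(truth_table) != 2 ** k:
            raise ValueError("a k-input LUT needs exactly 2**k table entries")
        self.k = k
        self.table = list(truth_table)

    def evaluate(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        # Input i supplies bit i of the table address.
        index = 0
        for position, value in enumerate(inputs):
            index |= (value & 1) << position
        return self.table[index]


def program_lut(k, fn):
    """Fill a LUT's configuration bits from any k-input Boolean function."""
    table = [fn(*[(i >> b) & 1 for b in range(k)]) for i in range(2 ** k)]
    return LUT(k, table)


# A 3-input XOR (the sum bit of a full adder) needs 8 configuration bits.
xor3 = program_lut(3, lambda a, b, c: a ^ b ^ c)
```

Going from a 4-input to a 6-input LUT in this model raises the table from 16 to 64 bits. That is the kind of area cost the quoted study weighs against reduced logic depth.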

  • Reconfigurable Array Research

    ‣ Architecture
      ‣ Configurable Logic Blocks
        ‣ Internal interconnection among LUTs/FFs

    Wenyi Feng, Jonathan Greene, and Alan Mishchenko. 2018. Improving FPGA Performance with a S44 LUT Structure. In FPGA’18: 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 25–27, 2018, Monterey, CA, USA. ACM, New York, NY, USA, 6 pages. DOI: https://doi.org/10.1145/3174243.3174272

    “we show that mapping to a 7-input LUT structure can approach the performance of 6-input LUTs while retaining the area and static power advantage of 4-input LUTs ”

  • Reconfigurable Array Research

    ‣ Architecture
      ‣ Configurable Logic Blocks
        ‣ ALU versus LUT based, processor cores
        ‣ Hybrid fine-grained/coarse-grained?

    Marshall, Alan & Stansfield, Tony & Kostarnov, Igor & Vuillemin, Jean & Hutchings, Brad. (1999). A Reconfigurable Arithmetic Array for Multimedia Application. 135-143. 10.1145/296399.296444.

    A. Podobas, K. Sano and S. Matsuoka, "A Survey on Coarse-Grained Reconfigurable Architectures From a Performance Perspective," in IEEE Access, vol. 8, pp. 146719-146743, 2020, doi: 10.1109/ACCESS.2020.3012084.

  • Reconfigurable Array Research

    ‣ Architecture
      ‣ Interconnection Network (relatively unexplored)
        ‣ Clos network, fat-trees, other ad hoc topologies
        ‣ Tool place-and-route time critical
      ‣ Configuration Structure
        ‣ Partial reconfiguration granularity
        ‣ dynamic reconfiguration (reconfigure while running)
        ‣ multiple context
        ‣ debugging interface

  • Reconfigurable Array Research

    ‣ Implementation
      ‣ Standard Cells versus Full Custom
      ‣ Hybrid

    Kim, Jin & Anderson, Jason. (2017). Synthesizable Standard Cell FPGA Fabrics Targetable by the Verilog-to-Routing CAD Flow. ACM Transactions on Reconfigurable Technology and Systems. 10. 1-23. 10.1145/3024063.

    X. Tang, E. Giacomin, A. Alacchi, B. Chauviere and P. Gaillardon, "OpenFPGA: An Opensource Framework Enabling Rapid Prototyping of Customizable FPGAs," 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 2019, pp. 367-374, doi: 10.1109/FPL.2019.00065.

    [Figure: tile area.]

  • Reconfigurable Array Research

    ‣ Implementation
      ‣ Full custom fabric layout generation (process portability, rapid DSE):
        ‣ Sticks
        ‣ BAG - Berkeley Analog Generator

    E. Chang et al., "BAG2: A process-portable framework for generator-based AMS circuit design," 2018 IEEE Custom Integrated Circuits Conference (CICC), San Diego, CA, 2018, pp. 1-8, doi: 10.1109/CICC.2018.8357061.

  • Reconfigurable Array Research

    ‣ The “RISC” of FPGAs
      ‣ Reduced Complexity Reconfigurable Array (RCRA)
      ‣ Compared to state-of-the-art commercial arrays, can a very simple array compete with (or beat) them on:
        ‣ Performance (clock frequency), area efficiency, power efficiency,
        ‣ area/delay product, power/delay product?
      ‣ Do the PPR (partition, place, & route) tools speed up?
      ‣ Ideas:
        ‣ If local interconnect is efficient, then perhaps clustering is not necessary.
        ‣ Are LUTs overkill?
        ‣ Simpler interconnect?

  • Class Projects

    ‣ For purposes of learning:
      ‣ Redesign a classic (simple array):
        ‣ ex: Xilinx xc2064, xc6200
    ‣ Advanced:
      ‣ first attempt at RCRA
      ‣ multi-context optimized for dynamic reconfiguration
      ‣ hybrid coarse-grain fine-grain array
    ‣ How many projects?
    ‣ How many teams?
    ‣ How to divide up functionality and/or be redundant?

    Do we want to do architecture research or simply focus on implementation issues? Do you want to do research in implementation issues?

  • Project Proposals

  • End of Lecture 6