
CS250, UC Berkeley, Fall ‘20, Lecture 04, Reconfigurable Architectures 2

CS250 VLSI Systems Design

Fall 2020

John Wawrzynek with Arya Reais-Parsi

CS250, UC Berkeley, Fall ‘20, Lecture 05, Reconfigurable Architecture 3

Reconfigurable Fabric Architecture: Degrees of Freedom

1. Logic Blocks: capacity and internal structure of combinational logic circuits and state element(s), clustering and internal interconnect

2. Interconnection Network Architecture: circuit-switched not packet-switched, topology of network

3. Configuration Architecture: how is programming information loaded and distributed, configuration “depth”

4. Hard blocks: RAM, ALUs, Processor Cores, …: function(s), count, and how integrated into the fabric

Xilinx Virtex-5

Colors represent different types of resources: Logic, Block RAM, DSP (ALUs), Clocking, I/O, Serial I/O + PCI

A routing fabric runs throughout the chip to wire everything together.


Interconnection Topologies
‣ Traditional Island Style
‣ From Flex Logix: Clos Network

“uses about half the area of the traditional interconnect and uses only 5-7 metal routing layers”


Fat-Tree Based Interconnect
‣ Use “Rent’s rule” for proper thickness

[Figure: fat-tree interconnect example, labeled “start-up”]

Lessons: 1) for efficiency, need to “flatten” lower levels of tree, 2) critical path might be long
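Rent's rule is commonly written T = t · N^p, where T is the number of external connections of a region holding N logic blocks, t is the average number of terminals per block, and p is the Rent exponent. The sketch below shows how it could size the links of a fat-tree. It assumes a binary tree and uses illustrative values t = 4 and p = 0.6; neither number comes from the lecture.

```python
import math

def rent_terminals(n_blocks: int, t: float, p: float) -> int:
    """Rent's rule estimate T = t * N**p, rounded up to whole wires."""
    return math.ceil(t * n_blocks ** p)

def fat_tree_link_widths(levels: int, t: float = 4.0, p: float = 0.6) -> list[int]:
    """Width of the upward link of a subtree at each level of a binary fat-tree.

    A level-k subtree contains 2**k leaf logic blocks. The defaults for
    t and p are illustrative assumptions only.
    """
    return [rent_terminals(2 ** k, t, p) for k in range(levels + 1)]

widths = fat_tree_link_widths(levels=6)
for k, w in enumerate(widths):
    print(f"level {k}: {2 ** k:3d} leaves, {w:3d} wires up")
```

Because p < 1, the links grow more slowly than the number of leaves below them, which is what keeps the upper levels of the tree from dominating the area.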


Embedded Hard Blocks
‣ Many important functions are not efficient when implemented in the reconfigurable fabric:
  ‣ multiplication, large memory, processor cores, …
‣ Dedicated blocks take relatively little area, so it is acceptable for some of them to go unused.



Virtex DSP48E Slice

Efficient implementation of multiply, add, bit-wise logical.

Block RAM Overview
❑ 36K bits of data total, can be configured as:
  ▪ 2 independent 18Kb RAMs, or one 36Kb RAM.
❑ Each 36Kb block RAM can be configured as:
  ▪ 64Kx1 (when cascaded with an adjacent 36Kb block RAM), 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, or 1Kx36 memory.
❑ Each 18Kb block RAM can be configured as:
  ▪ 16Kx1, 8Kx2, 4Kx4, 2Kx9, or 1Kx18 memory.
❑ Write and Read are synchronous operations.
❑ The two ports are symmetrical and totally independent (can have different clocks), sharing only the stored data.
❑ Each port can be configured in one of the available widths, independent of the other port. The read port width can be different from the write port width for each port.
❑ The memory content can be initialized or cleared by the configuration bitstream.
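The aspect ratios listed above can be checked with a few lines of arithmetic. This sketch multiplies depth by width for each single-block 36Kb mode. The names `CONFIGS_36KB` and `capacity_bits` are made up for illustration.

```python
KBIT = 1024

# (depth, width) pairs for one 36Kb block RAM, as listed above.
# The 64Kx1 mode is left out because it needs two cascaded blocks.
CONFIGS_36KB = [(32 * KBIT, 1), (16 * KBIT, 2), (8 * KBIT, 4),
                (4 * KBIT, 9), (2 * KBIT, 18), (1 * KBIT, 36)]

def capacity_bits(depth: int, width: int) -> int:
    """Total number of bits addressable in a depth x width configuration."""
    return depth * width

for depth, width in CONFIGS_36KB:
    bits = capacity_bits(depth, width)
    print(f"{depth // KBIT:2d}K x {width:2d} -> {bits // KBIT} Kb")
```

The x1, x2 and x4 modes address 32Kb, while the x9, x18 and x36 modes reach the full 36Kb. The extra bit per byte in the wider modes is typically used for parity.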

Ultra-RAM Blocks

UltraRAM block is a dual-port synchronous 288Kb RAM with fixed configuration of 4,096 deep and 72 bits wide.


First Commercial Hybrid FPGA
‣ Xilinx Virtex-II Pro
‣ January 2010
‣ 150nm process

State-of-the-Art Xilinx FPGAs

Virtex UltraScale


Configuration Architecture
‣ How are the programming bits loaded and distributed?
‣ Configuration depth (number of stored on-chip configurations)
‣ Same interface often can provide read-back to save state/debug
‣ Design Challenge:
  ‣ Configurations are very large (100’s of Mbits)
  ‣ Moving many bits over chip interface requires time and energy

Many commercial FPGAs also have an internal reconfiguration controller that allows dynamic self reconfiguration.
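A rough estimate makes the time cost of loading a configuration concrete. The sketch below assumes the port accepts a fixed number of bits per clock cycle with no protocol overhead. The bitstream size, port widths and clock rate are hypothetical values, not figures for any particular device.

```python
def config_load_time_s(bitstream_bits: int, bus_width_bits: int, clock_hz: float) -> float:
    """Seconds to stream a bitstream through a port moving bus_width_bits per cycle."""
    cycles = -(-bitstream_bits // bus_width_bits)  # ceiling division
    return cycles / clock_hz

# Hypothetical numbers for illustration only.
full_bits = 200_000_000                      # a "100's of Mbits" configuration
t_serial = config_load_time_s(full_bits, 1, 100e6)
t_wide = config_load_time_s(full_bits, 32, 100e6)

print(f"1-bit port : {t_serial:.3f} s")
print(f"32-bit port: {t_wide * 1e3:.1f} ms")
```

Even with a 32-bit port the full load takes tens of milliseconds under these assumptions, which is why partial reconfiguration and multiple on-chip contexts are attractive.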


Internal Reconfiguration
‣ Traditionally, long shift chains:
  ‣ slow, relatively energy efficient
‣ “Random access” structures have been tried.
  ‣ permits fine-grain partial reconfiguration

[Figure: configuration bits with connections to logic blocks, programmable interconnection points, …]


Xilinx Configuration Layout
‣ “frame” is unit of reconfiguration
‣ serially loaded into chip
‣ Permits “partial reconfiguration”
‣ xc6000 took this to the limit with word level configuration granularity.
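Frame-level partial reconfiguration can be sketched as rewriting only the frames that differ between two configurations. The model below is a toy: `FRAME_BITS`, the 100-frame device and the integer frame contents are invented for illustration and do not describe a real Xilinx part.

```python
FRAME_BITS = 1024  # hypothetical frame size, not a real device parameter

def frames_to_rewrite(current, target):
    """Indices of frames whose contents differ between two configurations."""
    return [i for i, (a, b) in enumerate(zip(current, target)) if a != b]

def reconfigure(current, target):
    """Rewrite only the differing frames in place; return bits transferred."""
    changed = frames_to_rewrite(current, target)
    for i in changed:
        current[i] = target[i]
    return len(changed) * FRAME_BITS

# Toy device with 100 frames; each frame's contents are modeled as one int.
old = [0] * 100
new = list(old)
new[10:13] = [7, 7, 7]               # the new design touches 3 frames

full_cost = len(old) * FRAME_BITS    # cost of reloading every frame
partial_cost = reconfigure(old, new)
print(full_cost, partial_cost)
```

Finer granularity, down to word level, shrinks the number of bits moved for a small change, at the cost of more addressing hardware.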


Multi-context FPGAs
‣ Rapid dynamic reconfiguration possible.
‣ What’s the execution and programming model?

“Garp: a MIPS processor with a reconfigurable coprocessor”, Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 1997.

“3-D FPGA”
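One way to picture a multi-context device is a lookup table that stores several truth tables at once and selects among them with a context index. The class below is a behavioral sketch only. `MultiContextLUT` and its methods are hypothetical names, and real devices differ in how contexts are stored and switched.

```python
class MultiContextLUT:
    """A k-input LUT holding several configurations (contexts) at once."""

    def __init__(self, k: int, contexts: list[list[int]]):
        size = 1 << k
        if any(len(table) != size for table in contexts):
            raise ValueError("each context needs 2**k truth-table bits")
        self.k = k
        self.contexts = contexts
        self.active = 0

    def switch(self, context: int) -> None:
        """Select another stored configuration; no bits are reloaded."""
        if not 0 <= context < len(self.contexts):
            raise IndexError("no such context")
        self.active = context

    def eval(self, *inputs: int) -> int:
        if len(inputs) != self.k:
            raise ValueError("expected k input bits")
        index = 0
        for bit in inputs:           # first input is the most significant bit
            index = (index << 1) | (bit & 1)
        return self.contexts[self.active][index]

AND2 = [0, 0, 0, 1]
XOR2 = [0, 1, 1, 0]
lut = MultiContextLUT(2, [AND2, XOR2])
```

Calling `lut.switch(1)` changes the function from AND to XOR in a single step, which models why a context switch can be much faster than reloading configuration bits from off chip.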


Brings us to “reconfigurable computing”

■ What is it?
■ Standard definition: Computing via a post-fabrication and spatially programmed connection of processing elements.
  ‣ ASIC implementations excluded – not post-fabrication programmable.
  ‣ FPGA implementation of a processor core to run a program excluded – not direct spatial mapping of problem.
■ Does this include arrays of processors?
■ This definition restricts RC to mapping to “fine-grained” devices (such as FPGAs); however, many of the same principles apply to arrays of processors.


Spatial Computation

■ Example:

grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project;

■ A hardware resource is allocated and dedicated for each operator (in this case, multiplier or adder) in the compute graph.

[Figure: compute graph with four multipliers (0.2 × mt1, 0.2 × mt2, 0.2 × mt3, 0.4 × proj) feeding a tree of three adders that produces grade]
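The spatial mapping of the grade example can be sketched by building the compute graph explicitly and counting the operators it needs. The `Op` class, the helper functions and the sample scores are illustrative assumptions. The graph shape follows the figure: four multipliers feeding a tree of three adders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    kind: str      # "mul" or "add"
    a: object      # operand: another Op, an input name, or a constant
    b: object

def evaluate(node, inputs):
    """Evaluate the graph; every Op stands for its own hardware operator."""
    if isinstance(node, Op):
        x, y = evaluate(node.a, inputs), evaluate(node.b, inputs)
        return x * y if node.kind == "mul" else x + y
    if isinstance(node, str):
        return inputs[node]
    return node

def count_ops(node):
    """Number of multipliers and adders a fully spatial mapping allocates."""
    total = {"mul": 0, "add": 0}
    if isinstance(node, Op):
        total[node.kind] += 1
        for child in (node.a, node.b):
            for kind, n in count_ops(child).items():
                total[kind] += n
    return total

def depth(node):
    """Operator levels on the longest input-to-output path."""
    if not isinstance(node, Op):
        return 0
    return 1 + max(depth(node.a), depth(node.b))

grade = Op("add",
           Op("add", Op("mul", 0.2, "mt1"), Op("mul", 0.2, "mt2")),
           Op("add", Op("mul", 0.2, "mt3"), Op("mul", 0.4, "project")))

scores = {"mt1": 80, "mt2": 90, "mt3": 70, "project": 100}  # made-up inputs
result = evaluate(grade, scores)
```

Here the mapping uses four multipliers and three adders, and a result passes through only three operator levels.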


Temporal Computation
■ A hardware resource is time-multiplexed to implement the actions of the operators in the compute graph.
■ Typical in a sequential processor/software solution, but also possible in reconfigurable logic.
■ In reconfigurable logic it might be necessary to serialize a computation:
  ■ Limited chip resources
  ■ Limited I/O bandwidth

acc1 = mt1 + mt2;
acc1 = acc1 + mt3;
acc1 = 0.2 x acc1;
acc2 = 0.4 x proj;
grade = acc1 + acc2;

[Figure: abstract computation graph (mt1 + mt2 + mt3 scaled by 0.2, plus 0.4 x proj, producing grade) and its implementation as a controller driving a single ALU with inputs mt1, mt2, mt3, proj and registers acc1, acc2]

Reconfigurable Computing permits the full range of spatial, temporal, and mixed computing solutions to best match implementation to task specifics and available hardware.
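The five-step sequence from the temporal slide can be run on a model of a single shared ALU. In this sketch each step is a tuple and each step is assumed to take one cycle. The tuple format, the function name and the sample scores are assumptions made for illustration.

```python
def run_on_single_alu(program, registers):
    """Execute (dest, op, src1, src2) steps, one per cycle, on one shared ALU."""
    alu = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}

    def read(operand):
        # Strings name registers; anything else is an immediate constant.
        return registers[operand] if isinstance(operand, str) else operand

    cycles = 0
    for dest, op, src1, src2 in program:
        registers[dest] = alu[op](read(src1), read(src2))
        cycles += 1
    return cycles

PROGRAM = [
    ("acc1", "add", "mt1", "mt2"),
    ("acc1", "add", "acc1", "mt3"),
    ("acc1", "mul", 0.2, "acc1"),
    ("acc2", "mul", 0.4, "proj"),
    ("grade", "add", "acc1", "acc2"),
]

regs = {"mt1": 80, "mt2": 90, "mt3": 70, "proj": 100}  # made-up inputs
cycles = run_on_single_alu(PROGRAM, regs)
print(cycles, regs["grade"])
```

The serialized version needs one ALU and two accumulators but takes five steps, and factoring out the common 0.2 reduces the multiplies from four to two.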


RC, Processors, & ASIC

RC Strategy
1. Exploit cases where operation can be bound and then reused a large number of times.
2. Customize for operator type, width, and interconnect.
3. Low-overhead exploitation of application parallelism.


Hybrid Approach
■ 90/10 rule:
  ■ 90 percent of the program run time is consumed by 10 percent of the code (inner-loops).
■ Only small portions of an application become the performance bottlenecks.
■ Usually, these portions of code are data processing intensive with relatively fixed dataflow patterns (little control).
■ The other 90 percent of the code is not performance critical.

⇒ Hybrid processor-core reconfigurable-array
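The 90/10 rule also bounds what the hybrid approach can gain. A short sketch using Amdahl's law, with the 0.9 fraction taken from the rule above and the kernel speedups chosen arbitrarily, shows the whole-program effect of accelerating only the inner loops.

```python
def overall_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only a fraction is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0 or kernel_speedup <= 0:
        raise ValueError("fraction must be in [0, 1] and speedup positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)

for s in (2, 10, 100):
    print(f"kernel speedup {s:3d}x -> overall {overall_speedup(0.9, s):.2f}x")
```

With 90 percent of the run time accelerated, the overall speedup approaches but never reaches 10x, because the remaining 10 percent still runs on the processor core.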


Garp – Hybrid Processor

Function               Speedup
strlen (len 16)        1.77
strlen (len 1024)      14
sort                   2.1
image median filter    26.9
DES (ECB mode)         19.6
image dithering        16.3

Speedups over 4-way superscalar UltraSparc on same process and comparable die size and memory system.

“Garp: A MIPS Processor with a Reconfigurable Coprocessor”, In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM ‘97, April 16-18, 1997)

• Pre-generated circuits for common program kernels cached within reconfigurable array and used to accelerate MIPS programs.
• nSec configuration swap time.
• Speedup – tied to single execution thread.


GarpCC (T. Callahan)

Compilation time < 2x processor only compilation time.

Kernels from wavelet image compression. Speedups relative to MIPS processor only.

Kernel               raw   net
forward_wavelet_1    2     1.9
forward_wavelet_2    4.1   3.6
init_image           6.4   6.4
forward_wavelet_3    4.1   3.6
forward_wavelet_4    5.2   4.1
entropy_encode_1     4     4
block_quantize       2.8   2.6
RLE_encode           5.8   3.4
entropy_encode_2     2.9   1.5

“The Garp Architecture and C Compiler”, IEEE Computer, April 2000.


Advantages of RC over Processor Core
■ Conventional processors have several sources of inefficiency:
  ■ Heavy time-multiplexing of Function Units (ALUs).
  ■ Instruction issue overhead.
  ■ Memory hierarchy to deal with memory latency.
  ■ Operator mismatch

[Figure: peak (raw) performance]


Advantages of RC
■ Relative to microprocessors: on average a higher percentage of peak (or raw) computational density is achieved with reconfigurable devices:
  ■ Fine-grain flexibility leads to exploitation of problem specific parallelism at many levels.
  ■ Also, many different computation models (or patterns) can be supported. In general, it is possible to match problem characteristics to hardware, through the use of problem specific architectures and low-level circuit specialization.
  ■ Spatial mapping of computation versus multiplexing of function units (as in processors) relieves pressure for memory capacity, BW, and promotes local communication patterns.


Advantages of RC
■ Modern FPGAs make good system-level components:
  ■ Relatively large number of IOs (many parallel memory ports). High-BW communications.
  ■ Built-in microprocessors and other blocks.
■ Machines based on these components can easily scale peak performance by riding Moore’s curve (FPGAs are process drivers).
■ Low-level redundancy could permit defect-tolerance and great cost savings.
■ Is there still room for research in novel devices for RC?


FPGAs are Reconfigurable

Seemingly obvious point but …

1. Volume/cost graphs don’t accurately capture the potential real costs and other advantages.
2. Commercial applications have not taken advantage of reconfigurability
  • Xilinx/Altera (Intel) haven’t done much to help.
  • Methodologies/tools nearly nonexistent.

Reconfiguration uses:
‣ Field upgrades ⇒ product life extension, changing requirements.
‣ In system board-level testing and field diagnostics.
‣ Tolerance to faults.
‣ Risk-management in system development.
‣ Runtime reconfiguration ⇒ higher silicon efficiency.
  ‣ Time-multiplexed pre-designed circuits make maximum use of resources.
  ‣ Runtime specialized circuit generation.


Multi-modal Computing Tasks

■ Mini/Micro-UAVs
  ■ One piece of silicon for all of sensor processing, navigation, communications, planning, logging, etc.
  ■ At different times different tasks take priority and consume higher percentage of resources.
■ Other example: hand-held multi-function device with GPS, smart image capture/analysis, communications.

A premier application for reconfigurable devices is one with constrained size/weight that needs multiple functions at near-ASIC performance.

Multiple ASICs too expensive/big. Processor too slow. Fine-grained reconfigurable devices have the flexibility to efficiently match task parallelism over a wide variety of tasks, deployed as needed and reconfigured as needed.

[Figure: Mars rover]


Sounds great, what’s the catch?

■ Lack of programming model with convenient and effective tools.
■ Most successful computing applications using reconfigurable devices involve substantial “hand mapping”. Essentially circuit design.
■ Complex issue, but perhaps changing the fabric design can help.


Fine-grained Reconfigurable Fabrics

Homogeneous fine-grained arrays are maximally flexible:

a. Admit a wide variety of computational architecture models: arrays of processors, hybrid approaches, hard-wired dataflow, systolic processing, vector processing, etc.

b. Admit a wide variety of parallelism modes: SIMD, MIMD, bit-level, etc. Resources can be deployed to lower latency when required for tight feedback loops (not possible with many parallel architectures that optimize for throughput).

c. Support many compilation/resource management models: statically compiled, dynamically mapped.

Safe bet as a future standard device.


Rapid Runtime Reconfiguration

■ Might permit even higher efficiency through hardware sharing (multiplexing) and on-the-fly circuit specialization.
■ Largely unexploited (unproven) to date.
■ A few research projects have explored this idea.
■ Need to be careful – multiplexing adds cost.
■ Remember the “Binding Time Principle”: the earlier the “instruction” is bound, the less area & delay required for the implementation.



Why dynamic reconfiguration?

1. Time-multiplexing resources allows more efficient use of silicon (in ways ASICs typically do not):

a. Low-duty cycle or “off critical path” computations time share fabric while critical path stays mapped in:

[Figure: plot with axis labels “amount of reconfigurable fabric” and “total runtime”, marking the “size of maximum efficiency”]



b. Coarse data-dependent control flow maps in only useful dataflow:

[Figure: If-then-else]

c. Allowable task foot-print may change as other tasks come and go or faults occur.

Fabric virtualization allows automatic migration up and down in device sizes and eases application development.



2. Runtime Circuit Specialization:

• Example: fixed coefficient multipliers in adaptive filter changing value at low rate.
• Aggressive constant propagation (based perhaps on runtime profiling) reduces circuit size and delay.
• Could use “branch/value/range prediction” to map most common case and fault in exceptional cases.
• Can be template based – “fill in the blanks”, but better if we put PPR in runtime loop!
• Array HW assisted place and route may make it possible.
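The fixed-coefficient multiplier example can be illustrated with constant propagation on the multiplier itself. Once the coefficient is known, only its set bits contribute shifted copies of the input. The sketch below uses plain binary decomposition, a simple stand-in for the more aggressive recodings a real tool might apply. The function names are hypothetical.

```python
def shift_add_terms(coefficient: int) -> list[int]:
    """Bit positions set in a non-negative constant: one shifted copy of x each."""
    if coefficient < 0:
        raise ValueError("sketch handles non-negative coefficients only")
    return [i for i in range(coefficient.bit_length()) if (coefficient >> i) & 1]

def specialize_multiplier(coefficient: int):
    """Return (multiply function, adder count) for x * coefficient."""
    shifts = shift_add_terms(coefficient)

    def multiply(x: int) -> int:
        # Each term is a fixed wire shift; only the sum needs adders.
        return sum(x << s for s in shifts)

    return multiply, max(len(shifts) - 1, 0)

# 10 = 0b1010, so x * 10 becomes (x << 3) + (x << 1): a single adder.
times10, adders = specialize_multiplier(10)
print(times10(7), adders)
```

When the adaptive filter updates a coefficient, the specialized circuit would be regenerated, which is only worthwhile if coefficients change at a low rate compared with the data.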

SCORE – Virtualized Fabric Model

[Figure: If-else]

High silicon efficiency:
♦ Only active parts of data-flow consume resources.
♦ High-duty cycle critical path of computation stays mapped and remaining resources are shared by lower duty cycle paths.
♦ Particularly effective for multi-tasking environment with time-varying task requirements.
♦ Fabric virtualization with demand paging:
  • Get most out of available resources by automatically time-multiplexing.
  • Automatic migration up and down in device sizes.
  • Eases application development.


SCORE: Stream Computations Organized for Reconfigurable Execution

■ A computation model for reconfigurable systems
  ■ abstracts out: physical hardware details
    ‣ especially size and number of resources
■ Goal
  ■ achieve device independence
  ■ approach density/efficiency of raw hardware
  ■ allow application performance to scale based on system resources (w/out human intervention)


SCORE Basics

■ Abstract computation is a dataflow graph
  ■ stream links between operators
  ■ dynamic dataflow rates
■ Compiler breaks up computation into compute pages
  ■ unit of scheduling and virtualization
  ■ stream links between pages
■ Virtual compute pages are “demand-paged” into available hardware resources as needed.
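The idea of demand-paging virtual compute pages onto fewer physical pages can be sketched with a toy scheduler. This is not the SCORE runtime. The `Page` class, the least-recently-used eviction and the "drain all buffered tokens" policy are assumptions chosen to keep the example short.

```python
from collections import deque

class Page:
    """A virtual compute page: one operator with input and output streams."""
    def __init__(self, name, fn, inputs, output):
        self.name, self.fn, self.inputs, self.output = name, fn, inputs, output

    def ready(self):
        return all(len(stream) > 0 for stream in self.inputs)

    def fire(self):
        args = [stream.popleft() for stream in self.inputs]
        self.output.append(self.fn(*args))

def run(pages, physical_slots):
    """Run until no page has input; return how many page loads were needed."""
    if physical_slots < 1:
        raise ValueError("need at least one physical page")
    resident = []                      # loaded page names, least recently used first
    loads = 0
    while True:
        runnable = [p for p in pages if p.ready()]
        if not runnable:
            return loads
        for page in runnable:
            if page.name in resident:
                resident.remove(page.name)
            else:
                if len(resident) == physical_slots:
                    resident.pop(0)    # evict the least recently used page
                loads += 1
            resident.append(page.name)
            while page.ready():        # drain buffered tokens while loaded
                page.fire()

# Two-page pipeline: a -> scale -> b -> offset -> c
a, b, c = deque([1, 2, 3]), deque(), deque()
pages = [Page("scale", lambda x: 2 * x, [a], b),
         Page("offset", lambda x: x + 1, [b], c)]
loads = run(pages, physical_slots=1)
print(list(c), loads)
```

With one physical page the two-page graph still completes. The stream buffer between the pages lets each page process all three tokens per load, so only two loads are needed.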

Virtual Hardware Model
■ Dataflow graph is arbitrarily large
■ Hardware has finite resources
  ■ resources vary among implementations
■ Dataflow graph is scheduled on the hardware
  ■ Happens automatically (software)
  ■ physical resources are abstracted in compute model
■ Graph composition and node size are data dependent


Architecture Model

Hybrid processor: conventional RISC core, reconfigurable array, memory.

Serial Implementation

Spatial Implementation


Architecture Model (cont.)
■ Architecture model and SCORE compute model permit scaling over a wide range of IC processes and die sizes.
■ 0.13um process with 16mm X 16mm payload suggests:
  ■ 256 compute/memory tiles
  ■ total of 32K logic cells, 0.5 Gbit memory
  ■ RISC core with 32Kbit I/D cache, area equivalent to 8 tiles.
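Dividing the quoted totals by the tile count gives a feel for what one compute/memory tile holds. The sketch assumes binary prefixes (32K = 32 × 1024, 0.5 Gbit = 2^29 bits) and a square 16 × 16 grid of equal tiles that fills the whole payload. It ignores the area taken by the RISC core, so the tile side is only a rough upper bound.

```python
import math

TILES = 256
LOGIC_CELLS = 32 * 1024
MEMORY_BITS = 512 * 1024 * 1024        # 0.5 Gbit, taking 1 Gbit = 2**30 bits
DIE_SIDE_MM = 16

cells_per_tile = LOGIC_CELLS // TILES              # logic cells per tile
mbit_per_tile = MEMORY_BITS / TILES / 2**20        # memory per tile, in Mbit
tiles_per_side = math.isqrt(TILES)                 # assumes a square grid
tile_side_mm = DIE_SIDE_MM / tiles_per_side        # ignores the RISC core area

print(cells_per_tile, mbit_per_tile, tile_side_mm)
```

Under these assumptions each tile pairs 128 logic cells with 2 Mbit of memory in about 1 mm × 1 mm.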

Configurable System on a Chip
■ This micro-architecture and chip is an example CSoC
■ SCORE provides general framework for SoC families
  ■ interconnect/architecture fabric
  ■ software model
    ‣ compute model for application assembly/scaling
    ‣ OS/runtime
  ■ both for
    ‣ standard cores
    ‣ custom, application specific components (hardcoded accelerators)


Key Idea: Interconnect Fabric
■ Standard/common Interconnect Fabric
■ Mix-and-match nodes on fabric
  ■ provide different resource balance
  ■ match needs of particular applications
■ All use common compute model
  ■ share software and infrastructure


Sample Hybrid CSoC - vision 2000

Xilinx Versal 2020

End of Lecture 5
