CS250, UC Berkeley Fall ‘20, Lecture 04, Reconfigurable Architectures 2
CS250 VLSI Systems Design
Fall 2020
John Wawrzynek
with
Arya Reais-Parsi
CS250, UC Berkeley Fall ‘20, Lecture 06, Reconfigurable Architecture 4
Brings us to “reconfigurable computing”
■ What is it?
■ Standard definition:
Computing via a post-fabrication and spatially programmed connection of processing elements.
‣ ASIC implementations excluded – not post-fabrication programmable.
‣ FPGA implementation of a processor core to run a program excluded – not direct spatial mapping of the problem.
■ Does this include arrays of processors?
■ This definition restricts RC to mapping to “fine-grained” devices (such as FPGAs); however, many of the same principles apply to arrays of processors.
Spatial Computation
■ Example:
grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project;
■ A hardware resource is allocated and dedicated for each operator (in this case, multiplier or adder) in the compute graph.

[Figure: dataflow graph – four multipliers (0.2 × mt1, 0.2 × mt2, 0.2 × mt3, 0.4 × proj) feeding a tree of adders that produces grade.]
Temporal Computation
■ A hardware resource is time-multiplexed to implement the actions of the operators in the compute graph.
■ Typical in a sequential processor / software solution; however, also possible in reconfigurable logic.
■ In reconfigurable logic it might be necessary to serialize a computation:
■ Limited chip resources
■ Limited I/O bandwidth

acc1 = mt1 + mt2;
acc1 = acc1 + mt3;
acc1 = 0.2 × acc1;
acc2 = 0.4 × proj;
grade = acc1 + acc2;

[Figure: the abstract computation graph (multipliers and adders over mt1, mt2, mt3, proj) and its temporal implementation – a controller sequencing a single ALU with registers acc1 and acc2, producing grade.]
Reconfigurable Computing permits the full range of spatial, temporal, and mixed computing solutions to best match implementation to task specifics and available hardware.
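The two mappings compute the same value. A minimal Python sketch (function names are illustrative, using the slide's grade example) contrasts the single-expression spatial form with the serialized temporal schedule:

```python
# Spatial view: every operator in the compute graph gets its own
# dedicated resource; in software this is just the full expression.
def grade_spatial(mt1, mt2, mt3, proj):
    return 0.2 * mt1 + 0.2 * mt2 + 0.2 * mt3 + 0.4 * proj

# Temporal view: one shared ALU stepped through the serialized schedule
# from the slide, using two registers acc1 and acc2.
def grade_temporal(mt1, mt2, mt3, proj):
    acc1 = mt1 + mt2       # add
    acc1 = acc1 + mt3      # add
    acc1 = 0.2 * acc1      # multiply
    acc2 = 0.4 * proj      # multiply
    return acc1 + acc2     # add
```

Both return the same grade (up to floating-point rounding); the temporal version trades five dedicated operators for five sequential steps on one unit.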
RC, Processors, & ASIC
RC Strategy
1. Exploit cases where an operation can be bound and then reused a large number of times.
2. Customize for operator type, width, and interconnect.
3. Low-overhead exploitation of application parallelism.
Hybrid Approach
■ 90/10 rule:
■ 90 percent of the program runtime is consumed by 10 percent of the code (inner loops).
■ Only small portions of an application become the performance bottlenecks.
■ Usually, these portions of code are data-processing intensive with relatively fixed dataflow patterns (little control).
■ The other 90 percent of the code is not performance critical.

⇒ Hybrid processor-core reconfigurable-array
Garp – Hybrid Processor

• Pre-generated circuits for common program kernels cached within the reconfigurable array and used to accelerate MIPS programs.
• nsec configuration swap time.
• Speedup tied to single execution thread.

Function               Speedup
strlen (len 16)        1.77
strlen (len 1024)      14
sort                   2.1
image median filter    26.9
DES (ECB mode)         19.6
image dithering        16.3

Speedups over 4-way superscalar UltraSparc on same process and comparable die size and memory system.

“Garp: A MIPS Processor with a Reconfigurable Coprocessor”, in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM ‘97), April 16-18, 1997.
GarpCC (T. Callahan)

Kernel               raw   net
forward_wavelet_1    2     1.9
forward_wavelet_2    4.1   3.6
init_image           6.4   6.4
forward_wavelet_3    4.1   3.6
forward_wavelet_4    5.2   4.1
entropy_encode_1     4     4
block_quantize       2.8   2.6
RLE_encode           5.8   3.4
entropy_encode_2     2.9   1.5

Kernels from wavelet image compression. Speedups relative to MIPS processor only.
Compilation time < 2× processor-only compilation time.

“The Garp Architecture and C Compiler”, IEEE Computer, April 2000.
Advantages of RC over Processor Core
■ Conventional processors have several sources of inefficiency:
■ Heavy time-multiplexing of function units (ALUs).
■ Instruction issue overhead.
■ Memory hierarchy to deal with memory latency.
■ Operator mismatch.

[Figure: peak (raw) performance]
Advantages of RC
■ Relative to microprocessors: on average, a higher percentage of peak (or raw) computational density is achieved with reconfigurable devices:
■ Fine-grain flexibility leads to exploitation of problem-specific parallelism at many levels.
■ Also, many different computation models (or patterns) can be supported. In general, it is possible to match problem characteristics to hardware, through the use of problem-specific architectures and low-level circuit specialization.
■ Spatial mapping of computation versus multiplexing of function units (as in processors) relieves pressure for memory capacity and bandwidth, and promotes local communication patterns.
Fine-grained Reconfigurable Fabrics

Homogeneous fine-grained arrays are maximally flexible:
a. Admit a wide variety of computational architecture models: arrays of processors, hybrid approaches, hard-wired dataflow, systolic processing, vector processing, etc.
b. Admit a wide variety of parallelism modes: SIMD, MIMD, bit-level, etc. Resources can be deployed to lower latency when required for tight feedback loops (not possible with many parallel architectures that optimize for throughput).
c. Support many compilation/resource-management models: statically compiled, dynamically mapped.

Safe bet as a future standard device.
FPGAs are Reconfigurable

Seemingly obvious point, but …
1. Volume/cost graphs don’t accurately capture the potential real costs and other advantages.
2. Commercial applications have not taken advantage of reconfigurability.
• Xilinx / Altera (Intel) haven’t done much to help.
• Methodologies/tools nearly nonexistent.

Reconfiguration uses:
‣ Field upgrades ⇒ product life extension, changing requirements.
‣ In-system board-level testing and field diagnostics.
‣ Tolerance to faults.
‣ Risk management in system development.
‣ Runtime reconfiguration ⇒ higher silicon efficiency.
‣ Time-multiplexed pre-designed circuits make maximum use of resources.
‣ Runtime specialized circuit generation.
Multi-modal Computing Tasks
■ Mini/micro-UAVs
■ One piece of silicon for all of sensor processing, navigation, communications, planning, logging, etc.
■ At different times, different tasks take priority and consume a higher percentage of resources.
■ Other examples: hand-held multi-function device with GPS, smart image capture/analysis, and communications; Mars rover.

A premier application for reconfigurable devices is one with constrained size/weight that needs multiple functions at near-ASIC performance. Multiple ASICs are too expensive/big; a processor is too slow. Fine-grained reconfigurable devices have the flexibility to efficiently match task parallelism over a wide variety of tasks – deployed as needed and reconfigured as needed.
Sounds great, what’s the catch?
■ Lack of a programming model with convenient and effective tools.
■ Most successful computing applications using reconfigurable devices involve substantial “hand mapping” – essentially circuit design.
■ Complex issue, but perhaps changing the fabric design can help.
Rapid Runtime Reconfiguration
■ Might permit even higher efficiency through hardware sharing (multiplexing) and on-the-fly circuit specialization.
■ Largely unexploited (unproven) to date.
■ A few research projects have explored this idea.
■ Need to be careful – multiplexing adds cost.
■ Remember the “Binding Time Principle”: the earlier the “instruction” is bound, the less area & delay required for the implementation.
Rapid Runtime Reconfiguration

Why dynamic reconfiguration?
1. Time-multiplexing resources allows more efficient use of silicon (in ways ASICs typically do not):
a. Low-duty-cycle or “off critical path” computations time-share the fabric while the critical path stays mapped in:

[Figure: amount of reconfigurable fabric over total runtime, showing the size of maximum efficiency.]
Rapid Runtime Reconfiguration
b. Coarse data-dependent control flow maps in only useful dataflow (if-then-else):
c. Allowable task footprint may change as other tasks come and go or faults occur.

Fabric virtualization allows automatic migration up and down in device sizes and eases application development.
Rapid Runtime Reconfiguration
2. Runtime Circuit Specialization:
• Example: fixed-coefficient multipliers in an adaptive filter, changing value at a low rate.
• Aggressive constant propagation (based perhaps on runtime profiling) reduces circuit size and delay.
• Could use “branch/value/range prediction” to map the most common case and fault in exceptional cases.
• Can be template based – “fill in the blanks” – but better if we put PPR in the runtime loop!
• Array-HW-assisted place and route may make it possible.
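A software analogy for the constant-propagation point (illustrative only, not an actual FPGA flow): once the coefficient is bound, a generic multiplier collapses into a few shifts and adds, one per set bit of the constant, which is why a specialized circuit is smaller and faster than a general one.

```python
def specialize_multiplier(coeff):
    """Bind an integer coefficient early: return a function computing
    coeff * x using only shifts and adds (one partial product per set bit)."""
    shifts = [i for i in range(coeff.bit_length()) if (coeff >> i) & 1]
    def mul(x):
        return sum(x << i for i in shifts)
    return mul

# 10 = 0b1010: the specialized "circuit" needs only two partial products,
# versus a full general-purpose multiplier.
mul_by_10 = specialize_multiplier(10)
assert mul_by_10(7) == 70
```

Re-specializing when the coefficient changes (at a low rate, as in the adaptive-filter example) corresponds to regenerating the circuit at runtime.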
SCORE: Stream Computations Organized for Reconfigurable Execution
■ A computation model for reconfigurable systems
■ abstracts out: physical hardware details
‣ especially size and number of resources
■ Goal
■ achieve device independence
■ approach density/efficiency of raw hardware
■ allow application performance to scale based on system resources (w/out human intervention)
SCORE – Virtualized Fabric Model

High silicon efficiency:
♦ Only active parts of the data-flow consume resources (e.g., only the taken side of an if-else).
♦ High-duty-cycle critical path of computation stays mapped and remaining resources are shared by lower-duty-cycle paths.
♦ Particularly effective for a multi-tasking environment with time-varying task requirements.
♦ Fabric virtualization with demand paging:
• Get the most out of available resources by automatically time-multiplexing.
• Automatic migration up and down in device sizes.
• Eases application development.
SCORE Basics
■ Abstract computation is a dataflow graph (Kahn process network)
■ stream links between operators
■ dynamic dataflow rates
■ Compiler breaks up computation into compute pages
■ unit of scheduling and virtualization
■ stream links between pages
■ Virtual compute pages are “demand-paged” into available hardware resources as needed.
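The stream-linked-operators idea can be sketched with Python generators (a toy analogy: the operator names are made up, and real SCORE pages are hardware-mapped, not coroutines). Each operator consumes tokens from its input streams and produces tokens on its output, so a graph is built simply by wiring generators together:

```python
def source(values):
    # Leaf operator: emit a fixed token sequence onto its output stream.
    yield from values

def scale(stream, k):
    # One "compute page": multiply every token on the stream by k.
    for x in stream:
        yield k * x

def pairwise_sum(a, b):
    # Two-input operator: add streams token by token (fires only when
    # a token is available on each input, as in dataflow).
    for x, y in zip(a, b):
        yield x + y

# Wire a small dataflow graph computing out = 2*x + 3*y over streams.
xs = scale(source([1, 2, 3]), 2)
ys = scale(source([10, 20, 30]), 3)
out = list(pairwise_sum(xs, ys))   # [32, 64, 96]
```

Because each generator is self-contained and connected only through its streams, any operator could in principle be swapped in or out – the software analogue of demand-paging compute pages.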
Virtual Hardware Model
■ Dataflow graph is arbitrarily large
■ Hardware has finite resources
■ resources vary among implementations
■ Dataflow graph is scheduled on the hardware
■ happens automatically (software)
■ physical resources are abstracted in the compute model
■ Graph composition and node size are data dependent
Architecture Model

Hybrid processor: conventional RISC core, reconfigurable array, memory.
Serial Implementation
Spatial Implementation
Architecture Model (cont.)
■ Architecture model and SCORE compute model permit scaling over a wide range of IC processes and die sizes.
■ A 0.13 µm process with a 16 mm × 16 mm payload suggests:
■ 256 compute/memory tiles
■ total of 32K logic cells, 0.5 Gbit memory
■ RISC core with 32 Kbit I/D cache, area equivalent to 8 tiles.
Configurable System on a Chip
■ This micro-architecture and chip is an example CSoC
■ SCORE provides a general framework for SoC families
■ interconnect/architecture fabric
■ software model
‣ compute model for application assembly/scaling
‣ OS/runtime
■ both for
‣ standard cores
‣ custom, application-specific components (hardcoded accelerators)
Key Idea: Interconnect Fabric
■ Standard/common interconnect fabric
■ Mix-and-match nodes on fabric
■ provide different resource balance
■ match needs of particular applications
■ All use common compute model
■ share software and infrastructure
Sample Hybrid CSoC - vision2000
Xilinx Versal 2020
Reconfigurable Array Design Research
‣ Architecture
‣ Configurable Logic Blocks
‣ Alternatives to LUTs (less area, delay)

Many of these topics have been studied, but remain open (or secret).
As with most design decisions, it comes down to design-space exploration and cost/benefit analysis – ideally, finding the Pareto-optimal frontier.
In FPGA design, this is complicated by the fact that decisions are interrelated. For instance, CLB design strongly affects interconnection needs.
Reconfigurable Array Research
‣ Architecture
‣ Configurable Logic Blocks
‣ LUT size:

E. Ahmed and J. Rose, "The effect of LUT and cluster size on deep-submicron FPGA performance and density," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 3, pp. 288-298, March 2004, doi: 10.1109/TVLSI.2004.824300.

“Finally, our results show that a LUT size of 4 to 6 and cluster size of between 3-10 provides the best area-delay product for an FPGA”
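For intuition on the LUT-size trade-off the paper studies: a k-input LUT is just a 2^k-entry truth table addressed by its inputs. A minimal Python model (illustrative, not any vendor's primitive) makes the exponential configuration cost explicit:

```python
class LUT:
    """A k-input lookup table: 2**k configuration bits, evaluated by using
    the input bits as an address. Configuration storage roughly doubles
    for each extra input, which drives the LUT-size sweet spot."""
    def __init__(self, k, truth_table):
        assert len(truth_table) == 2 ** k   # one config bit per input combo
        self.k = k
        self.table = list(truth_table)

    def eval(self, *inputs):
        index = 0
        for bit in inputs:                  # pack input bits into an address
            index = (index << 1) | (bit & 1)
        return self.table[index]

# Configure a 2-LUT as XOR: entries indexed by (a, b) = 00, 01, 10, 11.
xor2 = LUT(2, [0, 1, 1, 0])
assert xor2.eval(1, 0) == 1 and xor2.eval(1, 1) == 0
```

A 4-LUT needs 16 configuration bits, a 6-LUT 64; the area-delay study above weighs that growth against the logic depth saved per level.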
Reconfigurable Array Research
‣ Architecture
‣ Configurable Logic Blocks
‣ Internal interconnection among LUTs/FFs

W. Feng, J. Greene, and A. Mishchenko, "Improving FPGA Performance with a S44 LUT Structure," in FPGA ’18: 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, Feb. 25-27, 2018, doi: 10.1145/3174243.3174272.

“we show that mapping to a 7-input LUT structure can approach the performance of 6-input LUTs while retaining the area and static power advantage of 4-input LUTs”
Reconfigurable Array Research
‣ Architecture
‣ Configurable Logic Blocks
‣ ALU versus LUT based, processor cores
‣ Hybrid fine-grained/coarse-grained?

A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin, and B. Hutchings, "A Reconfigurable Arithmetic Array for Multimedia Applications," 1999, pp. 135-143, doi: 10.1145/296399.296444.

A. Podobas, K. Sano and S. Matsuoka, "A Survey on Coarse-Grained Reconfigurable Architectures From a Performance Perspective," IEEE Access, vol. 8, pp. 146719-146743, 2020, doi: 10.1109/ACCESS.2020.3012084.
Reconfigurable Array Research
‣ Architecture
‣ Interconnection Network (relatively unexplored)
‣ Clos networks, fat trees, other ad hoc topologies
‣ Tool place-and-route time critical
‣ Configuration Structure
‣ Partial reconfiguration granularity
‣ dynamic reconfiguration (reconfigure while running)
‣ multiple contexts
‣ debugging interface
Reconfigurable Array Research
‣ Implementation
‣ Standard cells versus full custom
‣ Hybrid

J. Kim and J. Anderson, "Synthesizable Standard Cell FPGA Fabrics Targetable by the Verilog-to-Routing CAD Flow," ACM Transactions on Reconfigurable Technology and Systems, vol. 10, pp. 1-23, 2017, doi: 10.1145/3024063.

X. Tang, E. Giacomin, A. Alacchi, B. Chauviere and P. Gaillardon, "OpenFPGA: An Opensource Framework Enabling Rapid Prototyping of Customizable FPGAs," 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 2019, pp. 367-374, doi: 10.1109/FPL.2019.00065.

[Figure: tile area comparison.]
Reconfigurable Array Research
‣ Implementation
‣ Full-custom fabric layout generation (process portability, rapid DSE):
‣ Sticks
‣ BAG - Berkeley Analog Generator

E. Chang et al., "BAG2: A process-portable framework for generator-based AMS circuit design," 2018 IEEE Custom Integrated Circuits Conference (CICC), San Diego, CA, 2018, pp. 1-8, doi: 10.1109/CICC.2018.8357061.
Reconfigurable Array Research
‣ The “RISC” of FPGAs
‣ Reduced Complexity Reconfigurable Array (RCRA)
‣ Compared to state-of-the-art commercial arrays, can a very simple array compete with (or beat) them on:
‣ performance (clock frequency), area efficiency, power efficiency,
‣ area/delay product, power/delay product?
‣ Do the PPR (partition, place, & route) tools speed up?
‣ Ideas:
‣ If local interconnect is efficient, then perhaps clustering is not necessary.
‣ Are LUTs overkill?
‣ Simpler interconnect?
Class Projects
‣ For purposes of learning:
‣ Redesign a classic (simple) array:
‣ ex: Xilinx xc2064, xc6200
‣ Advanced:
‣ first attempt at RCRA
‣ multi-context optimized for dynamic reconfiguration
‣ hybrid coarse-grain/fine-grain array
‣ How many projects?
‣ How many teams?
‣ How to divide up functionality and/or be redundant?

Do we want to do architecture research or simply focus on implementation issues? Do you want to do research in implementation issues?
Project Proposals
End of Lecture 6