LogCA: A High-Level Performance Model for Hardware Accelerators
Muhammad Shoaib Bin Altaf* David A. Wood
University of Wisconsin-Madison
"Everything should be made as simple as possible, but not simpler" - Albert Einstein
*Now at AMD Research, Austin TX
Executive Summary
• Accelerators do not always perform as expected
• Crucial for programmers and architects to understand the factors which affect performance
• Simple analytical models beneficial early in the design stage
• Our proposal: LogCA
– High-level performance model
– Help identify design bottlenecks and possible optimizations
• Validation across variety of on-chip and off-chip accelerators
• Two retrospective case studies demonstrate the usefulness of the model
Outline
• Motivation
• LogCA
• Results
• Conclusion
Why Do We Need a Model?
“An accelerator is a separate architectural substructure ... that is architected using a different set of objectives than the base processor, ..., the accelerator is tuned to provide HIGHER PERFORMANCE ... than with the general-purpose base hardware”
S. Patel and W. Hwu. Accelerator Architectures. IEEE Micro, 2008
[Images: M7: Next Generation SPARC, Hot Chips 26, 2014; Power8, Hot Chips 25, 2013]
Why a Model?
[Figure: Time (ms) vs. block size (bytes), log scale, for the host and the accelerator running an encryption algorithm on UltraSPARC T2. Annotations: break-even point, with regions labelled "Host outperforms" and "Accelerator outperforms".]
Amdahl’s Law for Accelerators
Why a Model?
[Figure: Speedup vs. offloaded data (bytes), log scale, for Advanced Encryption Standard (AES) on UltraSPARC T2, SPARC T4, and GPU, with each break-even point marked.]
Running the same kernel, accelerators can have different break-even points
The Performance Model
• Inspired by LogP [CACM 1996]
• Abstract accelerator using five parameters
– L Latency: Cycles to move data
– o Overhead: Setup cost
– g Granularity: Size of the off-loaded data
– C Computational index: Amount of work done per byte of data
– A Acceleration: Speedup ignoring overheads
• Sixth parameter $\beta$ generalizes to kernels with non-linear complexity
[Diagram: Host and Accelerator connected by an Interface]
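The Performance Model
• Execution w/o an accelerator
– $T_0(g) = C_0(g)$
• Execution with one accelerator
– $T_1(g) = o_1(g) + L_1(g) + C_1(g)$, where $C_1(g) = \frac{C_0(g)}{A}$
[Figure: timeline comparing $T_0(g) = C_0(g)$ on the host with $T_1(g)$ on the accelerator, split into $o_1(g)$, $L_1(g)$, and $C_1(g)$; the difference is labelled "Gain". Host and Accelerator are connected by an Interface.]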
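The two execution-time equations can be sketched in Python. This is a minimal sketch, not the authors' code: the class name `LogCA`, the flag `latency_per_byte`, the functional forms $C_0(g) = C \cdot g^{\beta}$ and $L_1(g) \in \{L, L \cdot g\}$, the constant overhead, and all parameter values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogCA:
    """Hypothetical container for the five LogCA parameters plus beta."""
    L: float            # latency: cycles to move data
    o: float            # overhead: setup cost (assumed constant in g)
    C: float            # computational index: work per byte
    A: float            # acceleration: speedup ignoring overheads
    beta: float = 1.0   # complexity exponent (1.0 = linear kernel)
    latency_per_byte: bool = False  # assumed switch: L1(g) = L*g if True, else L

    def host_time(self, g: float) -> float:
        """T0(g) = C0(g), assuming C0(g) = C * g**beta."""
        return self.C * g ** self.beta

    def accel_time(self, g: float) -> float:
        """T1(g) = o1(g) + L1(g) + C1(g), with C1(g) = C0(g) / A."""
        latency = self.L * g if self.latency_per_byte else self.L
        return self.o + latency + self.host_time(g) / self.A

    def speedup(self, g: float) -> float:
        """Speedup(g) = T0(g) / T1(g)."""
        return self.host_time(g) / self.accel_time(g)


model = LogCA(L=100.0, o=1000.0, C=50.0, A=20.0)  # illustrative values only
for g in (16, 1024, 65536):
    print(g, round(model.speedup(g), 3))
```

Keeping the parameters in one immutable object makes it easy to compare design points by constructing a second `LogCA` with a single field changed.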
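Granularity independent latency
• Captures the effect of granularity on speedup
• Speedup bounded by acceleration
– $\lim_{g \to \infty} \mathit{Speedup}(g) = A$
• Overheads dominate at smaller granularities
– $\mathit{Speedup}(g)\big|_{g=1} = \frac{C}{o + L + \frac{C}{A}} < \frac{C}{o + L}$
[Figure: Speedup(g) vs. granularity (bytes), log scale, with the bound A marked. Caption: Amdahl’s law for accelerators.]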
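Both limits on this slide can be checked numerically. The sketch below assumes $\mathit{Speedup}(g) = C g^{\beta} / (o + L + C g^{\beta}/A)$, that is, a latency term that does not grow with g. The function name and parameter values are hypothetical.

```python
def speedup_fixed_latency(g, L, o, C, A, beta=1.0):
    """Speedup when latency does not depend on granularity: L1(g) = L."""
    work = C * g ** beta  # assumed host time C0(g)
    return work / (o + L + work / A)


# Illustrative parameter values only (not measurements).
P = dict(L=100.0, o=1000.0, C=50.0, A=20.0)

small = speedup_fixed_latency(1, **P)     # overheads dominate
large = speedup_fixed_latency(1e12, **P)  # acceleration dominates
print(f"g=1:    {small:.4f}  (bound C/(o+L) = {P['C'] / (P['o'] + P['L']):.4f})")
print(f"g=1e12: {large:.4f}  (bound A = {P['A']})")
```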
Performance Metrics
• Right amount of off-loaded data?
• Inspired from vector machine metrics $N_{1/2}$, $N_v$
• $g_1$: Granularity for a speedup of 1
– $g_1$ is essentially independent of acceleration
– Identify complexity of the interface
• $g_{A/2}$: Granularity for a speedup of $A/2$
– Increasing A also increases $g_{A/2}$
[Figure: Speedup vs. granularity (bytes), log scale, marking $A$, $g_1$, and $g_{A/2}$. Annotation: simple interface, small $g_1$; complex interface, large $g_1$.]
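For a linear kernel with granularity-independent latency, both metrics have closed forms. These are derived here from the assumed expression $\mathit{Speedup}(g) = C g / (o + L + C g / A)$ and are not taken from the slides: $g_1 = \frac{A (o + L)}{C (A - 1)}$ and $g_{A/2} = \frac{A (o + L)}{C}$. Function names and parameter values are hypothetical.

```python
def speedup(g, L, o, C, A):
    """Linear kernel, granularity-independent latency (assumed form)."""
    return C * g / (o + L + C * g / A)


def g_1(L, o, C, A):
    """Granularity where speedup reaches 1 (requires A > 1)."""
    return A * (o + L) / (C * (A - 1))


def g_half(L, o, C, A):
    """Granularity where speedup reaches A/2."""
    return A * (o + L) / C


base = dict(L=100.0, o=1000.0, C=50.0)  # illustrative values only
for A in (10.0, 100.0):
    print(A, round(g_1(A=A, **base), 2), round(g_half(A=A, **base), 2))
```

With these values a tenfold increase in A moves $g_1$ by under 10% while $g_{A/2}$ grows tenfold, which matches the two sub-bullets above.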
Granularity dependent latency
• Speedup bounded by computational intensity $C/L$
– $\lim_{g \to \infty} \mathit{Speedup}(g) < \frac{C}{L}$ (linear algorithms)
• Speedup for sub-linear algorithms asymptotically decreases with the increase in granularity
[Figure: two plots of Speedup(g) vs. granularity (bytes), log scale, one for linear and one for sub-linear algorithms, with $A$ and $C/L$ marked.]
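A small numerical sketch of both behaviours, assuming latency grows linearly with the offloaded data, $L_1(g) = L \cdot g$, so that $\mathit{Speedup}(g) = C g^{\beta} / (o + L g + C g^{\beta}/A)$. Under that assumption a linear kernel ($\beta = 1$) levels off at $C / (L + C/A)$, which lies below both $C/L$ and $A$. The parameter values and the choice $\beta = 0.5$ for the sub-linear case are hypothetical.

```python
def speedup(g, L, o, C, A, beta=1.0):
    """Granularity-dependent latency: assumes L1(g) = L * g."""
    work = C * g ** beta
    return work / (o + L * g + work / A)


P = dict(L=2.0, o=1000.0, C=50.0, A=20.0)    # illustrative values only
limit = P["C"] / (P["L"] + P["C"] / P["A"])  # linear-kernel asymptote

for g in (1e2, 1e4, 1e6, 1e8):
    lin = speedup(g, beta=1.0, **P)
    sub = speedup(g, beta=0.5, **P)
    print(f"g={g:.0e}  linear={lin:.3f}  sub-linear={sub:.5f}")
print(f"asymptote={limit:.3f}  C/L={P['C'] / P['L']:.1f}  A={P['A']}")
```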
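Granularity dependent latency
• Computational intensity $C/L$ must be greater than 1 to achieve any speedup
• Computational intensity $C/L$ should be greater than the peak acceleration $A$ to achieve a speedup of $A/2$
[Figure: Speedup vs. granularity (bytes), marking $g_1$ at speedup 1 (requires $C/L \geq 1$) and $g_{A/2}$ at speedup $A/2$ (requires $C/L \geq A$), with the bound $A$ above both.]
Performance metrics help programmers early in the design cycle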
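The two conditions can be illustrated by solving for the granularity that reaches a target speedup. This sketch assumes a linear kernel with $\mathit{Speedup}(g) = C g / (o + L g + C g / A)$; solving $\mathit{Speedup}(g) = t$ gives $g = \frac{t \cdot o}{C - t (L + C/A)}$, which exists only when the denominator is positive. The helper `g_for_target` and the parameter values are hypothetical.

```python
def g_for_target(target, L, o, C, A):
    """Granularity at which a linear kernel reaches `target` speedup.

    Returns None when the target lies at or above the asymptote
    C / (L + C/A), i.e. when no finite granularity reaches it.
    """
    slack = C - target * (L + C / A)
    return target * o / slack if slack > 0 else None


o, C, A = 1000.0, 50.0, 20.0  # illustrative values only
for L in (100.0, 10.0, 2.0):
    print(f"C/L={C / L:5.1f}  g_1={g_for_target(1.0, L, o, C, A)}  "
          f"g_A/2={g_for_target(A / 2, L, o, C, A)}")
```

With $C/L = 0.5$ neither metric exists, with $C/L = 5$ only $g_1$ exists, and with $C/L = 25 > A$ both do. In this form a speedup of $A/2$ at exactly $C/L = A$ is approached only in the limit.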
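Bottleneck Analysis using LogCA
• 10X change in parameter → 20% performance gain
• Helps focus on performance bottlenecks
[Figure: Speedup vs. granularity (bytes), log scale, for the baseline LogCA curve and four variants (L_0.1x, o_0.1x, C_10x, A_10x). Regions along the granularity axis carry labels such as oC, oCA, and A naming the limiting parameters; $A$ and $C/L$ are marked.]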
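One way to reproduce this analysis is a one-at-a-time sensitivity sweep. The sketch reads the slide's rule as "a parameter is a bottleneck at granularity g if improving it 10X raises speedup by at least 20%"; that reading, the function names, the granularity-dependent latency form, and the parameter values are all assumptions.

```python
def speedup(g, L, o, C, A, beta=1.0):
    """Assumed form with granularity-dependent latency L1(g) = L * g."""
    work = C * g ** beta
    return work / (o + L * g + work / A)


def bottlenecks(g, params, factor=10.0, threshold=0.20):
    """Parameters whose `factor`-fold improvement lifts speedup by >= threshold."""
    base = speedup(g, **params)
    improved = {
        "L": dict(params, L=params["L"] / factor),  # L_0.1x
        "o": dict(params, o=params["o"] / factor),  # o_0.1x
        "C": dict(params, C=params["C"] * factor),  # C_10x
        "A": dict(params, A=params["A"] * factor),  # A_10x
    }
    return [name for name, p in improved.items()
            if speedup(g, **p) >= (1.0 + threshold) * base]


P = dict(L=1.0, o=1000.0, C=50.0, A=20.0)  # illustrative values only
for g in (1, 1e3, 1e6):
    print(f"g={g:.0e}: {bottlenecks(g, P)}")
```

For these values overhead and computational index limit small offloads, while latency, computational index, and acceleration limit large ones.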
Experimental Methodology
• Fixed-function and general-purpose accelerators
– Cryptographic accelerators on SPARC architectures
– Discrete and integrated GPUs
• Kernels with varying complexities
– Encryption, Hashing, Matrix Multiplication, FFT, Search, Radix Sort
• Retrospective case studies
– Cryptographic interface in SPARC architectures
– Memory interface in GPUs
Case Study I: Cryptographic Interface in the SPARC Architecture
[Figure panels: PCIe Crypto Accelerator; UltraSPARC T2; SPARC T3; SPARC T4 engine; SPARC T4 instructions]
Conclusion
• Simple models effective in predicting performance of accelerators
• Proposed a high-level performance model for hardware accelerators
• These models help programmers and architects visually identify bottlenecks and suggest optimizations
• Performance metrics for programmers in deciding the right amount of offloaded data
• Limitations include inability to model resource contention, caches, and irregular memory access patterns
Questions?
Source: http://www.medarcade.com/