LogCA: A High-Level Performance Model for Hardware Accelerators

Muhammad Shoaib Bin Altaf*, David A. Wood

University of Wisconsin-Madison

"Everything should be made as simple as possible, but not simpler" (Albert Einstein)

*Now at AMD Research, Austin TX

Executive Summary

• Accelerators do not always perform as expected
• Crucial for programmers and architects to understand the factors which affect performance
• Simple analytical models beneficial early in the design stage
• Our proposal: LogCA
– High-level performance model
– Help identify design bottlenecks and possible optimizations
• Validation across a variety of on-chip and off-chip accelerators
• Two retrospective case studies demonstrate the usefulness of the model

Outline

• Motivation
• LogCA
• Results
• Conclusion

Why Need a Model?

"An accelerator is a separate architectural substructure ... that is architected using a different set of objectives than the base processor, ...., the accelerator is tuned to provide HIGHER PERFORMANCE ….. than with the general-purpose base hardware"

S. Patel and W. Hwu. Accelerator Architectures. Micro 2008

[Images: M7: Next Generation SPARC, Hot Chips 26, 2014; Power8, Hot Chips 25, 2013]

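Why a Model?

[Figure: Encryption algorithm on UltraSPARC T2. Time (ms) versus Block Size (Bytes) on a logarithmic time axis (0.001 to 10), with one curve for the Host and one for the Accelerator. The break-even point separates the "Host outperforms" region from the "Accelerator outperforms" region. Annotations: "Amdahl's Law for Accelerators" and a "Better" direction arrow.]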

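Why a Model?

[Figure: Advanced Encryption Standard (AES). Speedup (logarithmic axis, 0.1 to 100) versus Offloaded Data (Bytes) for UltraSPARC T2, SPARC T4, and GPU, with the break-even points marked. Annotation: a "Better" direction arrow.]

Running the same kernel, accelerators can have different break-even points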

Outline



The Performance Model

• Inspired by LogP [CACM 1996]
• Abstract accelerator using five parameters
– L Latency: Cycles to move data
– o Overhead: Setup cost
– g Granularity: Size of the off-loaded data
– C Computational index: Amount of work done per byte of data
– A Acceleration: Speedup ignoring overheads
• Sixth parameter β generalizes to kernels with non-linear complexity

[Diagram: Host and Accelerator connected through an Interface]

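The Performance Model

• Execution w/o an accelerator
– $T_0(g) = C_0(g)$
• Execution with one accelerator
– $T_1(g) = o_1(g) + L_1(g) + C_1(g)$, where $C_1(g) = \frac{C_0(g)}{A}$

[Diagram: timeline comparing host-only time $T_0(g) = C_0(g)$ with accelerated time $T_1(g)$, which is made up of $o_1(g)$, $L_1(g)$, and $C_1(g)$ across the Host, the Interface, and the Accelerator. The difference between the two is labeled "Gain".]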
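The two execution-time equations can be turned into a small calculator. The sketch below is illustrative only: it assumes host work of the form C * g**beta, a constant overhead o, and a latency L that does not depend on g. The class name `LogCA`, its method names, and the parameter values are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogCA:
    """Hypothetical container for the model parameters (units: cycles, bytes)."""
    L: float            # latency: cycles to move data
    o: float            # overhead: setup cost
    C: float            # computational index: work per byte of data
    A: float            # acceleration: speedup ignoring overheads
    beta: float = 1.0   # complexity exponent (1.0 means a linear kernel)

    def t0(self, g: float) -> float:
        """Host-only time T0(g) = C0(g), assuming C0(g) = C * g**beta."""
        return self.C * g ** self.beta

    def t1(self, g: float) -> float:
        """Accelerated time T1(g) = o1 + L1 + C1(g), with C1(g) = C0(g) / A.

        Assumes constant overhead and granularity-independent latency.
        """
        return self.o + self.L + self.t0(g) / self.A

    def speedup(self, g: float) -> float:
        return self.t0(g) / self.t1(g)


# Made-up parameter values, chosen only to show the shape of the curve.
m = LogCA(L=100.0, o=900.0, C=10.0, A=20.0)
for g in (16, 256, 4096, 65536):
    print(g, round(m.speedup(g), 2))
```

With these made-up values the speedup is below 1 for small g and approaches, but never reaches, A as g grows.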

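Granularity-independent latency

• Captures the effect of granularity on speedup
• Speedup bounded by acceleration
– $\lim_{g \to \infty} Speedup(g) = A$
• Overheads dominate at smaller granularities
– $Speedup(g)\big|_{g=1} = \frac{C}{o + L + \frac{C}{A}} < \frac{C}{o + L}$

[Figure: Speedup(g) versus Granularity (Bytes) on a logarithmic speedup axis (0.1 to 10); the curve levels off at A. Annotation: "Amdahl's law for Accelerators".]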
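Both limits on this slide can be checked numerically. The following is a minimal sketch that assumes a linear kernel (host time C * g), a constant overhead o, and a latency L that does not grow with g; the function name and the default parameter values are illustrative assumptions.

```python
def speedup(g, C=10.0, o=900.0, L=100.0, A=20.0):
    """Speedup(g) = C*g / (o + L + C*g/A), latency independent of g (assumed form)."""
    return (C * g) / (o + L + C * g / A)


# Large granularity: overheads are amortized and the speedup approaches A.
large_g = speedup(1e9)

# Smallest granularity: the speedup stays below C / (o + L).
small_g = speedup(1)
small_g_bound = 10.0 / (900.0 + 100.0)

print(f"speedup(1e9) = {large_g:.4f} (A = 20)")
print(f"speedup(1)   = {small_g:.6f} (bound C/(o+L) = {small_g_bound:.6f})")
```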

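Performance Metrics

• Right amount of off-loaded data?
• Inspired by vector machine metrics $N_v$, $N_{1/2}$
• $g_1$: Granularity for a speedup of 1
– $g_1$ is essentially independent of acceleration
– Identify complexity of the interface
• $g_{A/2}$: Granularity for a speedup of $A/2$
– Increasing A also increases $g_{A/2}$

[Figure: Speedup versus Granularity (Bytes) on a logarithmic speedup axis (0.1 to 100); the curve levels off at A, with $g_1$ and $g_{A/2}$ marked on the granularity axis. A scale below the plot relates interface complexity to $g_1$: Simple Interface to Complex Interface, small $g_1$ to large $g_1$.]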
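Both metrics have a closed form under simple assumptions. The sketch below assumes a linear kernel with constant overhead and granularity-independent latency, so that Speedup(g) = C*g / (o + L + C*g/A); solving for g gives g = s*(o + L) / (C*(1 - s/A)) for a target speedup s. The function name and the parameter values are hypothetical.

```python
def g_for_speedup(target, C, o, L, A):
    """Granularity at which C*g / (o + L + C*g/A) equals `target` (needs 0 < target < A)."""
    if not 0 < target < A:
        raise ValueError("target speedup must lie strictly between 0 and A")
    return target * (o + L) / (C * (1 - target / A))


C, o, L = 10.0, 900.0, 100.0  # illustrative values only
results = {}
for A in (10.0, 100.0):
    g1 = g_for_speedup(1.0, C, o, L, A)        # equals (o + L) / (C * (1 - 1/A))
    g_half = g_for_speedup(A / 2, C, o, L, A)  # equals A * (o + L) / C
    results[A] = (g1, g_half)
    print(f"A = {A:5.0f}: g_1 = {g1:8.1f}, g_A/2 = {g_half:8.1f}")
```

In this sketch a tenfold increase in A moves $g_1$ by under 10% while $g_{A/2}$ grows tenfold, which matches the two sub-bullets above.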

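Granularity-dependent latency

• Speedup bounded by computational intensity C/L
– $\lim_{g \to \infty} Speedup(g) < \frac{C}{L}$ (linear algorithms)
• Speedup for sub-linear algorithms asymptotically decreases with the increase in granularity

[Figures: two plots of Speedup(g) versus Granularity (Bytes) on a logarithmic speedup axis (0.1 to 10), each with A marked. In the "Linearly" panel the curve levels off below $\frac{C}{L}$; in the "Sub-linearly" panel the speedup falls as g increases.]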
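The contrast between linear and sub-linear kernels can be reproduced with a few lines. This sketch assumes host work C * g**beta and a latency of L * g that grows with the amount of data moved; the power-law use of beta and all parameter values are assumptions made for illustration.

```python
def speedup(g, C, o, L, A, beta=1.0):
    """Speedup with granularity-dependent latency: C*g**beta / (o + L*g + C*g**beta / A)."""
    work = C * g ** beta
    return work / (o + L * g + work / A)


params = dict(C=50.0, o=100.0, L=5.0, A=20.0)  # here C/L = 10 and A = 20

# Linear kernel (beta = 1): the speedup saturates below C/L.
linear_limit = speedup(1e9, **params)

# Sub-linear kernel (beta = 0.5): moving data outgrows the work, so speedup falls.
sub_small = speedup(1e4, beta=0.5, **params)
sub_large = speedup(1e8, beta=0.5, **params)

print(f"linear, g = 1e9:     {linear_limit:.3f} (C/L = 10)")
print(f"sub-linear, g = 1e4: {sub_small:.4f}")
print(f"sub-linear, g = 1e8: {sub_large:.4f}")
```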

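Granularity-dependent latency

• Computational intensity must be greater than 1 to achieve any speedup
• Computational intensity should be greater than peak performance (A) to achieve a speedup of A/2

[Figure: Speedup versus Granularity (Bytes), with speedup levels 1, A/2, and A marked, $g_1$ and $g_{A/2}$ marked on the granularity axis, and the conditions $\frac{C}{L} \ge 1$ and $\frac{C}{L} \ge A$ annotated.]

Performance metrics help programmers early in the design cycle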
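The two thresholds can be checked with a short derivation. It is a sketch that assumes a linear kernel with host time $C g$, a constant overhead $o$, and a latency $L g$ that grows with granularity; these functional forms are assumptions rather than something stated on the slide.

```latex
\begin{align*}
\mathrm{Speedup}(g) &= \frac{C g}{o + L g + \frac{C g}{A}} \\
S_\infty = \lim_{g \to \infty} \mathrm{Speedup}(g) &= \frac{C}{L + \frac{C}{A}} \\
S_\infty > 1 &\iff \frac{C}{L} > \frac{A}{A - 1} > 1 \qquad (A > 1) \\
S_\infty \ge \frac{A}{2} &\iff 2C \ge A L + C \iff \frac{C}{L} \ge A
\end{align*}
```

Because the speedup rises toward $S_\infty$ without reaching it when $o > 0$, a finite $g_{A/2}$ exists only when $\frac{C}{L}$ is strictly greater than $A$.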

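Bottleneck Analysis using LogCA

• 10X change in parameter → 20% performance gain
• Helps focus on performance bottlenecks

[Figure: Speedup versus Granularity (Bytes) on a logarithmic speedup axis (0.1 to 1000). Curves: LogCA (baseline), L_0.1x, o_0.1x, C_10x, A_10x. Regions of the plot are labeled with the parameters that limit performance there (for example oC, oCA, A), and $\frac{C}{L}$ is marked.]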
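The sensitivity test on this slide can be sketched in code. The example reads the rule as "a parameter is a bottleneck at granularity g if improving it by 10x raises the speedup by at least 20%", and it assumes a linear kernel with latency L * g. The function names, the dictionary layout, and the parameter values are hypothetical.

```python
def speedup(g, p):
    """Linear kernel with granularity-dependent latency: C*g / (o + L*g + C*g/A)."""
    work = p["C"] * g
    return work / (p["o"] + p["L"] * g + work / p["A"])


def bottlenecks(g, base, factor=10.0, threshold=0.20):
    """Return the parameters whose 10x improvement gains at least 20% at granularity g."""
    reference = speedup(g, base)
    found = []
    # L and o are improved by shrinking them; C and A by growing them,
    # mirroring the L_0.1x, o_0.1x, C_10x, and A_10x curves in the figure.
    for name, scale in (("L", 1 / factor), ("o", 1 / factor), ("C", factor), ("A", factor)):
        trial = dict(base, **{name: base[name] * scale})
        if speedup(g, trial) >= (1 + threshold) * reference:
            found.append(name)
    return found


base = {"L": 0.1, "o": 1000.0, "C": 50.0, "A": 20.0}  # illustrative values only
for g in (16, 1024, 1_000_000):
    print(g, bottlenecks(g, base))
```

With these made-up values the bottleneck set moves from overhead and computational index at small granularities to acceleration alone at large ones, the same kind of progression the region labels in the figure describe.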

Outline



Experimental Methodology

• Fixed-function and general-purpose accelerators
– Cryptographic accelerators on SPARC architectures
– Discrete and integrated GPUs
• Kernels with varying complexities
– Encryption, Hashing, Matrix Multiplication, FFT, Search, Radix Sort
• Retrospective case studies
– Cryptographic interface in SPARC architectures
– Memory interface in GPUs

Case Study I: Cryptographic Interface in the SPARC Architecture

[Diagram labels: PCIe Crypto Accelerator, UltraSPARC T2, SPARC T3, SPARC T4 engine, SPARC T4 instructions]

Conclusion

• Simple models effective in predicting performance of accelerators
• Proposed a high-level performance model for hardware accelerators
• These models help programmers and architects visually identify bottlenecks and suggest optimizations
• Performance metrics for programmers in deciding the right amount of offloaded data
• Limitations include inability to model resource contention, caches, and irregular memory access patterns

Questions?

[Image source: http://www.medarcade.com/]
