GPGPU in HPC · UBC Blogs · blogs.ubc.ca/karthik/files/2016/11/SC_Talk-Justin.pdf · 2016-11-15

GPGPU in HPC

GPGPU

Scientific Applications

Soft Errors

Example bit flip: 0001 → 0101

Baumann et al. [TDMR, 2005]

Traditional Method: DMR

Dual Modular Redundancy (DMR)

•  Run 2 copies
•  Compare for divergence

Too much energy consumption!
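The two DMR steps above can be sketched in a few lines. This is only an illustration on the host side, not how any hardware or GPU DMR scheme is built; `run_dmr` and `DivergenceError` are hypothetical names introduced here.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class DivergenceError(RuntimeError):
    """Raised when the two redundant copies disagree (hypothetical name)."""


def run_dmr(compute: Callable[[], T]) -> T:
    """Run `compute` twice and compare the two results for divergence."""
    first = compute()   # copy 1
    second = compute()  # copy 2: the same work is done again
    if first != second:
        raise DivergenceError(f"copies diverged: {first!r} != {second!r}")
    return first


result = run_dmr(lambda: sum(range(10)))
print(result)
```

Every call does the full computation twice, which is where the energy cost noted on the slide comes from.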

Software Solutions

Advantages

•  No hardware modification
•  Errors can be masked
•  Allow selective protection

System layers (figure): Application Level, Operating System Level, Architectural Level, Device/Circuit Level. Arrow label: Impact.

Challenges for GPGPU Resilience

•  Different architecture and programming model from CPUs
•  No scalable fault injection tools for HPC GPGPU applications

Our Contributions

•  LLFI-GPU: Scalable Fault Injector
•  Characterization of Error Propagation
•  Implications on Error Mitigations

Existing Publicly Available GPU Fault Injectors

•  Hauberk [IPDPS, 2011]
    •  Source code level fault injection
    •  Not representative of hardware errors
•  GPU-Qin [ISPASS, 2014]
    •  Debugger-based
    •  Execution is slow
•  GPGPU-Sim based fault injector [DSN, 2015]
    •  Not full system simulation

Goals of LLFI-GPU

•  Native speed
    •  Program-level fault injection
    •  Compile to binary
•  Full system simulation
    •  Execute on real hardware
    •  Able to simulate different failure outcomes
•  Representativeness
    •  LLVM IR level fault injection
    •  Close to assembly, yet preserves high-level program symbols

LLVM (Low Level Virtual Machine)

LLFI for CPU: https://github.com/DependableSystemsLab/LLFI

LLFI-GPU: Overview

NVCC compilation (figure): .cu files → LLVM IR → PTX Assembly → SASS Assembly

Instrumentation passes (LLFI-GPU), shown on an instruction sequence:

…
R0 = add R1, R2
R0 = injectFault(R0)
R4 = mul R0, R3
…
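The instrumented sequence above can be mimicked on plain integers to show how one injected fault reaches a later instruction. This is a minimal sketch assuming 32-bit unsigned registers and a single bit-flip; `inject_fault` and `run` are hypothetical helpers and do not reflect the actual LLFI-GPU runtime.

```python
MASK32 = 0xFFFFFFFF  # assumption: registers are 32-bit unsigned


def inject_fault(value: int, bit: int, width: int = 32) -> int:
    """Flip exactly one bit of a `width`-bit value."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def run(r1: int, r2: int, r3: int, fault_bit=None) -> int:
    r0 = (r1 + r2) & MASK32               # R0 = add R1, R2
    if fault_bit is not None:
        r0 = inject_fault(r0, fault_bit)  # R0 = injectFault(R0)
    r4 = (r0 * r3) & MASK32               # R4 = mul R0, R3
    return r4


golden = run(1, 2, 3)               # fault-free run
faulty = run(1, 2, 3, fault_bit=2)  # flip bit 2 of R0 before the multiply
print(golden, faulty)
```

Because the flipped value of R0 feeds the multiply, the corruption carries into R4; comparing `golden` and `faulty` is the basic check a fault injection experiment performs.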

Advantages of LLFI-GPU

•  Compiles large GPGPU programs
•  1000x faster compared to GPU-Qin (MatrixMul)
•  Representative simulation
•  Full system simulation of soft errors
•  Open-source
•  https://github.com/DependableSystemsLab/LLFI-GPU

Experiment Setup: Nvidia K20

•  12 Benchmarks
    •  Rodinia & Parboil suites
    •  Lulesh (LLNL), Barnes-Hut (Texas State Univ.), Fiber (Northeastern Univ.), Circuit Solver (Rice Univ.) and NMF (UC Berkeley)
•  Fault Injection
    •  10,000 per application (Error bar: 0.22%-2.99%, 95% confidence level)
•  Fault Model
    •  Single bit-flip
    •  Transient faults in execution units

Failure Outcomes

•  Silent Data Corruption (SDC)
    •  Mismatch in program outputs from golden run and fault injection run
•  Crash
    •  CUDA exceptions (e.g., illegal memory address)
    •  Cause kernel execution to halt
•  Benign
    •  No effect on program output
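The three outcome categories can be expressed as a small classifier over the golden and fault-injected outputs. This is a sketch under one stated assumption: a run halted by an exception is represented here by `None`. The names `Outcome` and `classify` are hypothetical.

```python
from enum import Enum
from typing import Optional, Sequence


class Outcome(Enum):
    SDC = "silent data corruption"
    CRASH = "crash"
    BENIGN = "benign"


def classify(golden: Sequence[float],
             faulty: Optional[Sequence[float]]) -> Outcome:
    """Classify one fault injection run against the golden run."""
    if faulty is None:                # assumption: None means the kernel halted
        return Outcome.CRASH
    if list(faulty) != list(golden):  # output mismatch with no exception
        return Outcome.SDC
    return Outcome.BENIGN             # fault had no effect on the output


print(classify([1.0, 2.0], [1.0, 2.5]))
```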


Research Question 1

What is the percentage of SDCs in different memory states?

Memory State

•  Total Memory (TM)
•  Result Memory (RM)
•  Output Memory (OM)

…
cudaMalloc(M1)
cudaMalloc(M2)
cudaMemcpy(M1, …)
cudaMemcpy(M2, …)
…
Kernel<<<>>>, …
…
cudaMemcpy(…, M2)
…
Foreach(M2): if (ele > 0) { print(ele) }
…
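One way to read the pseudocode is sketched below with plain Python lists standing in for device buffers. The mapping is an interpretation, not something the slide spells out: TM is taken to be everything allocated on the device (M1 and M2), RM the buffer copied back to the host (M2), and OM the elements of RM that the program actually prints. The kernel body is made up for illustration.

```python
# Simulated device allocations (stand-ins for cudaMalloc + cudaMemcpy to device).
m1 = [3, -1, 4]  # input buffer
m2 = [0, 0, 0]   # buffer the kernel writes into


def kernel(src, dst):
    """Hypothetical kernel: doubles every element of src into dst."""
    for i, x in enumerate(src):
        dst[i] = x * 2


kernel(m1, m2)

total_memory = {"M1": m1, "M2": m2}  # TM (assumed): all device allocations
result_memory = list(m2)             # RM (assumed): copied back to the host
output_memory = [e for e in result_memory if e > 0]  # OM (assumed): printed elements

# Relative size of RM within TM, counted in elements.
rm_fraction = len(result_memory) / sum(len(b) for b in total_memory.values())
print(output_memory, rm_fraction)
```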

SDC in Different Memory States

Size of States

        bfs       barneshut   nmf
RM      14.29%    37.50%      0.03%
TM      100%      100%        100%

Average size of RM in all benchmarks: 13.56%

SDC of States

                     bfs      barneshut   nmf
SDC(TM) - SDC(RM)    0.00%    0.20%       0.00%

Average of SDC(TM - RM) in all benchmarks: 0.09%

•  Most of the faults in TM propagate to RM
•  Checking RM reduces ~86% overhead while retaining coverage
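The ~86% figure follows from the RM size if checking overhead is assumed to scale linearly with the amount of state checked. The short calculation below uses only the percentages shown on the slide; the linear model is the assumption.

```python
# Average RM size relative to TM, as reported on the slide.
avg_rm_size = 0.1356

# Assumption: overhead is proportional to the amount of state checked,
# so checking only RM instead of all of TM saves the remaining fraction.
avg_reduction = 1.0 - avg_rm_size

# Same calculation for the three benchmarks listed on the slide.
rm_sizes = {"bfs": 0.1429, "barneshut": 0.3750, "nmf": 0.0003}
reductions = {name: 1.0 - size for name, size in rm_sizes.items()}

print(f"average reduction: {avg_reduction:.2%}")
for name, r in reductions.items():
    print(f"{name}: {r:.2%}")
```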

Example of Checkers

•  Check value range of particular states
    •  Calculating angle: if (angle > 60 or angle < 0) { error detected }
•  Overhead is directly proportional to the number of states checked
    •  Checking RM reduces ~86% overhead
    •  Small loss of coverage

Pattabiraman et al. [TDSC, 2011]; Hari et al. [DSN, 2012]
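A value-range checker of the kind described above might look like the following. The 0 to 60 bounds come from the angle example on the slide; `find_violations` is a hypothetical helper, and the cost of the loop grows with the number of values checked, in line with the overhead remark.

```python
from typing import List, Sequence


def find_violations(values: Sequence[float], low: float, high: float) -> List[int]:
    """Return the indices of all values outside [low, high].

    NaN fails the range comparison, so it is reported as a violation too.
    """
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


angles = [12.5, 59.9, 61.0, -0.1, 0.0]
bad = find_violations(angles, low=0.0, high=60.0)
if bad:
    print(f"error detected at indices {bad}")
```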

Research Question 2

How long do errors take to propagate to the RM?

Metrics: Kernel Call

… to measure propagation time of error

Timeline (figure): GPU execution runs Kernel 1, Kernel 2, Kernel 3; CPU execution runs an Error Detector after each kernel. Markers: Error Occurred, Error Detected.

Error detection latency is 2

Tracking Error Propagation

…
Kernel1<<<>>>
DumpToDisk(TM)
DumpToDisk(RM)
DumpToDisk(OM)
Kernel2<<<>>>
…

Compared with golden copy for any data corruptions
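The dump-and-compare procedure can be sketched as follows, using kernel calls as the unit of latency. Each snapshot here is an in-memory dictionary rather than a file on disk, and the function names and sample data are hypothetical.

```python
from typing import Dict, List, Optional

Snapshot = Dict[str, bytes]  # state name ("TM", "RM", "OM") -> dumped bytes


def first_corrupted_kernel(golden: List[Snapshot], faulty: List[Snapshot],
                           state: str) -> Optional[int]:
    """1-based index of the first kernel call after which `state` differs."""
    for k, (g, f) in enumerate(zip(golden, faulty), start=1):
        if g[state] != f[state]:
            return k
    return None  # the state never diverged from the golden copy


def propagation_latency(injected_kernel: int, golden: List[Snapshot],
                        faulty: List[Snapshot], state: str = "RM") -> Optional[int]:
    """Kernel calls between the injection and the first corruption of `state`."""
    hit = first_corrupted_kernel(golden, faulty, state)
    return None if hit is None else hit - injected_kernel


# Made-up dumps taken after kernels 1, 2 and 3.
golden_dumps = [{"TM": b"aaaa", "RM": b"aa"}] * 3
faulty_dumps = [
    {"TM": b"aaab", "RM": b"aa"},  # after kernel 1: only TM is corrupted
    {"TM": b"aaab", "RM": b"aa"},  # after kernel 2: RM still clean
    {"TM": b"aaab", "RM": b"ab"},  # after kernel 3: corruption reaches RM
]
latency = propagation_latency(1, golden_dumps, faulty_dumps, "RM")
print(latency)
```

In this made-up trace the fault is injected in kernel 1 and RM first differs after kernel 3, so the latency is 2 kernel calls.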

Propagation Latency to RM

Checking RM provides short detection latency

Implications

•  RM is a narrow tunnel through which faults frequently propagate
    •  Checking RM for SDC is a better trade-off
•  Crash-causing faults rarely propagate across kernel calls
    •  Deploying high frequency checkpoints for GPGPU can avoid checkpoint corruptions
•  Studied on 2 GPGPU platforms (Nvidia GTX960 & Nvidia K20)
    •  Results are statistically indistinguishable
•  Investigated error spread & masking etc.
    •  … more interesting findings can be found in the paper!

Summary

•  Designed a scalable fault injector for GPGPUs: LLFI-GPU
•  Characterized error propagation patterns in GPGPU applications
•  Discussed their implications on error mitigation techniques

•  Name: Guanpeng (Justin) Li ([email protected])
•  Website: ece.ubc.ca/~gpli
•  LLFI-GPU:
    •  https://github.com/DependableSystemsLab/LLFI-GPU
•  Results:
    •  https://www.dropbox.com/s/xrvojidskkcrj4y/FI_data.xlsx?dl=0

Acknowledgements
