21
Modeling Soft-Error Propagation in Programs Guanpeng (Justin) Li Karthik Pattabiraman Siva Hari Michael Sullivan Timothy Tsai

Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

ModelingSoft-ErrorPropagationinProgramsGuanpeng (Justin)LiKarthik Pattabiraman

SivaHariMichaelSullivanTimothyTsai

Page 2: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

Motivation:SoftErrors

2

= 0001 = 0101

[1]

Softerrorsbecomingmorecommoninprocessors

[1] http://aviral.lab.asu.edu/soft-error-resilience/

Page 3: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

SilentDataCorruption(SDC)

NormalExecution

Fault

ErrorPropagation

SDC

Crash

Benign

IncorrectOutput

CorrectOutput

Exceptions,NoOutput

AmazonS3Incident

3

Page 4: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

SoftwareSolutions

Device/CircuitLevel

ArchitecturalLevel

OperatingSystemLevel

ApplicationLevel

ImpactfulErrors

Protectio

nOverhead

SoftError

4

Increasing

Softwareprotection techniquesaremoreflexibleandcost-effective!

Page 5: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

SelectiveInstructionDuplication

“TheGoldenCurve”

SDCCoverage

ProtectionOverhead

ApplicationSpecific!

*MeasuredinLibquantum,SPEC

InstructionSequence InstructionDuplication

Instruction:SDCRate=X%Overhead=Y%

SelectedInstructionsforGivenTargetSDCCoverage

AKnapsackProblem

5

Page 6: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

DevelopingFault-TolerantApplications

DevelopmentofApplication EvaluateProgramSDCRate

SelectiveProtection

Acceptable

NewRelease

MeasureInstruction SDCRates

1. Thousandsoffaultinjectionsneedtobedone2.Repeateverytimecodeismodified

6

Page 7: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

EstimatingSDCRate

OurGoal

Accuracy

Speed

AVF/PVF/ePVF

[MICRO’03,HPCA’10,DSN’16]

SymPLFIED/Relyzer/GangES

[DSN’08,ASPLOS’12,ISCA’14]

Noexistingtechniquemodelserrorpropagationinbothfastandaccurateway!

FastpredictionofSDCwithoutfaultinjection!

8

Page 8: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

Challenges

• TrackingSDCpropagationishard

• Overbillionsofexecutedinstructions

• Everyinstructionmaypropagateerrorswithdifferentprobabilities

• Dynamicnatureofprogramexecution

• Control-flowdivergence

… …

BR

… …

Corruptingsubsequentstates

T F

8

… …… …… …… …

Page 9: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

Trident:KeyInsight

• Errorpropagationscanbedecomposedintomodules,whichcan

beabstractedintoprobabilisticevents

• Decomposition

• Abstraction

9

Page 10: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

Trident:Workflow

SourceCode

ProgramInput

OutputInsn.

Insn.SDCRates

OverallSDCRate

Insn.forPrediction

Profiling Prediction

10

Page 11: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

BB12

… …

Trident:OurApproach

• Three-levelmodeling

• Register-communication

• Control-flow

• Memorydependency

Reg.

Mem.Contl.

BB4

$2=LOAD0x04

$3=ADD$2,4

CMP$4,$3,4

BR$4,BB5,BB10

BB5

$5=MUL$6,16

… …

BB10

… …

… …

BB102

...=LOAD0x08

T1 F1

T2 F2

fS

fC fM

BB11STORE…,0x08

11

Page 12: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

fs =100%*100%*25%*100%=25%

BB12

… …BB11STORE…,0x08

BB4

$2=LOAD0x04

$3=ADD$2,4

CMP$4,$3,4

BR$4,BB5,BB10

BB5

$5=MUL$6,16

… …

BB10

… …

… …

BB102

...=LOAD0x08

T1 F1

T2 F2

<100%>

<100%>

<25%>

<100%>

PropagationprobabilitywithinBB4?

Reg.

Mem.Contl.

fS

fC fM

Reg.

12

Trident:RegisterCommn.

Page 13: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

Trident:Control-Flow

BB12

… …BB11STORE…,0x08

BB4

$2=LOAD0x04

$3=ADD$2,4

CMP$4,$3,4

BR$4,BB5,BB10

BB5

$5=MUL$6,16

… …

BB10

… …

… …

BB102

...=LOAD0x08

T1 F1

T2 F2

CorruptionprobabilityofSTORE?

80% 20%

30% 70%

<100%>

<100%>

<25%>

<100%>=

*Fornon-loop-terminatingbranches

Reg.

Mem.Contl.

fS

fC fM

Contl.

fC

STOREexec.prob.F1*T2

BRdom.prob.F1

Corrupted

13

Page 14: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

Trident:Memory-Dependency

BB12

… …BB11STORE…,0x08

BB4

$2=LOAD0x04

$3=ADD$2,4

CMP$4,$3,4

BR$4,BB5,BB10

BB5

$5=MUL$6,16

… …

BB10

… …

… …

BB102

...=LOAD0x08

T1 F1

T2 F2

DependentLOAD&STORE

80% 20%

30% 70%

<100%>

<100%>

<25%>

<100%>

Reg.

Mem.Contl.

fS

fC fM

Mem.

P(In) = fS (In)* fC (In2)* fS (In3)* fC (In4) … …

14

*ncorrespondstotheindexofdynamicinstructions

Page 15: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

ExperimentalSetup

BenchmarkApplication Domains

15

• FaultModel• Singlebit-flipinjections– accurate[DSN’17]

• Randominsn.– oneperprogramexecution

• Benchmarks• 11open-sourcebenchmarksfromvariousdomains

• Comparisonwithfaultinjection• Accuracy

• Speed(wallclocktime)

Page 16: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

ExperimentalMethodology

Reg.

Mem.Contl.

fS

Reg.

Mem.Contl.

fS+fCTwoSimplerModelsforComparison

GoalistopredictSDCrateasperfaultinjection

[1]LLVMFaultInjector[DSN’14]

Reminder:

16

• Baseline:FaultinjectionderivedbyLLFI[1]

• ThecloserSDCratetofaultinjection, thebetterprediction

• Createdtwosimplermodels

• Accuracyofeachsub-model

• Asproxytopriorwork

Page 17: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

Evaluation:Accuracy

• MeanAbsoluteError• Trident:4.75%• SimplerModels:15.13%and19.13%

• t-TestonIndividualInstructions• Trident:8outof11arestatisticallyindistinguishable• SimplerModels(fS andfS+fC):Only2and4

ProgramSDCRate;3,000Sampled Instructions;ErrorBar:+/-0.07%~+/-1.76%at95%ConfidenceInterval

Trident isclosetofaultinjectionresults,andsignificantlybetterthanthesimplermodels!

3,000randomlysampledinstructionsforfaultinjection

andthemodels

17

Page 18: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

Evaluation:Speed

• Program’sOverallSDCRate:• 6.7xfasterat3,000samples

• Per-InstructionSDCRate:• Onaverage,380xfasterat100samples

perinstruction

• Benchmarks:FItakesnearly100hourswhereasTridenttakes<20mins

Trident isfasterthanfaultinjectionby2ordersofmagnitude!

Wall-Clock TimeofEstimatingProgramSDCRate

18

Page 19: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

UseCase:SelectiveInstructionDuplication

SDCCoverage

ProtectionOverhead

*MeasuredinLibquantum,SPEC

ByFaultInjections

ByTrident

“TheGoldenCurve”

ByfS+fCByfS

SelectiveInstructionDuplication

Recap:

19

Page 20: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

Extension

• Understandhowerrorpropagationisaffectedbymultipleinputs

• ExtensionforboundingSDCratewithmultipleinputs

20

Session6:ModelingandVerificationWednesday,June27th

“ModelingInput-DependentErrorPropagationinPrograms”

Page 21: Modeling Soft-Error Propagation in Programsblogs.ubc.ca/karthik/files/2018/06/DSN18_trident_talk.pdf · Challenges • Tracking SDC propagation is hard • Over billions of executed

Summary

• Faultinjectionsaretooslowtointegrateintosoftwaredevelopmentcycle

• Trident isbothaccurateandfastinpredictingSDCrates

• Canguideselectiveprotectionofinstructionsinprograms– comparable

tofaultinjectioninaccuracyforfractionofcost

• OpenSource:https://github.com/DependableSystemsLab/Trident

Guanpeng (Justin)LiUniversityofBritishColumbia (UBC)

[email protected]