Transcript
Page 1: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems

Jinsuk Chung, Ikhwan Lee, Michael Sullivan, Jee Ho Ryoo, Dong Wan Kim, Doe Hyun Yoon+, Larry Kaplan*, and Mattan Erez

UT Austin, + now at HP Labs, * Cray

Page 3: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Motivation and goals

• Resilience bounds performance
  – Resilience is a major obstacle to exascale

Containment domains: scalable, efficient resilience

• Hierarchical
  – Preserve data where most efficient and effective
• Proportional
  – Tunable redundancy and recovery
  – Different errors/faults handled differently
• Abstract
  – Portable
  – Amenable to auto-tuning and analysis

CDs elevate resilience to a first-order application concern

Containment Domain [SC'12] (c) Jinsuk Chung

Page 4: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Containment domains

• Single consistent abstraction
  – Encapsulates resilience techniques
  – Spans levels: programming, system, and analysis
• Components
  – Preserve data on domain start
  – Compute (domain body)
  – Detect faults before domain commits
  – Recover from detected errors
• Semantics
  – Erroneous data never communicated
  – Each CD provides recovery mechanism
• Hierarchy
  – Escalation
  – Match CD and machine hierarchies

[Figure: a root CD enclosing child CDs]
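The four components and the escalation rule can be read as a small control loop. The sketch below is a Python illustration only: the class `ContainmentDomain`, the exception `Escalate`, and the retry limit are hypothetical names chosen here, not the CD API from the talk.

```python
import copy


class Escalate(Exception):
    """Raised when a CD cannot recover locally and hands the error to its parent."""


class ContainmentDomain:
    # Hypothetical illustration class; not the CD API from the talk.
    def __init__(self, name, max_retries=2):
        self.name = name
        self.max_retries = max_retries

    def run(self, state, body, detect):
        """Preserve `state`, run `body`, check the result, re-execute on error."""
        preserved = copy.deepcopy(state)          # preserve on domain start
        for _ in range(self.max_retries + 1):
            body(state)                           # compute (domain body)
            if detect(state):                     # detect before the domain commits
                return state                      # commit: only checked data leaves
            state.clear()                         # recover: restore preserved inputs
            state.update(copy.deepcopy(preserved))
        raise Escalate(self.name)                 # local recovery failed: escalate


def demo():
    """A child CD sees a fault on its first attempt and recovers by re-execution."""
    attempts = {"n": 0}

    def body(state):
        attempts["n"] += 1
        state["y"] = state["x"] * 2
        if attempts["n"] == 1:
            state["y"] += 1                       # injected transient error

    def detect(state):
        return state["y"] == state["x"] * 2       # algorithm-specific check

    child = ContainmentDomain("child")
    result = child.run({"x": 21}, body, detect)
    return result["y"], attempts["n"]
```

A parent would wrap the child's `run` call in its own domain body and treat `Escalate` as a fault to recover from at its own level.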

Page 5: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Mapping example: SpMV

void task<inner> SpMV(in M, in Vi, out Ri) {
  forall(…) reduce(…)
    SpMV(M[…], Vi[…], Ri[…]);
}

void task<leaf> SpMV(…) {
  for r = 0..N
    for c = rowS[r]..rowS[r+1] {
      resi[r] += data[c] * Vi[cIdx[c]];
      prevC = c;
    }
}
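The leaf task's loop has the shape of a sparse matrix-vector product over a compressed sparse row (CSR) layout. The following Python sketch makes that runnable under two assumptions not stated on the slide: `rowS` holds row-start offsets, and `rowS[r]..rowS[r+1]` is a half-open range. The function and variable names are illustrative.

```python
def spmv_csr(row_start, col_idx, data, vec):
    """Multiply a CSR matrix by a dense vector, following the leaf task's loop."""
    n_rows = len(row_start) - 1
    result = [0.0] * n_rows
    for r in range(n_rows):
        # Assumption: rowS[r]..rowS[r+1] is a half-open range of nonzero slots.
        for c in range(row_start[r], row_start[r + 1]):
            result[r] += data[c] * vec[col_idx[c]]
    return result


# 3x3 example matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] times [1, 2, 3].
row_start = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
data = [2.0, 1.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_start, col_idx, data, [1.0, 2.0, 3.0])
```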

[Figure: matrix M and vector V]

Page 6: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Mapping example: SpMV

[Figure: matrix M partitioned into blocks M00, M01, M10, M11; vector V partitioned into V0 and V1. The inner and leaf SpMV tasks are shown unchanged.]

Page 7: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Mapping example: SpMV

[Figure: matrix M (blocks M00, M01, M10, M11) and vector V (V0, V1) distributed to 4 nodes as the pairs (M00, V0), (M01, V1), (M10, V0), (M11, V1). The inner and leaf SpMV tasks are shown unchanged.]

Page 8: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Mapping example: SpMV

[Figure: matrix M partitioned into blocks M00, M01, M10, M11 and vector V partitioned into V0 and V1, distributed to 4 nodes. The inner and leaf SpMV tasks are shown unchanged.]

Page 9: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Mapping example: SpMV

[Figure: CD hierarchy for SpMV. The parent CD covers M and V and has its own Preserve (Parent), Detect (Parent), and Recover (Parent) steps. Four child CDs, one per pair (M00, V0), (M10, V0), (M01, V1), (M11, V1), each have their own Preserve, Detect, and Recover steps.]

Page 10: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Initial CD preservation API and prototype

void task<inner> SpMV(in M, in Vi, out Ri) {
  cd = create_CD(parentCD);
  preserve_via_copy(cd, matrix, …);
  forall(…) reduce(…)
    SpMV(M[…], Vi[…], Ri[…]);
  commit_CD(cd);
}

void task<leaf> SpMV(…) {
  cd = create_CD(parentCD);
  preserve_via_copy(cd, matrix, …);
  preserve_via_parent(cd, veci, …);
  for r = 0..N
    for c = rowS[r]..rowS[r+1] {
      resi[r] += data[c] * Vi[cIdx[c]];
      check { fault<fail>(c > prevC); }
      prevC = c;
    }
  commit_CD(cd);
}

API: create_CD, preserve_via_copy, preserve_via_parent, check, commit_CD

Preservation components prototype on Cray XK7
http://lph.ece.utexas.edu/public/CDs
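To make the difference between the two preservation calls concrete, here is a minimal Python stand-in. It is a sketch under assumptions: the slide elides the real argument lists, so the class `CD`, its `restore` and `commit` methods, and the name-keyed storage are all hypothetical. The `check` call is left out because its exact semantics are not spelled out on the slide.

```python
import copy


class CD:
    # Hypothetical Python stand-in; the slide elides the real argument lists.
    def __init__(self, parent=None):         # plays the role of create_CD(parentCD)
        self.parent = parent
        self.copies = {}                     # data this CD preserved itself
        self.via_parent = set()              # names restored from the parent's copy

    def preserve_via_copy(self, name, value):
        self.copies[name] = copy.deepcopy(value)

    def preserve_via_parent(self, name):
        self.via_parent.add(name)            # no new copy: the parent holds one

    def restore(self, name):
        if name in self.copies:
            return copy.deepcopy(self.copies[name])
        if name in self.via_parent and self.parent is not None:
            return self.parent.restore(name)
        raise KeyError(name)

    def commit(self):                        # plays the role of commit_CD(cd)
        self.copies.clear()
        self.via_parent.clear()


parent = CD()
parent.preserve_via_copy("vec", [1.0, 2.0])

child = CD(parent)
child.preserve_via_copy("matrix", [[1.0, 0.0], [0.0, 1.0]])
child.preserve_via_parent("vec")

vec = [99.0, 2.0]                # pretend a fault corrupted the child's vector
vec = child.restore("vec")       # recovery reads the parent's preserved copy
child.commit()
```

The point of the sketch is that the child stores only the matrix; the vector costs it no extra storage because the parent's copy is still valid.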

Page 11: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Containment domains long-term design

[Figure labels: efficiency-oriented programming model; CD Annotations (resilience model); CD API (resilience interface); Runtime Library Interface; Hardware Abstraction Layer; Machine; Error Reporting Architecture (ECC, status); CD control and persistence. Aspects: language integration, compiler support, runtime components, hardware aspects.]

int main(int argc, char **argv) {
  main_task here = phalanx::initialize(argc, argv);

  … Create test arrays here …

  // Launch kernel on default CPU ("host")
  openmp_event e1 = async(here, here.processor(), n)
      (host_saxpy, 2.0f, host_x, host_y);
  // Launch kernel on default GPU ("device")
  cuda_event e2 = async(here, here.cuda_gpu(0), n)
      (device_saxpy, 2.0f, device_x, device_y);

  wait(here, e1&e2);
  return 0;
}

Research prototype by Cray for XK7 (Titan)

Page 12: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Outline

• Motivation and Goals
• Semantics of Containment Domains
• What do CDs do? When and why are they good?
  – Differentiated error handling
  – Analyzability
• Evaluation

Page 13: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Differentiated Error Handling

Page 14: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

State preservation and restoration

• Abstract
  – Optimized preservation and restoration
  – Analyzed, auto-tuned
  – Allows explicit application control
• Hierarchical
  – Match storage hierarchy
  – Maximize locality and minimize overhead
• Partial
  – Preserve only when worth it
  – Exploit natural redundancy
  – Exploit hierarchy
  – Enable regeneration

Page 15: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

SpMV partial preservation tuning

void task<leaf> SpMV(…) {
  cd = create_CD(parentCD);
  preserve_via_copy(cd, matrix, …);
  preserve_via_parent(cd, veci, …);
  for r = 0..N
    for c = rowS[r]..rowS[r+1] {
      resi[r] += data[c] * Vi[cIdx[c]];
      check { fault<fail>(c > prevC); }
      prevC = c;
    }
  commit_CD(cd);
}

[Figure: matrix M in blocks M00, M01, M10, M11 and vector V in halves V0 and V1, distributed across nodes as (M00, V0), (M01, V1), (M10, V0), (M11, V1). Annotations: natural redundancy, hierarchy.]

Page 16: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Concise abstraction for complex behavior

void task<leaf> SpMV(…) {
  cd = create_CD(parentCD);
  preserve_via_copy(cd, matrix, …);
  preserve_via_parent(cd, veci, …);
  for r = 0..N
    for c = rowS[r]..rowS[r+1] {
      resi[r] += data[c] * Vi[cIdx[c]];
      check { fault<fail>(c > prevC); }
      prevC = c;
    }
  commit_CD(cd);
}

[Figure annotations: Local copy or regen; Sibling; Parent (unchanged)]

Page 17: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Detection

• Abstract
  – Utilize most efficient detection mechanism
  – Low overhead detection: e.g., algorithm specific detection
• Customized
  – Replicate in time, replicate in space, algorithm specific
• Heterogeneous
  – Per-CD routines
  – E.g., selective multi-granularity DMR
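As one concrete reading of "replicate in time", a CD body can be run twice and the two results compared, which is dual modular redundancy (DMR) applied in time rather than space. The Python sketch below is illustrative only; `run_with_dmr` and `make_flaky_sum` are hypothetical helpers, and the injected error stands in for a transient fault.

```python
def run_with_dmr(compute, inputs):
    """Replicate in time: run `compute` twice and flag a mismatch as an error."""
    first = compute(inputs)
    second = compute(inputs)
    return first, first != second   # (result, error_detected)


def make_flaky_sum(fail_on_call):
    """Build a sum routine that returns a corrupted value on one chosen call."""
    calls = {"n": 0}

    def flaky_sum(values):
        calls["n"] += 1
        total = sum(values)
        return total + 1 if calls["n"] == fail_on_call else total

    return flaky_sum


clean_result, clean_error = run_with_dmr(sum, [1, 2, 3])
_, injected_error = run_with_dmr(make_flaky_sum(fail_on_call=2), [1, 2, 3])
```

Because detection is chosen per CD, a check like this could be applied only to the domains where doubling the compute cost is acceptable.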

Page 18: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Recovery

• Abstract
  – Utilize most efficient recovery mechanism
  – Maximize local recovery
  – Low overhead recovery, e.g., re-materialization or regeneration
• Customized
  – Re-execute, ignore, re-materialize, DMR, TMR
• Heterogeneous
  – Per-CD routines
  – E.g., selective multi-granularity DMR
  – App/system specific

[Figure: timeline of a CD showing Preserve, Compute, and Detect phases and the re-execution overhead; axis: time]

Page 19: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Analyzability

Page 20: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Analytical Model

• Leverage hierarchy and CD semantics
  – Uncoordinated "local" actions
  – Solve in out
• Application abstracted to CDs
  – CD tree
  – Volumes of preservation, computation, and communication
  – Preservation and recovery options per CD
• Machine model
  – Storage hierarchy
  – Communication hierarchy
  – Bandwidths and capacities
  – Error processes and rates

[Figure: execution time]
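A toy version of such a model helps show how a CD tree and per-level error rates combine into an execution-time estimate. The Python sketch below is not the model from the paper: it assumes Poisson errors, detection at the end of each domain, full re-execution on error, children that run one after another, and it treats each child's expected time as a fixed cost. `CDNode` and its fields are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class CDNode:
    # All fields are hypothetical model inputs, not values from the talk.
    work: float                     # seconds of preserve + compute in this CD itself
    error_rate: float               # errors per second handled at this CD's level
    children: list = field(default_factory=list)


def expected_time(node):
    """Estimate the expected run time of a CD, solving children before the parent."""
    # Assumption: children run one after another, so their expected times add up.
    body = node.work + sum(expected_time(child) for child in node.children)
    # Assumption: Poisson errors, detected at the end, trigger a full re-execution.
    # A clean attempt has probability exp(-rate * body) and attempts are geometric,
    # so the expected number of attempts is exp(rate * body).
    return body * math.exp(node.error_rate * body)


leaf = CDNode(work=10.0, error_rate=0.01)
root = CDNode(work=5.0, error_rate=0.001, children=[leaf, leaf])
root_estimate = expected_time(root)
```

Frequent errors are absorbed by the short leaf domains, so only the rarer parent-level errors pay for re-executing the whole tree.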

Page 21: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Power model

• CDs that are not re-executing may remain idle
• Actively executing a CD has a relative power of 1
• A node that is idling consumes a relative power of
  – In our experiments

[Figure: parallel domains over time. After execution, one domain re-executes for the re-execution time while the other domains are idle.]
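The power model can be turned into a small energy calculation. This Python sketch assumes that all parallel domains wait while a single domain re-executes; the idle power is left as a parameter because its value does not survive in this transcript, and the number used below is a placeholder only.

```python
def relative_energy(n_domains, exec_time, reexec_time, idle_power):
    """Energy in units of (active power x time) when one domain re-executes."""
    active = n_domains * exec_time              # all domains run at relative power 1
    recover = reexec_time                       # one domain re-executes at power 1
    idle = (n_domains - 1) * reexec_time * idle_power   # the rest wait, idling
    return active + recover + idle


def energy_overhead(n_domains, exec_time, reexec_time, idle_power):
    """Extra energy relative to an error-free run of the same domains."""
    baseline = n_domains * exec_time
    total = relative_energy(n_domains, exec_time, reexec_time, idle_power)
    return (total - baseline) / baseline


# Placeholder inputs: the idle power below is NOT the value used in the talk.
overhead = energy_overhead(n_domains=4, exec_time=10.0, reexec_time=10.0, idle_power=0.5)
```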

Page 22: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Evaluation

• What we evaluated
  – Performance efficiency
  – Energy overhead
• Baseline resiliency approaches
  – g-CPR: global checkpoint restart
  – h-CPR: hierarchical checkpoint restart (e.g., SCR)
  – Optimum interval used for each
• CD advantages
  – Preserve only what is needed
  – Hierarchical uncoordinated
• Assumptions
  – Detection overhead is assumed to be zero
  – Capacity of storage for preservation is infinite
  – Infinite spares (quick repair)
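The slide does not say how the optimum checkpoint interval was obtained for the baselines. One common first-order choice is Young's approximation, interval = sqrt(2 x checkpoint cost x MTBF). The Python sketch below uses it purely as an illustration; the numbers are placeholders and the authors may have used a different method.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order optimum compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """First-order expected overhead: checkpoint cost plus half-interval rework."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


# Placeholder numbers in seconds, chosen only to exercise the formula.
cost, mtbf = 60.0, 3600.0 * 24
best = young_interval(cost, mtbf)
```

Checkpointing more often than `best` wastes time writing checkpoints, and checkpointing less often wastes time redoing lost work; `overhead_fraction` is smallest at `best` under this first-order model.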

Page 23: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Machine and error models

Component | "Performance"        | Error                   | Error scaling
Core      | 10 GFLOP/core        | Soft error              | ∝ #cores
Memory    | 1 GB/core            | ECC fail                | ∝ #DRAM chips
Socket    | 200 GB/s/socket      | Hard/OS crash           | ∝ #sockets
System    | Hierarchical network | Power module or network | ∝ #modules and #cabinets

Page 24: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Workloads

• Monte Carlo NT
  – Embarrassingly parallel
  – Infrequent communication
  – Small fraction of read/write data
• Iterative hierarchical SpMV
  – Recursive decomposition
  – Natural redundancy
  – Frequent global communication
• Mantevo HPCCG
  – Requires little storage
  – Conjugate-gradient based linear system solver
  – Frequent global communication

Page 25: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Evaluation tools

• Simulator
  – Executes at granularity of containment domains
  – Re-executes when error is detected
  – Used to validate the analytical model
• Analytical Model
  – Simulation is too slow for evaluating exascale systems
  – Inputs to the model: extracted from each application
    • Volume of preservation, restoration, computation and communication
    • Error rates
    • Shape of CD structure
• Validation
  – Simulator and analytical model
  – Prototype of preservation/restoration on Cray XK7

Page 26: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Autotuned CDs perform well

[Figure: three panels (NT, SpMV, HPCCG) plotting performance efficiency (ticks at 0%, 40%, 80%) against peak system performance (2.5PF, 10PF, 40PF, 160PF, 640PF, 1.2EF, 2.5EF). Legend entries: NT panel: "CDs, NT" and "h-CPR, 80%"; SpMV panel: "CDs, SpMV" and "h-CPR, 50%"; HPCCG panel: "CDs, HPCCG", "h-CPR, 10%", and "g-CPR, 10%".]

Page 28: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

SpMV, HPCCG: local recovery and partial preservation

[Figure: preservation across storage levels: DRAM, Local NVM, Remote NVM, Disk]

Partial preservation via sibling or parent where appropriate

Page 29: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

NT: hierarchical local recovery and partial preservation

[Figure: preservation across storage levels: DRAM, Local NVM, Remote NVM, Disk]

Partial preservation via sibling, parent, or regeneration where appropriate

Page 31: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

CDs improve energy efficiency at scale

[Figure: three panels (NT, SpMV, HPCCG) plotting energy overhead (ticks at 0%, 10%, 20%) against peak system performance (2.5PF, 10PF, 40PF, 160PF, 640PF, 1.2EF, 2.5EF). Legend entries: NT panel: "CDs, NT" and "h-CPR, 80%"; SpMV panel: "CDs, SpMV" and "h-CPR, 50%"; HPCCG panel: "CDs, HPCCG", "h-CPR, 10%", and "g-CPR, 10%".]

Page 33: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

10X failure rate emphasizes CD benefits

[Figure: three panels plotting performance efficiency and energy overhead (both 0% to 100%) against peak performance (2.5PF, 10PF, 40PF, 160PF, 640PF, 1.2EF, 2.5EF). Legend entries: "CDs, NT" and "h-CPR, 80%"; "CDs, SpMV"; "CDs, HPCCG".]

Page 34: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

More in the paper

• Strict vs. relaxed containment domains
• Analytical model details
• Error and machine model details
• Additional sensitivity studies
• Related work discussion

Page 35: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Conclusion

• Containment domains
  – Abstract constructs for resilience concerns & techniques
  – Proportional and application/machine tuned resilience
  – Hierarchical & distributed preservation, restoration, and recovery
  – Analyzable and amenable to automatic optimization
  – Scalable to large systems with high relative energy efficiency
  – Heterogeneous to match emerging architecture
• Good start and exciting work ahead
  – Preservation concept prototyped on Cray XK7
  – Fine-grained CDs for high error rates
  – Compiler optimizations and support
  – Application-specific detection/elision
  – PGAS support and interactions with system
  – Interaction with other models (tasking, DSLs, …)

http://lph.ece.utexas.edu/public/CDs

Page 36: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems

Questions?

Thank you
