Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems
Jinsuk Chung, Ikhwan Lee, Michael Sullivan, Jee Ho Ryoo, Dong Wan Kim, Doe Hyun Yoon+,
Larry Kaplan*, and Mattan Erez
UT Austin, + now at HP Labs, * Cray
Motivation and goals
• Resilience bounds performance
  – Resilience is a major obstacle to exascale

Containment domains: scalable efficient resilience
• Hierarchical
  – Preserve data where most efficient and effective
• Proportional
  – Tunable redundancy and recovery
  – Different errors/faults handled differently
• Abstract
  – Portable
  – Amenable to auto-tuning and analysis
CDs elevate resilience to a first-order application concern
Containment Domain [SC'12] (c) Jinsuk Chung
Containment domains
• Single consistent abstraction
  – Encapsulates resilience techniques
  – Spans levels: programming, system, and analysis
• Components
  – Preserve data on domain start
  – Compute (domain body)
  – Detect faults before domain commits
  – Recover from detected errors
• Semantics
  – Erroneous data never communicated
  – Each CD provides recovery mechanism
• Hierarchy
  – Escalation
  – Match CD and machine hierarchies
[Figure: a root CD enclosing child CDs]
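The four components and the escalation rule listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CD runtime: `run_cd`, `CDError`, and `max_retries` are hypothetical names, state is a plain dict, and preservation is a full deep copy.

```python
import copy


class CDError(Exception):
    """Hypothetical signal that a CD could not recover and must escalate."""


def run_cd(body, detect, state, max_retries=2):
    """Preserve, compute, detect, then commit or recover by re-execution."""
    preserved = copy.deepcopy(state)            # preserve data on domain start
    for _ in range(max_retries + 1):
        try:
            result = body(state)                # compute (domain body)
            if detect(state, result):           # detect faults before commit
                return result                   # commit: result may leave the CD
        except CDError:
            pass                                # a child escalated: handle it here
        state.clear()
        state.update(copy.deepcopy(preserved))  # recover: restore preserved data
    raise CDError("recovery failed; escalate to the parent CD")


attempts = []


def flaky_sum(state):
    attempts.append(1)
    if len(attempts) == 1:
        state["x"][0] = 999                     # injected fault on the first try
    return sum(state["x"])


state = {"x": [1, 2, 3]}
result = run_cd(flaky_sum, lambda s, r: r == 6, state)
```

Because a child's `CDError` is caught by the enclosing `run_cd` call, nesting calls reproduces escalation: the parent restores its own preserved state and re-executes.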
Mapping example: SpMV
    void task<inner> SpMV(in M, in Vi, out Ri) {
      forall(…) reduce(…)
        SpMV(M[…], Vi[…], Ri[…]);
    }
    void task<leaf> SpMV(…) {
      for r=0..N
        for c=rowS[r]..rowS[r+1] {
          resi[r] += data[c] * Vi[cIdx[c]];
          prevC = c;
        }
    }
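For readers who want to run the leaf loop, here is a plain Python rendering. It assumes the slide's `rowS`, `cIdx`, and `data` arrays follow the usual compressed sparse row (CSR) layout, which the slide suggests but does not state; the function name and the small test matrix are made up for illustration.

```python
def spmv_csr(row_start, col_idx, data, v):
    """Sparse matrix-vector product with the same loop nest as the leaf task."""
    n_rows = len(row_start) - 1
    res = [0.0] * n_rows
    for r in range(n_rows):
        # Nonzeros of row r occupy positions row_start[r] .. row_start[r+1]-1.
        for c in range(row_start[r], row_start[r + 1]):
            res[r] += data[c] * v[col_idx[c]]
    return res


# Hypothetical 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form.
row_start = [0, 2, 3]
col_idx = [0, 2, 1]
data = [1.0, 2.0, 3.0]
result = spmv_csr(row_start, col_idx, data, [1.0, 1.0, 1.0])
```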
[Figure: matrix M is split into four blocks and vector V into two blocks, then distributed to 4 nodes]
[Figure: the parent CD covers M and V and has its own Preserve, Detect, and Recover stages; each of the four child CDs covers one block of M with its piece of V and has its own Preserve, Detect, and Recover stages]
Initial CD preservation API and prototype
    void task<inner> SpMV(in M, in Vi, out Ri) {
      cd = create_CD(parentCD);
      preserve_via_copy(cd, matrix, …);
      forall(…) reduce(…)
        SpMV(M[…], Vi[…], Ri[…]);
      commit_CD(cd);
    }
    void task<leaf> SpMV(…) {
      cd = create_CD(parentCD);
      preserve_via_copy(cd, matrix, …);
      preserve_via_parent(cd, veci, …);
      for r=0..N
        for c=rowS[r]..rowS[r+1] {
          resi[r] += data[c] * Vi[cIdx[c]];
          check {fault<fail>(c > prevC);}
          prevC = c;
        }
      commit_CD(cd);
    }
Preservation components prototype on Cray XK7
http://lph.ece.utexas.edu/public/CDs
API: create_CD, preserve_via_copy, preserve_via_parent, check, commit_CD
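The difference between preserving via copy and preserving via the parent can be shown with a small Python stand-in. The class below is hypothetical and only mimics the intent of the calls named on the slide; the real prototype's signatures are elided in the deck and are not reproduced here.

```python
import copy


class CD:
    """Hypothetical containment-domain handle (not the prototype's API)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.copies = {}          # name -> private preserved copy
        self.via_parent = set()   # names whose preserved value lives in an ancestor

    def preserve_via_copy(self, name, value):
        self.copies[name] = copy.deepcopy(value)

    def preserve_via_parent(self, name):
        self.via_parent.add(name)  # no data is moved, only a reference is kept

    def restore(self, name):
        if name in self.copies:
            return copy.deepcopy(self.copies[name])
        if name in self.via_parent and self.parent is not None:
            return self.parent.restore(name)  # walk up the CD hierarchy
        raise KeyError(name)

    def commit(self):
        self.copies.clear()
        self.via_parent.clear()


root = CD()
root.preserve_via_copy("vec", [1.0, 2.0])
leaf = CD(parent=root)
leaf.preserve_via_copy("matrix", [[1.0, 0.0], [0.0, 1.0]])
leaf.preserve_via_parent("vec")
restored_vec = leaf.restore("vec")
copies_in_leaf = sorted(leaf.copies)
```

The leaf stores only the matrix; the vector costs it no extra storage because the parent already holds an unchanged copy.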
Containment domains long-term design
[Figure: design stack with the labels below]
Layers: efficiency-oriented programming model; Runtime Library Interface; Hardware Abstraction Layer; Machine
Resilience components: CD Annotations (resilience model); CD API (resilience interface); CD control and persistence; Error Reporting Architecture; ECC, status
Scope: Language integration; Compiler support; Runtime components; Hardware aspects
Research prototype by Cray for XK7 (Titan)
Programming model example:
    int main(int argc, char **argv) {
      main_task here = phalanx::initialize(argc, argv);
      … Create test arrays here …
      // Launch kernel on default CPU ("host")
      openmp_event e1 = async(here, here.processor(), n)
                             (host_saxpy, 2.0f, host_x, host_y);
      // Launch kernel on default GPU ("device")
      cuda_event e2 = async(here, here.cuda_gpu(0), n)
                           (device_saxpy, 2.0f, device_x, device_y);
      wait(here, e1&e2);
      return 0;
    }
Outline
• Motivation and Goals
• Semantics of Containment Domains
• What do CDs do? When and why are they good?
  – Differentiated error handling
  – Analyzability
• Evaluation
Differentiated Error Handling
State preservation and restoration
• Abstract
  – Optimized preservation and restoration
  – Analyzed, auto-tuned
  – Allows explicit application control
• Hierarchical
  – Match storage hierarchy
  – Maximize locality and minimize overhead
• Partial
  – Preserve only when worth it
  – Exploit natural redundancy
  – Exploit hierarchy
  – Enable regeneration
SpMV partial preservation tuning
    void task<leaf> SpMV(…) {
      cd = create_CD(parentCD);
      preserve_via_copy(cd, matrix, …);
      preserve_via_parent(cd, veci, …);
      for r=0..N
        for c=rowS[r]..rowS[r+1] {
          resi[r] += data[c] * Vi[cIdx[c]];
          check {fault<fail>(c > prevC);}
          prevC = c;
        }
      commit_CD(cd);
    }
[Figure: the partitioned matrix M and vector V, with callouts "Natural redundancy" and "Hierarchy"]
Concise abstraction for complex behavior
[Figure: the leaf SpMV task shown above, annotated with restoration sources: local copy or regen, sibling, parent (unchanged)]
Detection
• Abstract
  – Utilize most efficient detection mechanism
  – Low overhead detection: e.g., algorithm specific detection
• Customized
  – Replicate in time, replicate in space, algorithm specific
• Heterogeneous
  – Per-CD routines
  – E.g., selective multi-granularity DMR
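"Replicate in time" can be illustrated with a short Python sketch of dual modular redundancy (DMR): run the same computation twice and treat a mismatch as a detected error. The function names and the injected fault are hypothetical; a per-CD detection routine in a real system would be chosen by the application or runtime.

```python
def dmr_in_time(fn, *args):
    """Run fn twice on the same inputs; a mismatch means an error was detected."""
    first = fn(*args)
    second = fn(*args)
    return first, first == second


calls = {"n": 0}


def faulty_square(x):
    """Squares x, but returns a corrupted value on its second call."""
    calls["n"] += 1
    return x * x + (1 if calls["n"] == 2 else 0)


value_ok, agree_ok = dmr_in_time(lambda x: x * x, 7)
value_bad, agree_bad = dmr_in_time(faulty_square, 7)
```

DMR only detects the disagreement; deciding which run was correct is left to the CD's recovery routine.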
Recovery
• Abstract
  – Utilize most efficient recovery mechanism
  – Maximize local recovery
  – Low overhead recovery, e.g., re-materialization or regeneration
• Customized
  – Re-execute, ignore, re-materialize, DMR, TMR
• Heterogeneous
  – Per-CD routines
  – E.g., selective multi-granularity DMR
  – App/system specific
[Figure: timeline of Preserve, Compute, and Detect phases along a Time axis, showing the re-execution overhead]
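Of the recovery options listed, triple modular redundancy (TMR) is the easiest to show in isolation: three runs, majority vote, and escalation when no majority exists. The Python below is a sketch with hypothetical names; it assumes results are hashable and that a failed vote is reported by raising an exception.

```python
from collections import Counter


def tmr(fn, *args):
    """Run fn three times and return the value at least two runs agree on."""
    results = [fn(*args) for _ in range(3)]
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among three runs; escalate")
    return value


calls = {"n": 0}


def faulty_square(x):
    """Squares x, but returns a corrupted value on its second call."""
    calls["n"] += 1
    return x * x + (1 if calls["n"] == 2 else 0)


voted = tmr(faulty_square, 7)
```

Unlike re-execution, TMR masks a single error without restoring preserved state, at the price of tripling the compute inside the CD.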
Analyzability

Analytical Model
• Leverage hierarchy and CD semantics
  – Uncoordinated "local" actions
  – Solve in → out
• Application abstracted to CDs
  – CD tree
  – Volumes of preservation, computation, and communication
  – Preservation and recovery options per CD
• Machine model
  – Storage hierarchy
  – Communication hierarchy
  – Bandwidths and capacities
  – Error processes and rates
[Figure: execution time]
Power model
• CDs that are not re-executing may remain idle
• Actively executing a CD has a relative power of 1
• A node that is idling consumes a relative power of …
  – In our experiments …
[Figure: parallel domains with Execution and Re-execution phases; domains that are not re-executing are marked Idle during the re-execution time]
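The power model can be turned into a small energy calculation. The sketch below is an assumption-laden illustration: it considers one group of parallel CDs in which a single CD re-executes once while its siblings wait, and the idle power of 0.25 is a made-up placeholder, since the value used in the experiments is not legible in this copy of the slides.

```python
def relative_energy(exec_time, reexec_time, n_domains, idle_power):
    """Energy of n parallel CDs when one re-executes and the others idle.

    Active power is 1 (relative); idle_power is the idle node's relative power.
    """
    active = n_domains * exec_time + reexec_time        # all run, one re-runs
    idle = (n_domains - 1) * reexec_time * idle_power   # siblings wait
    return active + idle


def energy_overhead(exec_time, reexec_time, n_domains, idle_power):
    """Extra energy relative to an error-free run of the same CDs."""
    baseline = n_domains * exec_time
    total = relative_energy(exec_time, reexec_time, n_domains, idle_power)
    return total / baseline - 1.0


# Hypothetical numbers: 4 domains, 10 time units each, one full re-execution.
overhead = energy_overhead(10.0, 10.0, 4, idle_power=0.25)
```

With these placeholder inputs the overhead is 0.4375, of which the idle siblings account for 7.5 of the 17.5 extra energy units.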
Evaluation
• What we evaluated
  – Performance efficiency
  – Energy overhead
• Baseline resiliency approaches
  – g-CPR: global checkpoint restart
  – h-CPR: hierarchical checkpoint restart (e.g., SCR)
  – Optimum interval used for each
• CD advantages
  – Preserve only what is needed
  – Hierarchical uncoordinated
• Assumptions
  – Detection overhead is assumed to be zero
  – Capacity of storage for preservation is infinite
  – Infinite spares (quick repair)
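The slides do not say how the optimum checkpoint interval was computed for the baselines. As one common way to do it, the Python sketch below uses Young's first-order approximation, interval = sqrt(2 × checkpoint cost × MTBF); treating this as the formula behind the baselines is an assumption, and the input numbers are hypothetical.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order optimum time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_overhead(checkpoint_cost, mtbf, interval):
    """Approximate lost fraction: checkpointing time plus expected rework."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


# Hypothetical inputs: 50 s per checkpoint, 10,000 s mean time between failures.
interval = young_interval(50.0, 10000.0)
overhead = first_order_overhead(50.0, 10000.0, interval)
```

For these inputs the interval is 1000 s and the first-order overhead is about 10%, split evenly between writing checkpoints and redoing lost work.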
Machine and error models
Component  "Performance"          Error                    Error scaling
Core       10GFLOP/core           Soft error               ∝ #cores
Memory     1GB/core               ECC fail                 ∝ #DRAM chips
Socket     200GB/s /socket        Hard/OS crash            ∝ #sockets
System     Hierarchical network   Power module or network  ∝ #modules and #cabinets
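The "error scaling" column says each error rate grows in proportion to a component count. A short Python sketch makes the consequence concrete: if the error processes are independent (an assumption), the system rate is the sum of count × per-unit rate, so quadrupling the machine quadruples the rate. All per-unit rates and counts below are hypothetical placeholders, not values from the talk.

```python
def system_error_rate(counts, rate_per_unit):
    """Sum of independent error rates, each proportional to a component count."""
    return sum(counts[name] * rate_per_unit[name] for name in counts)


# Hypothetical per-unit rates (errors per hour) and component counts.
rates = {"cores": 1e-6, "dram_chips": 2e-6, "sockets": 1e-5}
small = {"cores": 1000, "dram_chips": 500, "sockets": 100}
large = {name: 4 * n for name, n in small.items()}

rate_small = system_error_rate(small, rates)
rate_large = system_error_rate(large, rates)
mtbf_small_hours = 1.0 / rate_small
```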
Workloads
• Monte Carlo NT
  – Embarrassingly parallel
  – Infrequent communication
  – Small fraction of read/write data
• Iterative hierarchical SpMV
  – Recursive decomposition
  – Natural redundancy
  – Frequent global communication
• Mantevo HPCCG
  – Requires little storage
  – Conjugate-gradient based linear system solver
  – Frequent global communication
Evaluation tools
• Simulator
  – Executes at granularity of containment domains
  – Re-executes when error is detected
  – Used to validate the analytical model
• Analytical Model
  – Simulation is too slow for evaluating exascale systems
  – Inputs to the model: extracted from each application
    • Volume of preservation, restoration, computation and communication
    • Error rates
    • Shape of CD structure
• Validation
  – Simulator and analytical model
  – Prototype of preservation/restoration on Cray XK7
Autotuned CDs perform well
[Figure: performance efficiency (0% to 80%) versus peak system performance (2.5PF, 10PF, 40PF, 160PF, 640PF, 1.2EF, 2.5EF), one panel per workload. NT: CDs; h-CPR, 80%. SpMV: CDs; h-CPR, 50%. HPCCG: CDs; h-CPR, 10%; g-CPR, 10%.]
SpMV, HPCCG: local recovery and partial preservation
[Figure: storage hierarchy levels: Disk, Remote NVM, Local NVM, DRAM]
Partial preservation via sibling or parent where appropriate
NT: hierarchical local recovery and partial preservation
[Figure: storage hierarchy levels: Disk, Remote NVM, Local NVM, DRAM]
Partial preservation via sibling, parent, or regeneration where appropriate
CDs improve energy efficiency at scale
[Figure: energy overhead (0% to 20%) versus peak system performance (2.5PF, 10PF, 40PF, 160PF, 640PF, 1.2EF, 2.5EF), one panel per workload. NT: CDs; h-CPR, 80%. SpMV: CDs; h-CPR, 50%. HPCCG: CDs; h-CPR, 10%; g-CPR, 10%.]
10X failure rate emphasizes CD benefits
[Figure: performance efficiency and energy overhead (both 0% to 100%) versus peak performance (2.5PF, 10PF, 40PF, 160PF, 640PF, 1.2EF, 2.5EF), one panel per workload. NT: CDs; h-CPR, 80%. SpMV: CDs. HPCCG: CDs.]
More in the paper
• Strict vs. relaxed containment domains
• Analytical model details
• Error and machine model details
• Additional sensitivity studies
• Related work discussion
Conclusion
• Containment domains
  – Abstract constructs for resilience concerns & techniques
  – Proportional and application/machine tuned resilience
  – Hierarchical & distributed preservation, restoration, and recovery
  – Analyzable and amenable to automatic optimization
  – Scalable to large systems with high relative energy efficiency
  – Heterogeneous to match emerging architectures
• Good start and exciting work ahead
  – Preservation concept prototyped on Cray XK7
  – Fine-grained CDs for high error rates
  – Compiler optimizations and support
  – Application-specific detection/elision
  – PGAS support and interactions with system
  – Interaction with other models (tasking, DSLs, …)
http://lph.ece.utexas.edu/public/CDs
Questions?
Thank you