
Page 1: Toward Exascale Resilience - University of Texas at Austin

Toward Exascale Resilience, Part 8: Containment Domains / cross-layer schemes

Mattan Erez

The University of Texas at Austin

July 2015

Page 2: Toward Exascale Resilience - University of Texas at Austin

•Credit to:

•UT Austin students:

– Benjamin Cho, Jinsuk Chung, Ali Fakhrzadehgan, Ikhwan Lee, Kyushick Lee, Seong-Lyong Gong, Mike Sullivan, Song Zhang, Doe Hyun Yoon (now at Google)

•Collaborators (growing list)

– Cray, NVIDIA, ETI

– LBNL: Brian Austin, Dan Bonachea, Paul Hargrove, Sherry Li, Eric Roman

•Funding agencies

– DOE ECRP, XStack, FF, PSAAP II

– Initial funding from DARPA UHPC


Page 3: Toward Exascale Resilience - University of Texas at Austin

•The constraints:

– Power/energy

– Time

– Money

– Correctness


Page 4: Toward Exascale Resilience - University of Texas at Austin

•Resilience is a big challenge for DOE computations


Page 5: Toward Exascale Resilience - University of Texas at Austin

Something bad every ~minute at DOE scale


[Chart: projected MTTI in hours (log scale, 1 to 1000) by year and expected performance: 2013 (15PF), 2015 (40PF), 2017 (130PF), 2019 (450PF), 2021 (1500PF), with separate curves for hard errors, soft errors, intermittent SW errors, and the overall rate.]

Page 6: Toward Exascale Resilience - University of Texas at Austin

•The baseline: checkpoint-restart

•Not good enough on its own


Page 7: Toward Exascale Resilience - University of Texas at Austin

•Failure rate too high for checkpoint/restart

•Correctness also at risk


[Chart: performance efficiency (0% to 100%) for CDs (NT) versus h-CPR (80%) and g-CPR (80%).]

Page 8: Toward Exascale Resilience - University of Texas at Austin

•Energy also problematic


[Chart: energy overhead (0% to 20%) across machine scales from 2.5PF to 2.5EF for CDs (NT), h-CPR (80%), and g-CPR (80%).]

Page 9: Toward Exascale Resilience - University of Texas at Austin

•The cost of resilience

– Preparation

– Detection

– Mitigation (repair + recover)

– Implementation


Page 10: Toward Exascale Resilience - University of Texas at Austin

•Software?

• Hardware?

• Algorithm?


Page 11: Toward Exascale Resilience - University of Texas at Austin

•Software?

• Hardware?

• Algorithm?

•Containment Domains: adaptive holistic approach

– Per-experiment balance of energy, time, money, correctness


Page 12: Toward Exascale Resilience - University of Texas at Austin

•Can hardware alone solve the problem?

•Yes, but costly

– Significant and likely fixed overheads

– May not be needed in many commercial settings


Page 13: Toward Exascale Resilience - University of Texas at Austin

•Fixed overhead examples (estimated): both energy and/or throughput

– Up to ~25% chipkill correct vs. chipkill detect

– 20 – 40% for pipeline SDC reduction

– >2X for arbitrary correction

– Even greater overhead if protecting approximate units


Page 14: Toward Exascale Resilience - University of Texas at Austin

•Something bad every ~minute at DOE

•Something bad every year commercially

– Smaller units of execution

– Different requirements


Page 15: Toward Exascale Resilience - University of Texas at Austin

•Locality and hierarchy are key

– Hierarchical constructs

– Distributed operation

•Range of correctness requirements


Page 16: Toward Exascale Resilience - University of Texas at Austin

•What about algorithmic resilience?

– Algorithmic detection

– Iterative converging algorithms

– Redundant information

– Probabilistic methods


Page 17: Toward Exascale Resilience - University of Texas at Austin

•Examples on board

– Algorithmic check of matrix multiplication (see the sketch after this list)

– Algorithmic check of a solver

– Convergent calculation

•Simple and basic Newton-Raphson

– Monte Carlo
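For the first board example, one standard technique is a Freivalds-style randomized check: verify C = A*B by testing A*(B*x) = C*x for a random vector x, at O(n^2) cost instead of recomputing the O(n^3) product. This is a generic reconstruction, not code from the talk:

// Freivalds-style check: verify C = A*B by testing A*(B*x) == C*x for a
// random vector x, at O(n^2) cost instead of recomputing the O(n^3) product.
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

// y = M*x for a dense n x n matrix stored row-major.
static std::vector<double> matvec(const std::vector<double>& M,
                                  const std::vector<double>& x, int n) {
  std::vector<double> y(n, 0.0);
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      y[i] += M[i * n + j] * x[j];
  return y;
}

// True if C is consistent with A*B on a random probe vector.
bool check_matmul(const std::vector<double>& A, const std::vector<double>& B,
                  const std::vector<double>& C, int n, double tol = 1e-9) {
  std::vector<double> x(n);
  for (double& v : x) v = static_cast<double>(std::rand()) / RAND_MAX;
  std::vector<double> lhs = matvec(A, matvec(B, x, n), n);   // A*(B*x)
  std::vector<double> rhs = matvec(C, x, n);                 // C*x
  for (int i = 0; i < n; ++i)
    if (std::fabs(lhs[i] - rhs[i]) > tol * (1.0 + std::fabs(rhs[i])))
      return false;   // mismatch: report to the enclosing CD for recovery
  return true;
}

int main() {
  std::vector<double> A = {1, 2, 3, 4}, B = {5, 6, 7, 8};
  std::vector<double> C = {19, 22, 43, 50};     // correct A*B
  std::printf("ok: %d\n", check_matmul(A, B, C, 2));
  C[0] += 1.0;                                  // inject an error
  std::printf("after corruption ok: %d\n", check_matmul(A, B, C, 2));
}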


Page 18: Toward Exascale Resilience - University of Texas at Austin

•But,

– Different apps need different techniques

– Different scales need different techniques

Page 19: Toward Exascale Resilience - University of Texas at Austin

•Need to adapt/co-tune

Page 20: Toward Exascale Resilience - University of Texas at Austin

•Containment Domains elevate resilience to first-class abstraction

– Program-structure abstractions

– Composable resilient program components

– Regimented development flow

– Supporting tools and mechanisms


Page 21: Toward Exascale Resilience - University of Texas at Austin

•Containment Domains

– Abstract resilience constructs that span system layers

– Hierarchical and distributed operation for locality

– Scalable to large systems with high energy efficiency

– Heterogeneous to match disparate error/failure effects

– Proportional and effectively balanced

– Tunable resilience specialized to application/system

– Analyzable and auto-tuned


Page 22: Toward Exascale Resilience - University of Texas at Austin

CDs Embed Resilience within Application

•Express resilience as a tree of CDs

– Match CD, task, and machine hierarchies

– Escalation for differentiated error handling

•Semantics

– Erroneous data never communicated

– Each CD provides recovery mechanism

•Components of a CD (see the sketch after this list)

– Preserve data on domain start

– Compute (domain body)

– Detect faults before domain commits

– Recover from detected errors
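A minimal, self-contained toy of this lifecycle is sketched below. Only the API names (GetCurrentCD, CreateAndBegin, Preserve, CDAssert, Complete) come from these slides; the class body, the bool return from Complete, and the retry loop are illustrative assumptions, not the CD runtime's implementation.

// A toy CD lifecycle: preserve -> compute -> detect -> complete, with
// recovery by re-executing the domain body from its preserved state.
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

enum PreserveMethod { kCopy, kRef };
enum ErrorClass { kSoft };

class CD {
  std::vector<char> saved_;      // preserved bytes (kCopy only, in this toy)
  void* target_ = nullptr;
  std::size_t len_ = 0;
  bool failed_ = false;
 public:
  CD* CreateAndBegin() { return this; }            // begin a domain (toy: self)
  void Preserve(void* p, std::size_t n, PreserveMethod m) {
    if (m == kCopy) {                              // copy-out preservation
      target_ = p;
      len_ = n;
      saved_.assign(static_cast<char*>(p), static_cast<char*>(p) + n);
    }                                              // kRef: parent holds the data
  }
  void CDAssert(bool ok, ErrorClass) { failed_ = failed_ || !ok; }  // detect
  bool Complete() {                                // commit, or restore + retry
    if (failed_ && target_ != nullptr) {
      std::memcpy(target_, saved_.data(), len_);   // restore preserved state
      failed_ = false;
      return false;                                // caller re-executes the body
    }
    return true;
  }
};

CD root;
CD* GetCurrentCD() { return &root; }

int main() {
  double x[4] = {1, 2, 3, 4};
  CD* cd = GetCurrentCD()->CreateAndBegin();
  bool committed = false;
  while (!committed) {                     // recovery = re-execution
    cd->Preserve(x, sizeof(x), kCopy);     // 1. preserve on domain start
    for (double& v : x) v *= 2.0;          // 2. compute (domain body)
    cd->CDAssert(x[0] == 2.0, kSoft);      // 3. detect before the commit
    committed = cd->Complete();            // 4. commit, or restore and retry
  }
  std::printf("x = {%g, %g, %g, %g}\n", x[0], x[1], x[2], x[3]);
  return 0;
}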

[Figure: CD tree with a root CD and child CDs.]

Page 23: Toward Exascale Resilience - University of Texas at Austin

Mapping example: SpMV

23Containment Domains Resilience (c) Mattan Erez

[Figure: matrix M and vector V.]

void task<inner> SpMV(in M, in Vi, out Ri) {
  cd = GetCurrentCD()->CreateAndBegin();
  cd->Preserve(matrix, size, kCopy);      // preserve inputs on domain start
  forall(…) reduce(…)                     // compute: spawn child tasks
    SpMV(M[…], Vi[…], Ri[…]);
  cd->Complete();                         // detect, then commit
}

void task<leaf> SpMV(…) {
  cd = GetCurrentCD()->CreateAndBegin();
  cd->Preserve(M, sizeof(M), kRef);       // by reference: parent preserved M
  cd->Preserve(Vi, sizeof(Vi), kCopy);    // local copy of the input vector
  for r = 0..N
    prevIdx = -1
    for c = rowS[r]..rowS[r+1]
      resi[r] += data[c] * Vi[cIdx[c]];
      cd->CDAssert(cIdx[c] > prevIdx, kSoft);  // cheap check: columns increase
      prevIdx = cIdx[c];
  cd->Complete();
}

Page 24: Toward Exascale Resilience - University of Texas at Austin

Mapping example: SpMV

[Figure: matrix M blocked into M00, M01, M10, M11; vector V split into V0 and V1.]

(Same inner- and leaf-task SpMV code as on page 23.)

Page 25: Toward Exascale Resilience - University of Texas at Austin

Mapping example: SpMV

[Figure: blocked M and V distributed to 4 nodes.]

(Same leaf-task SpMV code as on page 23.)

Page 26: Toward Exascale Resilience - University of Texas at Austin

Mapping example: SpMV

[Figure: blocked M and V distributed to 4 nodes.]

(Same leaf-task SpMV code as on page 23.)

Page 27: Toward Exascale Resilience - University of Texas at Austin

Concise abstraction for complex behavior

Recovery sources: local copy or regeneration, a sibling's preserved data, or unchanged parent data.

(Same leaf-task SpMV code as on page 23.)

Page 28: Toward Exascale Resilience - University of Texas at Austin

•Programming and execution model support


Page 29: Toward Exascale Resilience - University of Texas at Austin

CDs manage preservation, restoration, and re-execution

– Allocate and free storage

– Transfer data

– Manage default error detection

– Call appropriate CD (hierarchy level) on error/fault

– Holistic error reporting

•Specific policies can be written by the user (see the sketch after this list)

– Specialize and tune every aspect of resilience

– Straightforward abstractions
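As a toy illustration of a user-written policy, the sketch below decides per error class whether to handle an error locally or escalate to the parent CD, matching the escalation semantics on page 22. The on_error callback is a hypothetical stand-in, not the CD runtime's interface.

#include <cstdio>
#include <functional>

enum ErrorClass { kSoft, kHard };

struct ToyCD {
  std::function<bool(ErrorClass)> on_error;   // user policy: true = handled here
  void Report(ErrorClass e) {
    if (on_error && on_error(e)) return;      // handled at this CD
    std::puts("unhandled: escalate to parent CD");  // default: escalate
  }
};

int main() {
  ToyCD cd;
  cd.on_error = [](ErrorClass e) {
    if (e == kSoft) {
      std::puts("soft error: re-execute locally");
      return true;
    }
    return false;                             // hard errors escalate
  };
  cd.Report(kSoft);   // handled at this level
  cd.Report(kHard);   // escalates to the parent
}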

CD abstraction amenable to analysis and auto-tuning

– Analytical model fed with application properties


Page 30: Toward Exascale Resilience - University of Texas at Austin

– Annotations, persistence, reporting, recovery, tools


[Diagram: CD runtime system architecture. CD-annotated applications/libraries sit on the CD runtime system, which runs on the hardware. Around the runtime: compiler support, a debugger, and a CD-app mapper; user interaction for customized error detection, handling, tolerance, and injection; an auto-tuner interface (CD auto-tuner) and a profiling/visualization interface (Sight, LWM2 scaling tool). Inside the runtime: error handling with a unified runtime error detector, plus a persistence layer for state preservation, communication logging, and runtime logging over the communication runtime library (Legion + GASNet, Legion + libc, BLCR) and a CD-storage mapping interface. Below: low-level machine checks, the HW/SW interface, and an error-reporting architecture over DRAM, buddy storage, SSD, HDD, and PFS. External tools, internal tools, and future plans are marked in the original figure.]

Page 31: Toward Exascale Resilience - University of Texas at Austin

•CD usage flow

– Annotate

– Profile and extrapolate CD tree

– Supply machine characteristics

– Analyze and auto-tune

•Flexible preservation, detection, and recovery

– Refine tradeoffs and repeat

– Execute and monitor

•CD management and coordination

•Distributed and hierarchical preservation

•Distributed and hierarchical recovery


Page 32: Toward Exascale Resilience - University of Texas at Austin

•CD annotations express intent

– CD hierarchy for scoping and consistency

– Preservation directives and hints exploit locality

– Correctness abstractions

•Detectors and tolerances

– Recovery customization

– Debug/test interface

•Work in progress: http://lph.ece.utexas.edu/users/CDAPI


Page 33: Toward Exascale Resilience - University of Texas at Austin

•State preservation and restoration API (a cost sketch follows after this list)

– Hierarchical

•Per CD (level)

•Match storage hierarchy

•Maximize locality and minimize overhead

– Proportional

•Preserve only when worth it (skip preserve calls)

•Exploit inherent redundancy

•Utilize regeneration
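The "preserve only when worth it" rule reduces to an expected-cost comparison between copying on every domain and regenerating only on (rare) failure, as in this toy calculation with an assumed cost model:

#include <cstdio>

// Expected cost of preserving in every domain instance: pay the copy each time.
double preserve_cost(double copy_time, double n_domains) {
  return copy_time * n_domains;
}

// Expected cost of skipping preservation: regenerate only on failure.
double regen_cost(double regen_time, double n_domains, double p_fail) {
  return regen_time * n_domains * p_fail;
}

int main() {
  double copy = 0.01, regen = 2.0, p_fail = 1e-3, n = 1e6;
  double keep = preserve_cost(copy, n), skip = regen_cost(regen, n, p_fail);
  std::printf("preserve: %.0f  regenerate-on-failure: %.0f -> %s\n",
              keep, skip, keep < skip ? "preserve" : "skip the preserve call");
}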


Page 34: Toward Exascale Resilience - University of Texas at Austin

Hierarchical local recovery and partial preservation

34Containment Domains Resilience (c) Mattan Erez

[Figure: preservation hierarchy spanning DRAM, inter-node NVM, inter-cabinet NVM, and disk.]

Partial preservation via sibling, parent, or regeneration where appropriate

Page 35: Toward Exascale Resilience - University of Texas at Austin

[Figure: recovery from a local copy or regeneration, from a sibling's preserved data, or from unchanged parent data.]

Page 36: Toward Exascale Resilience - University of Texas at Austin

•Correctness abstractions

– Detectors

– Requirements

– Recovery


Page 37: Toward Exascale Resilience - University of Texas at Austin

•What can go wrong?

– Application crash

– Process crash

– Process unresponsive

– Failed communication

– Hardware

•Cache error

•Memory error

•TLB error

•Node offline

•…


Page 38: Toward Exascale Resilience - University of Texas at Austin

•What can go wrong?

– Lost resource

– Wrong value

•Specific address?

•Specific access?

•Specific computation?

– Degraded resource

•Who detects?

•How reported?


Page 39: Toward Exascale Resilience - University of Texas at Austin

•Today: machine check architecture

– (Maskable) interrupts

– Complex encoding of errors / failures

•Spread across many processor-specific state registers

•Very difficult to parse and use

– Currently: level of containment reported

•Enables fine-grained software recovery

•Know before state is corrupted

•Know when only process state is corrupted

– Event counters and triggers for errors

•Root cause analysis


Page 40: Toward Exascale Resilience - University of Texas at Austin

•Today: machine check architecture

– Not suitable for programmers

•Barely suitable for system implementers

•Doable, but tricky and requires a lot of reading

– Varies by vendor

– Continuously updated


Page 41: Toward Exascale Resilience - University of Texas at Austin

•System-provided detectors

– Control response granularity

•User-specified detectors

•Consistent and unified reporting & analysis


Page 42: Toward Exascale Resilience - University of Texas at Austin

•Catch the error as soon as possible

– Less to recover

– Ideally smaller and faster preservation

– Micro-rollbacks

– Idempotent regions (see the sketch after this list)

– Hardware-level rollbacks
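To see why idempotent regions allow effect-free rollback, compare a region that never overwrites its inputs (safe to re-execute from entry) with an in-place update that is not. A minimal illustration:

#include <cstdio>

// Idempotent: writes only to out[], reads only in[]; re-running from the
// start of the region after a fault is always safe.
void scale_idempotent(const double* in, double* out, int n, double s) {
  for (int i = 0; i < n; ++i) out[i] = s * in[i];
}

// NOT idempotent: x[i] *= s destroys its own input, so a re-execution after
// a partial run would scale some elements twice.
void scale_in_place(double* x, int n, double s) {
  for (int i = 0; i < n; ++i) x[i] *= s;
}

int main() {
  double in[3] = {1, 2, 3}, out[3];
  scale_idempotent(in, out, 3, 2.0);
  scale_idempotent(in, out, 3, 2.0);   // simulated re-execution: same result
  std::printf("%g %g %g\n", out[0], out[1], out[2]);   // 2 4 6
}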


Page 43: Toward Exascale Resilience - University of Texas at Austin

•Idempotent regions and hardware rollback

– What if hardware can automatically roll back and re-execute?

•Fine-grained recovery will have little impact on performance

•Users may not need to do anything


Page 44: Toward Exascale Resilience - University of Texas at Austin

•Instruction retry

– Out-of-order processors

– In-order and GPUs?


Page 45: Toward Exascale Resilience - University of Texas at Austin

•Sophisticated out-of-order processors offer ample opportunity for hardware retry

– Speculative execution can be used to recover from soft errors

– ROB and LSQ buffer temporary results

– Transactional memory does too


Page 46: Toward Exascale Resilience - University of Texas at Austin

•Harder in a GPU

– Need to ensure effect-free rollback

•No hardware buffering

– Idempotent regions and CDs

– Tradeoffs with hardware buffering and detection latency


Page 47: Toward Exascale Resilience - University of Texas at Austin

Express correctness intent

•Notifies auto-tuner of detection capability

•Enables error elision

•Auto-add redundancy to meet requested level of reliability

•Customize action (a solver-check sketch follows below)
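For example, a user-specified detector in the spirit of cd->CDAssert on the SpMV slides: a cheap residual check expressing a solver's declared tolerance. A generic sketch, not code from the talk:

#include <cmath>
#include <cstdio>

// ||Ax - b|| <= tol * ||b||: a cheap algorithmic detector for a linear solve,
// where tol is the application's declared correctness requirement.
bool solver_check(const double* A, const double* x, const double* b,
                  int n, double tol) {
  double r2 = 0, b2 = 0;
  for (int i = 0; i < n; ++i) {
    double ri = -b[i];
    for (int j = 0; j < n; ++j) ri += A[i * n + j] * x[j];
    r2 += ri * ri;
    b2 += b[i] * b[i];
  }
  return std::sqrt(r2) <= tol * std::sqrt(b2);
}

int main() {
  double A[4] = {2, 0, 0, 2}, b[2] = {2, 4}, x[2] = {1, 2};  // exact solution
  std::printf("ok=%d\n", solver_check(A, x, b, 2, 1e-12));
}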


Page 48: Toward Exascale Resilience - University of Texas at Austin

•Analogues to approximate computing research

– Compiler techniques for approximate computing

– Propagate loss of accuracy

– Propagate loss of reliability


Page 49: Toward Exascale Resilience - University of Texas at Austin

•Debug, test, and tools

– Error and failure injection

– Planned integration with low-level injection

– CD profiler, viz, models, and initial tuner in place


Page 50: Toward Exascale Resilience - University of Texas at Austin

•Quick(ish) way to search the error space

– Multi-mode simulation

– Skip over detectable errors

– Tool to be released

•Uses only public tools

Page 51: Toward Exascale Resilience - University of Texas at Austin


Page 52: Toward Exascale Resilience - University of Texas at Austin

Machine and error models


Component | "Performance"        | Error                   | Error Scaling
----------+----------------------+-------------------------+--------------------------
Core      | 10 GFLOP/core        | Soft error              | ∝ #cores
Memory    | 1 GB/core            | ECC fail                | ∝ #DRAM chips
Socket    | 200 GB/s per socket  | Hard/OS crash           | ∝ #sockets
System    | Hierarchical network | Power module or network | ∝ #modules and #cabinets

Page 53: Toward Exascale Resilience - University of Texas at Austin

•Input 1: machine configuration

– Physical and storage hierarchies (capacity and BW)

– Error/failure rates at each level of hierarchy

– Simple power model

•Input 2: application description

– CD tree, including loops of CDs

– Preservation volumes and possible method

– Overlap of preservation and detection with parent

– Execution time estimate

•Analytic model for CD behavior (a first-order sketch follows after this list)

– Overheads from preservation, detection, and recovery

•Output: efficiency

– Performance, energy, memory
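One plausible first-order form of such a model, stated as an assumption rather than the talk's exact equations: for a CD with compute time t, preservation time p, recovery time r, and error rate λ (errors rare, λ(t+p) ≪ 1),

\[
T_{\mathrm{CD}} \;\approx\; t + p + \lambda\,(t+p)\left(\frac{t+p}{2} + r\right),
\qquad
\mathrm{efficiency} \;\approx\; \frac{t}{T_{\mathrm{CD}}}.
\]

Here λ(t+p) is the expected number of errors during the domain, each costing on average half the domain plus recovery; the auto-tuner's job is to pick CD sizes and preservation methods that minimize this sum over the whole CD tree.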


Page 54: Toward Exascale Resilience - University of Texas at Austin

Error Failure Recovery


Page 55: Toward Exascale Resilience - University of Texas at Austin

Analytic Model

•Leverage hierarchy and CD semantics

– Uncoordinated "local" actions

– Solve inside out

•Application abstracted to CDs

– CD tree

– Volumes of preservation, computation, and communication

– Preservation and recovery options per CD

•Machine model

– Storage hierarchy

– Communication hierarchy

– Bandwidths and capacities

– Error processes and rates



Page 56: Toward Exascale Resilience - University of Texas at Austin

Power model

•CDs that are not re-executing may remain idle

•Actively executing a CD has a relative power of 1

•A node that is idling consumes a relative power of 𝛼

– In our experiments 𝛼 = 0.25 (a worked form follows below)
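A worked form consistent with these definitions (an illustrative assumption, not the talk's exact equation): with N parallel domains running for base time T at relative power 1, and one fault forcing a single domain to re-execute for time t_r while the other N-1 domains idle at power α,

\[
E_{\mathrm{rel}} \;=\; \frac{N\,T + t_r\bigl(1 + \alpha\,(N-1)\bigr)}{N\,T},
\qquad \alpha = 0.25,
\]

so uncoordinated local re-execution costs roughly t_r(1 + α(N-1)) extra energy units instead of the N·t_r that a fully coordinated rollback of all domains would burn.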


[Figure: timeline of parallel domains: execution, then re-execution of one domain after a fault while the remaining domains idle.]

Page 57: Toward Exascale Resilience - University of Texas at Austin

•SPMD-oriented analytical model and tuner

– Extrapolated profile

– Machine characteristics

– Tuning space and models


Page 58: Toward Exascale Resilience - University of Texas at Austin

•Auto-tuned cross-layer resilience!

– Iterate with error injection

– Intelligent search exploration


Page 59: Toward Exascale Resilience - University of Texas at Austin

•Execution model progress

– Building systems is hard and tricky

– Limited release of single-node runtime

– MPI runtime very close

•Lots of distributed programming issues

•Lots of issues from the current (sad) state of fault tolerance

– Open source soon on Bitbucket

•Initially only for soft errors


Page 60: Toward Exascale Resilience - University of Texas at Austin

•Already useful and collaborations in progress

– Reaching down to hardware in FF2

– Global address space with DEGAS

– Task-based execution in Legion and SWARM

– DSL-facing in Stanford’s PSAAP II

– Algorithmic approach within TOORSES


Page 61: Toward Exascale Resilience - University of Texas at Austin

•TOORSES fault-tolerant hierarchical solver

– Brian Austin, Eric Roman, and Xiaoye (Sherry) Li, LBNL

– Hierarchical semi-separable representation


Page 62: Toward Exascale Resilience - University of Texas at Austin

•Add CDs at different granularities

– Hierarchical and partial preservation

•Add algorithmic and cheap detection

•Compare to:

– Algorithmic recovery with redundant computation

[Chart: performance efficiency (0% to 100%) versus error injection rate (1E-3 to 1E+0 errors/s) for coarse, medium, and fine CD granularities and an encoded variant.]

Page 63: Toward Exascale Resilience - University of Texas at Austin

LULESH CD mapping example


Page 64: Toward Exascale Resilience - University of Texas at Austin

Auto-tuned CDs perform well

[Charts: performance efficiency (0% to 100%) versus peak system performance (2.5PF to 2.5EF) for three applications, CDs versus h-CPR/g-CPR: NT (80%), SpMV (50%), and HPCCG (10%).]

Page 65: Toward Exascale Resilience - University of Texas at Austin

CDs improve energy efficiency at scale

[Charts: energy overhead (0% to 20%) versus peak system performance (2.5PF to 2.5EF) for NT, SpMV, and HPCCG; CDs versus h-CPR/g-CPR at 80%, 50%, and 10% respectively.]

Page 66: Toward Exascale Resilience - University of Texas at Austin

10X failure rate emphasizes CD benefits

[Charts: at 10X failure rates, performance efficiency and energy overhead (0% to 100%) versus peak performance (2.5PF to 2.5EF) for NT (CDs versus h-CPR 80%), SpMV (h-CPR 50%), and HPCCG (h-CPR 10%).]

Page 67: Toward Exascale Resilience - University of Texas at Austin

•What if my application has many barriers?

– Can’t really form a tree?


Page 68: Toward Exascale Resilience - University of Texas at Austin

SpMV: local recovery and partial preservation


[Figure: preservation hierarchy spanning DRAM, local NVM, remote NVM, and disk.]

Partial preservation via sibling or parent where appropriate

Page 69: Toward Exascale Resilience - University of Texas at Austin

Inter-CD communication?


•Relaxed CDs enable inter-CD communication (a logging sketch follows below)

– Maintain CD semantics with uncoordinated recovery

– Some data "preserved" via logging

– All communicated data still verified to be correct

•Strict CDs do not communicate

– Only communicate when in the same CD context

– Overheads for strict containment can be high

[Figure: outer CDs containing inner CDs; synchronization and communication crosses inner-CD boundaries.]
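A toy sketch of the logging idea behind relaxed CDs: values received from a sibling are recorded during the first execution so that, after a local fault, the receiver replays them and re-executes without forcing the sender to roll back. The MsgLog structure and its members are illustrative assumptions:

#include <cstdio>

struct MsgLog {
  double log[16];      // values received during the first execution
  int count = 0;
  int replay = -1;     // >= 0 while re-executing
  double Recv(double live_value) {
    if (replay >= 0 && replay < count) return log[replay++];  // replay from log
    log[count++] = live_value;                                // record
    return live_value;
  }
  void StartReplay() { replay = 0; }
};

int main() {
  MsgLog l;
  double first = 0, second = 0;
  double incoming[3] = {1.0, 2.0, 3.0};
  for (double v : incoming) first += l.Recv(v);     // first execution: record
  l.StartReplay();                                  // local fault: re-execute
  for (int i = 0; i < 3; ++i) second += l.Recv(0);  // 0 ignored: replayed
  std::printf("first=%g replayed=%g\n", first, second);  // 6 and 6
}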

Page 70: Toward Exascale Resilience - University of Texas at Austin

SpMV: local recovery and partial preservation

[Figure: the same preservation hierarchy as page 68 (DRAM, local NVM, remote NVM, disk), with partial preservation via sibling or parent where appropriate, now using relaxed CDs.]

Page 71: Toward Exascale Resilience - University of Texas at Austin

Fun with logging protocols


Page 72: Toward Exascale Resilience - University of Texas at Austin

•What about tasks?

– CDs are a great natural fit

•CDs + Legion

– Stanford project led by Alex Aiken

•CDs + Swarm

– Spinoff from UDel led by Guang Gao

•Perhaps also with *SS / Nachos

– Barcelona Supercomputing Center


Page 73: Toward Exascale Resilience - University of Texas at Austin

•Legion resilience

– Propagate failures up the dependence chain

– Utilize region copies to minimize re-executions


Page 74: Toward Exascale Resilience - University of Texas at Austin


•Legion + CDs resilience

– Model-guided management of copies

– Optimized re-execution propagation stop points

– Detection and specification semantics

– Integration with other resilience mechanisms

Page 75: Toward Exascale Resilience - University of Texas at Austin

•Use Legion copies for CD preservation

•Optimize for efficiency

– When to add copies

– Where to put copies to survive failures (see the sketch after this list)

– When to free copies

•Account for different failure modes and rates
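The placement question can be made concrete: a preserved copy only helps if it sits outside the failure domain it must survive; a same-node copy covers soft errors but not a node crash. A toy sketch with assumed structure:

#include <cstdio>

struct Copy { int node; };    // where a preserved copy lives

// A node crash wipes everything on that node: a same-node copy covers soft
// errors but not node failures; a remote copy covers both.
bool SurvivesNodeFailure(const Copy& c, int failed_node) {
  return c.node != failed_node;
}

int main() {
  Copy same_node{0}, remote{3};
  int failed = 0;   // node 0 crashes
  std::printf("same-node copy survives: %s\n",
              SurvivesNodeFailure(same_node, failed) ? "yes" : "no");
  std::printf("remote copy survives:    %s\n",
              SurvivesNodeFailure(remote, failed) ? "yes" : "no");
}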


Page 76: Toward Exascale Resilience - University of Texas at Austin


Page 77: Toward Exascale Resilience - University of Texas at Austin

•Assumption/fear: reliability bounds performance

– Errors may corrupt results and failures kill applications

•What is the error rate?

– Like today: keep ignoring the problem

– Much higher: need detection and recovery

– CDs abstract, scalable, and tunable

•What is the failure rate?

– Like today: hierarchical checkpoint restart

– Higher: specialize preservation and recovery

– CDs are portable and tunable

•Is it really a problem?

– CDs are general and analyzable

– CDs are composable?


Page 78: Toward Exascale Resilience - University of Texas at Austin

Conclusion

•Containment domains

– Abstract constructs for resilience concerns & techniques

– Proportional and application/machine tuned resilience

– Hierarchical & distributed preservation, and recovery

– Analyzable and amenable to automatic optimization

– Scalable with high relative energy efficiency

– Heterogeneous to match emerging architectures


http://lph.ece.utexas.edu/public/CDs

Page 79: Toward Exascale Resilience - University of Texas at Austin

•Thank You!

– Please find the slides at https://lph.ece.utexas.edu/merez/MattanErez/ExacaleResilienceShort0715
