22
ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared- Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University of Illinois at Urbana-Champaign Hewlett-Packard Laboratories Isaac Liu

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Embed Size (px)

Citation preview

Page 1: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

Milos Prvulovic, Zheng Zhang, Josep Torrellas

University of Illinois at Urbana-Champaign

Hewlett-Packard Laboratories

Isaac Liu

Page 2: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

IntroductionTargeting large scale applications

that provide services (need high availability)

Improvements in silicon technology make modern integrated circuits prone to transient and permanent faults

FER vs. BER ◦Hardware redundancy vs. recovery

Page 3: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

ReVive designGoal: Cost-effective general-

purpose rollback recovery◦Modest amount of hardware (cost-

effective)◦Recovery from a wide class of errors

(General-purpose)◦Short system downtime due to error

(high availability)◦Low overhead when error-free (high

performance)

Page 4: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Hardware Modifications

Page 5: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Design Choices◦Checkpoint Storage:

Safe Internal Storage with Distributed parity

Safe External Specialized fault class

◦Checkpoint Separation: Partial separation with Logging Full separation Partial separation with buffering (renaming)

◦Checkpoint Consistency: Global (Un) Coordinated Local

Page 6: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

OverviewPeriodically establish checkpointBetween checkpoints, whenever

main memory written to, log the data to maintain checkpoint state.

If error is detected, then use the logs to roll back state.

Page 7: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Design Choices◦Checkpoint Storage:

Safe Internal Storage with Distributed parity

◦Checkpoint Separation: Partial separation with Logging

◦Checkpoint Consistency: Global

Page 8: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Distributed Parity

Page 9: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Design Choices◦Checkpoint Storage:

Safe Internal Storage with Distributed parity

◦Checkpoint Separation: Partial separation with Logging

◦Checkpoint Consistency: Global

Page 10: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Logging

Page 11: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Design Choices◦Checkpoint Storage:

Safe Internal Storage with Distributed parity

◦Checkpoint Separation: Partial separation with Logging

◦Checkpoint Consistency: Global Checkpoint

Page 12: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Global checkpointCommit all work and states to

main memory.Two phase commit protocol, first

sync is tentative commit, and then sync again to fully commit.

Keeps two most recent checkpoints.

Page 13: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Global Checkpoint

Page 14: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Implementation issuesExtra L bit for each directory

entryNew states in directory protocol,

new messages (parity update/ack)

Race Conditions◦Log-Data Update race◦Atomic Log Update Race◦Log-Parity Update Race◦Data-Parity Update Race◦Checkpoint commit Race

Page 15: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Rollback

Page 16: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

OverheadLogging and parity maintenance

◦Depends on applicationGlobal Checkpoint

◦cross-processor interrupt◦Write dirty data to memory

Rollback◦Recovery + Lost work + Rebuild lost

memory pages

Page 17: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Evaluation environmentCC-NUMA multiprocessor with 16

nodesNon-blocking and write-back

cacheFull-map directory and cache

coherent protocol similar to DASH.

Cache size: ◦16KB for L1, 128kB for L2

*Applications run on smaller problems sizes and shorter periods

Page 18: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Evaluation Results

•Cp10ms – Parity and checkpoint every 10ms•CpInf – Parity and checkpoint with infinite interval•Cp10msM – Mirror and checkpoint every 10ms•CpInfM –Mirror and checkpoint with infinite interval

Page 19: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Traffic

•Par – parity updates•Ckp – checkpoint•WB – writeback•RD/RDX- cache miss•LOG – writing to logs

Page 20: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

Overhead

Page 21: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

ReVive vs. SafetyNetBoth use log-based rollback

mechanismsReVive enables recovery from a

permanent nodeReVive does not need to change

processor’s cacheReVive is more general, so it may

result in larger performance overhead.

Page 22: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University

ConclusionReVive provides:

◦Modest amount of hardware (cost-effective)

◦Recovery from a wide class of errors (General-purpose)

◦Short system downtime due to error (high availability)

◦Low overhead when error-free (high performance)