
12/20/2010 1

REMEM: REmote MEMory as Checkpointing Storage

Hui Jin, Illinois Institute of Technology
Xian-He Sun, Illinois Institute of Technology
Yong Chen, Oak Ridge National Laboratory
Tao Ke, Illinois Institute of Technology

CloudCom 2010


Outline

Background & Motivation
REMEM Design
Implementation of REMEM on Open MPI
Adaptive Checkpointing Storage Selection
Experimental Results
Conclusions & Future Work


Motivation

Checkpointing is a widely used mechanism to support fault tolerance in High-Performance Computing environments. However, it introduces considerable overhead due to the expensive I/O access cost.
For a 1-petaFLOPS system, checkpointing can potentially degrade system performance by 50%. [R. Oldfield et al., 2007]
The upcoming Exascale computing environment poses even greater challenges.
10^18 FLOPS computing power.
Millions of computing components.
Checkpointing on the centralized parallel file system is not scalable.
What if the MTBF < checkpointing cost?


A Detailed Look at Checkpointing Cost

J. Hursey et al., "Interconnect Agnostic Checkpoint/Restart in Open MPI", HPDC 2009


Motivation

Memory-based checkpointing is a promising solution to break through the stable-storage bottleneck. But …
Rarely supported by mainstream checkpointing systems.
Complexity.
Reliability concern.
Excess memory usage.


REmote MEMory as Checkpointing Storage.
Seamless integration with existing checkpointing systems.
Flexible switch between disk and remote memory as checkpointing storage.
Consideration of reliability and space efficiency.


REMEM – Design Goals

Reliability: Memory is volatile.
Scalability: Large-scale environment.
Space Efficiency: Memory is precious.
Transparency: Augment existing systems.
Flexibility: Switch between disk and memory.


REMEM Design


REMEM – Node Matching


Reliability: [equation in terms of n and k]

Z. Chen et al., "Fault Tolerant High Performance Computing by a Coding Approach", PPoPP'05
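One way to quantify the reliability of pairwise node matching is sketched below. The model is an assumption, not taken from the slide: n nodes are matched in disjoint pairs, k nodes fail simultaneously and uniformly at random, and remote-memory checkpoints survive only if no pair loses both of its nodes. Under that model the survival probability is 2^k C(n/2, k) / C(n, k). The function name `survival_probability` is hypothetical.

```python
from math import comb


def survival_probability(n: int, k: int) -> float:
    """Probability that k random node failures hit no matched pair twice.

    Assumed model (not from the slides): n nodes form n/2 disjoint pairs,
    and the k failed nodes are chosen uniformly at random.
    """
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even number of nodes")
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    # Choose k distinct pairs, then one of the two nodes in each pair.
    favorable = (2 ** k) * comb(n // 2, k)
    return favorable / comb(n, k)


if __name__ == "__main__":
    for failures in (1, 2, 4, 8):
        print(failures, survival_probability(64, failures))
```

A single failure can never break a pair, so the probability is 1 for k = 1 and drops as k grows. It reaches 0 once k exceeds n/2.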

REMEM – System Configuration


REMEM: Failure Handling

If a failure occurs on the source node:
If the backup node is healthy, simply recover from remote memory.
If the backup node also fails, load the image from the last disk-based checkpoint.
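The failure-handling rule can be sketched as a small decision function. This is a minimal illustration, not REMEM's actual code: the names `plan_recovery` and `backup_of` and the host names are hypothetical, and the decision is made per node for simplicity.

```python
def plan_recovery(failed_nodes, backup_of):
    """Pick a recovery source for every failed source node.

    failed_nodes: set of node names that have failed.
    backup_of:    dict mapping each source node to its backup node
                  (hypothetical structure, assumed for this sketch).
    """
    plan = {}
    for node in sorted(failed_nodes):
        backup = backup_of[node]
        if backup in failed_nodes:
            # The remote-memory copy is lost too: fall back to disk.
            plan[node] = "disk"
        else:
            # The backup node is healthy: restore from its memory.
            plan[node] = "remote_memory"
    return plan


if __name__ == "__main__":
    mapping = {"a0": "b0", "b0": "a0", "a1": "b1", "b1": "a1"}
    print(plan_recovery({"a0"}, mapping))
    print(plan_recovery({"a1", "b1"}, mapping))
```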


REMEM: Implementation on Open MPI

Open source MPI-2 implementation that provides a high-performance, robust, parallel execution environment for a wide variety of computing environments.
Supports a transparent, coordinated checkpoint/restart implementation, supported primarily by the BLCR library.


REMEM: Implementation on Open MPI


Adaptive Checkpointing Storage Selection

Disk:
Memory:


Experimental Setup

Hardware: a 65-node SunFire cluster.
Compute nodes: dual 2.3GHz Opteron quad-core processors, 8GB memory, and a 250GB 7.2K-RPM SATA hard drive.
OS: Ubuntu enterprise server with Linux kernel 2.6.10.
Software: Open MPI v1.3.3 and GCC 4.3.3.
REMEM was implemented on Open MPI with the support of tmpfs and NFS 3.0.


Experimental Setup

The 64 compute nodes are naturally organized into two groups by rack ID. The nodes of the two groups are mutually mapped for REMEM.
4 dedicated X2200 compute nodes are configured as PVFS2 servers.
Results were obtained for the NAS Parallel Benchmarks (NPB) version 3.3.
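The mutual mapping between the two rack groups can be illustrated as follows. This sketch assumes index-wise pairing (the i-th node of one rack is matched with the i-th node of the other), which the slide does not specify. The host names and the function name `mutual_mapping` are hypothetical.

```python
def mutual_mapping(group_a, group_b):
    """Pair nodes of two equally sized groups so each backs up the other."""
    if len(group_a) != len(group_b):
        raise ValueError("both groups must contain the same number of nodes")
    backup_of = {}
    for node_a, node_b in zip(group_a, group_b):
        # Each node stores its checkpoint in its partner's memory.
        backup_of[node_a] = node_b
        backup_of[node_b] = node_a
    return backup_of


if __name__ == "__main__":
    # Hypothetical host names: 32 nodes per rack, 64 in total.
    rack0 = [f"rack0-node{i:02d}" for i in range(32)]
    rack1 = [f"rack1-node{i:02d}" for i in range(32)]
    mapping = mutual_mapping(rack0, rack1)
    print(len(mapping), mapping["rack0-node05"])
```

Placing the two halves of every pair in different racks means a failure confined to one rack cannot take out both a node and its backup.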


REMEM Performance


Problem Size Scaling Performance


Task Scaling Performance


Adaptive Checkpointing Storage Selection

Simulate a cluster of 2048 nodes.
For each node, we generate a series of failure arrivals following a Weibull distribution.
MTBF = 7668 hours; shape parameter = 0.7
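A per-node failure trace of this kind could be generated roughly as shown below. The sketch makes two assumptions not stated on the slide: inter-arrival times are independent Weibull draws, and the MTBF is the mean of that distribution, so the scale parameter is MTBF / Γ(1 + 1/shape). The function names and the one-year horizon are hypothetical.

```python
import math
import random


def weibull_scale(mtbf_hours: float, shape: float) -> float:
    """Scale parameter whose Weibull mean equals the given MTBF."""
    return mtbf_hours / math.gamma(1.0 + 1.0 / shape)


def failure_arrivals(mtbf_hours, shape, horizon_hours, rng):
    """Failure arrival times (hours) for one node within the horizon."""
    scale = weibull_scale(mtbf_hours, shape)
    arrivals = []
    clock = 0.0
    while True:
        # random.weibullvariate takes (scale, shape) in that order.
        clock += rng.weibullvariate(scale, shape)
        if clock > horizon_hours:
            return arrivals
        arrivals.append(clock)


if __name__ == "__main__":
    rng = random.Random(2010)  # fixed seed for repeatable traces
    traces = [failure_arrivals(7668.0, 0.7, 8760.0, rng) for _ in range(2048)]
    print(sum(len(t) for t in traces), "failures across 2048 nodes in one year")
```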


Adaptive Checkpointing Storage Selection – Metrics


[Figure labels: Useful Work, Checkpoint, Rework Cost, Restart Cost]
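The four cost components can be tallied for a single run as in the sketch below. The model is an assumption made for illustration and is not taken from the slides: checkpoints are taken at a fixed interval of work, a failure discards everything since the last completed checkpoint (counted as rework), and failures that arrive during a restart are ignored. The function `account_run` and all parameter names are hypothetical.

```python
def account_run(total_work, interval, ckpt_cost, restart_cost, failure_times):
    """Split wall-clock time into useful work, checkpoint, rework, restart."""
    failures = sorted(failure_times)
    costs = {"useful": 0.0, "checkpoint": 0.0, "rework": 0.0, "restart": 0.0}
    clock = 0.0   # wall-clock time elapsed so far
    saved = 0.0   # work protected by the last completed checkpoint
    nxt = 0       # index of the next failure not yet handled
    while saved < total_work:
        segment = min(interval, total_work - saved)
        # The final segment finishes the job, so no checkpoint follows it.
        ckpt = ckpt_cost if saved + segment < total_work else 0.0
        end = clock + segment + ckpt
        # Drop failures that fell inside an earlier restart window.
        while nxt < len(failures) and failures[nxt] < clock:
            nxt += 1
        if nxt < len(failures) and failures[nxt] < end:
            # Failure before the checkpoint completed: the segment is lost.
            costs["rework"] += failures[nxt] - clock
            costs["restart"] += restart_cost
            clock = failures[nxt] + restart_cost
            nxt += 1
        else:
            costs["useful"] += segment
            costs["checkpoint"] += ckpt
            saved += segment
            clock = end
    return costs, clock


if __name__ == "__main__":
    costs, wall = account_run(10, 5, 1, 2, [3])
    print(costs, "efficiency =", costs["useful"] / wall)
```

The ratio of useful work to total wall-clock time gives one possible efficiency metric. A cheaper checkpoint (memory instead of disk) lowers the checkpoint term directly and also shrinks the window in which a failure causes rework.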


Performance with Different Number of Processes


Performance with Different Number of I/O Nodes


Performance with Different Checkpointing Interval

Future Work

Release the software.
More flexible node matching.
What does HPC checkpointing look like in the cloud?
Adopt MapReduce as checkpointing storage?


Conclusions

It is feasible to implement memory-based checkpointing seamlessly.
Remote memory is a promising alternative to disk as checkpointing storage.
Memory should be used in combination with disk to guarantee reliability while achieving efficiency.


Thanks!
Questions?
http://www.cs.iit.edu/~scs
