12/20/2010

REMEM: REmote MEMory as Checkpointing Storage

Hui Jin, Illinois Institute of Technology
Xian-He Sun, Illinois Institute of Technology
Yong Chen, Oak Ridge National Laboratory
Tao Ke, Illinois Institute of Technology

CloudCom 2010

Outline

Background & Motivation
REMEM Design
Implementation of REMEM on Open MPI
Adaptive Checkpointing Storage Selection
Experimental Results
Conclusions & Future Work

Motivation

Checkpointing is a widely used mechanism for supporting fault tolerance in High-Performance Computing environments. However, it introduces considerable overhead due to expensive I/O access.
For a 1-petaFLOPS system, checkpointing can potentially degrade system performance by 50% [R. Oldfield et al., 2007].
The upcoming Exascale computing environment poses even greater challenges.
10^18 FLOPS of computing power.
Millions of computing components.
Checkpointing on the centralized parallel file system is not scalable.
What if the MTBF < checkpointing cost?

A Detailed Look at Checkpointing Cost

J. Hursey et al., "Interconnect Agnostic Checkpoint/Restart in Open MPI", HPDC 2009

Motivation

Memory-based checkpointing is a promising way to break through the stable-storage bottleneck.
But …
Rarely supported by mainstream checkpointing systems.
Complexity.
Reliability concern.
Excess memory usage.

REMEM

REmote MEMory as Checkpointing Storage.
Seamless integration with existing checkpointing systems.
Flexible switch between disk and remote memory as checkpointing storage.
Consideration of reliability and space efficiency.

REMEM – Design Goals

Reliability: Memory is volatile.
Scalability: Large-scale environment.
Space Efficiency: Memory is precious.
Transparency: Augment existing systems.
Flexibility: Switch between disk and memory.

REMEM Design

REMEM – Node Matching

Reliability:

\[
\frac{2^{k}\, C_{n/2}^{k}}{C_{n}^{k}}
\]

\[
\frac{C_{n-k-1}^{k-1} + C_{n-k-1}^{k}}{C_{n}^{k}}
\]

Z. Chen et al., "Fault Tolerant High Performance Computing by a Coding Approach", PPoPP'05
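
As a rough illustration of how node matching affects reliability, the sketch below assumes mutual (pairwise) matching: the n nodes form n/2 fixed pairs, each node keeps its partner's checkpoint in memory, and k nodes fail uniformly at random. Under those assumptions a checkpoint is lost only when both nodes of a pair fail, so the fraction of survivable failure patterns is 2^k C(n/2, k) / C(n, k). The function names and the pairing (0,1), (2,3), ... are illustrative choices, not taken from the REMEM implementation.

```python
from itertools import combinations
from math import comb


def pair_reliability(n: int, k: int) -> float:
    """Closed-form probability that k random failures never hit both nodes of a pair.

    Assumes n nodes arranged in n/2 disjoint pairs (an assumption of this sketch).
    """
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even number")
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    # Pick which k pairs are hit, then which of the two nodes fails in each pair.
    return (2 ** k) * comb(n // 2, k) / comb(n, k)


def brute_force_reliability(n: int, k: int) -> float:
    """Enumerate every k-node failure set and count those with no fully failed pair."""
    survivable = 0
    total = 0
    for failed in combinations(range(n), k):
        total += 1
        failed_set = set(failed)
        # Hypothetical pairing used here: (0, 1), (2, 3), ..., (n-2, n-1).
        if all(not (a in failed_set and a + 1 in failed_set) for a in range(0, n, 2)):
            survivable += 1
    return survivable / total


if __name__ == "__main__":
    for k in range(0, 5):
        print(k, pair_reliability(8, k), brute_force_reliability(8, k))
```

The brute-force enumeration is only practical for small n; it is included to cross-check the closed form, which drops to zero once k exceeds n/2 because some pair must then lose both nodes.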

REMEM – System Configuration

REMEM: Failure Handling

If a failure occurs on the source node:
If the backup node is healthy, simply recover from remote memory.
If the backup node also fails, load the image from the last disk-based checkpoint.
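
The recovery rule above can be written as a small decision function. This is a minimal sketch: the `RecoverySource` enum and `select_recovery_source` are hypothetical names introduced for illustration and do not correspond to any REMEM or Open MPI API.

```python
from enum import Enum


class RecoverySource(Enum):
    """Where a failed process image is restored from (illustrative labels)."""
    NONE = "no recovery needed"
    REMOTE_MEMORY = "remote memory on the backup node"
    DISK = "last disk-based checkpoint"


def select_recovery_source(source_failed: bool, backup_failed: bool) -> RecoverySource:
    """Pick the recovery source for one source node and its backup node."""
    if not source_failed:
        # The source node is still running, so nothing has to be restored.
        return RecoverySource.NONE
    if not backup_failed:
        # The in-memory copy on the backup node is intact: the fast path.
        return RecoverySource.REMOTE_MEMORY
    # Both copies in memory are gone: fall back to the disk checkpoint.
    return RecoverySource.DISK


if __name__ == "__main__":
    print(select_recovery_source(source_failed=True, backup_failed=False).value)
```

The fallback branch is why disk checkpoints are still taken alongside remote-memory ones: memory alone cannot cover the case where a node and its backup fail together.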

REMEM: Implementation on Open MPI

An open source MPI-2 implementation that provides a high-performance, robust, parallel execution environment for a wide variety of computing environments.
Supports transparent, coordinated checkpoint/restart, implemented primarily on top of the BLCR library.

REMEM: Implementation on Open MPI

Adaptive Checkpointing Storage Selection

Disk:
Memory:

Experimental Setup

Hardware: a 65-node SunFire cluster.
Compute nodes: dual 2.3GHz Opteron quad-core processors, 8GB memory, and a 250GB 7.2K-RPM SATA hard drive.
OS: Ubuntu enterprise server with Linux kernel 2.6.10
Software: Open MPI v1.3.3 and GCC 4.3.3
REMEM was implemented on Open MPI with the support of tmpfs and NFS 3.0.

Experimental Setup

The 64 compute nodes are naturally organized into two groups by rack ID.
The nodes from the two groups are mutually mapped for REMEM.
4 dedicated X2200 nodes are configured as PVFS2 servers.
Results were obtained for the NAS Parallel Benchmarks (NPB) version 3.3.
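
One way to picture the mutual mapping between the two rack groups is the sketch below. It assumes the two groups are the same size (32 nodes each) and pairs nodes by position, so every node backs up to a node in the other rack and vice versa. The host names and the `mutual_mapping` helper are hypothetical; the slides do not give the actual pairing rule.

```python
def mutual_mapping(group_a: list[str], group_b: list[str]) -> dict[str, str]:
    """Pair the i-th node of one group with the i-th node of the other, in both directions."""
    if len(group_a) != len(group_b):
        raise ValueError("both groups must contain the same number of nodes")
    backup_of: dict[str, str] = {}
    for a, b in zip(group_a, group_b):
        backup_of[a] = b  # a stores its checkpoint in b's memory
        backup_of[b] = a  # b stores its checkpoint in a's memory
    return backup_of


if __name__ == "__main__":
    # Hypothetical host names for two racks of 32 compute nodes each.
    rack0 = [f"rack0-node{i:02d}" for i in range(32)]
    rack1 = [f"rack1-node{i:02d}" for i in range(32)]
    backup_of = mutual_mapping(rack0, rack1)
    print(backup_of["rack0-node05"])
```

Pairing across racks means a failure confined to one rack leaves every in-memory checkpoint copy reachable in the other.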

REMEM Performance

Problem Size Scaling Performance

Task Scaling Performance

Adaptive Checkpointing Storage Selection

Simulate a cluster of 2048 nodes.
For each node, we generate a series of failure arrivals following a Weibull distribution.
MTBF = 7668 hours; shape parameter = 0.7
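
A failure trace of this kind could be generated roughly as follows. The sketch assumes that the stated MTBF is per node, that inter-arrival times on a node are independent Weibull draws, and that the Weibull scale is chosen so the mean matches the MTBF (mean = scale × Γ(1 + 1/shape)). The simulation horizon, seed, and function names are illustrative and not taken from the original study.

```python
import math
import random

MTBF_HOURS = 7668.0   # mean time between failures, from the slide
SHAPE = 0.7           # Weibull shape parameter, from the slide
# Scale chosen so that the distribution mean equals the MTBF.
SCALE = MTBF_HOURS / math.gamma(1.0 + 1.0 / SHAPE)


def failure_arrivals(horizon_hours: float, rng: random.Random) -> list[float]:
    """Failure times (in hours) of one node within the simulated horizon."""
    arrivals = []
    # random.weibullvariate(alpha, beta): alpha is the scale, beta is the shape.
    t = rng.weibullvariate(SCALE, SHAPE)
    while t <= horizon_hours:
        arrivals.append(t)
        t += rng.weibullvariate(SCALE, SHAPE)
    return arrivals


def simulate_cluster(num_nodes: int = 2048,
                     horizon_hours: float = 24.0 * 365,
                     seed: int = 0) -> list[list[float]]:
    """One list of failure times per node; horizon and seed are assumed values."""
    rng = random.Random(seed)
    return [failure_arrivals(horizon_hours, rng) for _ in range(num_nodes)]


if __name__ == "__main__":
    trace = simulate_cluster()
    print(sum(len(node) for node in trace), "failures across the cluster")
```

A shape parameter below 1 gives a decreasing hazard rate, so failures cluster shortly after a previous failure instead of arriving at a constant rate.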

Adaptive Checkpointing Storage Selection – Metrics

[Figure labels: Useful Work, Checkpoint, Restart Cost, Rework Cost]

Adaptive Checkpointing Storage Selection

Performance with Different Number of Processes

Adaptive Checkpointing Storage Selection

Performance with Different Number of I/O Nodes

Adaptive Checkpointing Storage Selection

Performance with Different Checkpointing Interval

Future Work

Release the software.
More flexible node matching.
What does HPC checkpointing look like in the cloud?
Adopt MapReduce as checkpointing storage?

Conclusions

It is feasible to implement memory-based checkpointing seamlessly.
Remote memory is a promising alternative to disk as checkpointing storage.
Memory should be used in combination with disk to guarantee reliability while achieving efficiency.

Thanks!
Questions?
http://www.cs.iit.edu/~scs
