Diskless Checkpointing
15 Nov 2001
Motivation
Checkpointing on stable storage: disk access is a major bottleneck!
Remedies: incremental checkpointing, copy-on-write, compression, memory exclusion, diskless checkpointing
Diskless?
Extra memory is available (e.g. in a NOW, a network of workstations): use memory instead of disk
Good: Network Bandwidth > Disk Bandwidth
Bad: Memory is not stable
Bottom-line
A NOW with (n+m) processors: the application runs on exactly n procs and should proceed as long as
- the number of processors in the system is at least n, and
- the failures occur within certain constraints.
Available processors (n+m) = application processors (n) + chkpnt processors (m)
Overview
Coordinated Chkpnt (Sync-and-Stop)
To checkpoint:
- Application procs: chkpnt the state in memory
- Chkpnt procs: encode the application chkpnts and store the encodings in memory
To recover:
- Non-failed procs roll back, and replacement processors are chosen
- Replacement procs: calculate the chkpnts of the failed procs from the other chkpnts & encodings
Outline
- Application processor chkpnt: disk-based, diskless, incremental, forked (copy-on-write); optimizations
- Encoding the chkpnts: parity (RAID level 5), mirroring, 1-dimensional parity, 2-dimensional parity, Reed-Solomon coding; optimizations
- Results
Application Processor Chkpnt
Goal
The processor should be able to roll back to its most recent chkpnt.
Need to tolerate failures during chkpnt: make sure that each coordinated chkpnt remains valid until the next coordinated chkpnt has completed.
Disk-based Chkpnt
To chkpnt: save all values in the stack, heap, and registers to disk
To recover: overwrite the address space with the stored checkpoint
Space demands: 2M on disk, since the previous chkpnt must be kept until the new one completes
(M: the size of an application processor's address space)
Simple Diskless Chkpnt
To chkpnt: wait until the encoding is calculated, then overwrite the diskless chkpnt in memory
To recover: roll back from the in-memory chkpnts
Space demands: extra M in memory
(M: the size of an application processor's address space)
Incremental Diskless Chkpnt
To chkpnt: initially set all pages R_ONLY; on the first write fault to a page, copy it and set it RW
To recover: restore all RW pages from their copies
Space demands: extra I in memory
(I: the incremental chkpnt size)
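The page-granularity bookkeeping above can be modeled in a few lines. This is a toy sketch (the class, the word-granularity PAGE, and all names are invented for illustration); a real implementation would set pages R_ONLY with mprotect() and catch the write fault in a signal handler:

```python
PAGE = 4  # toy page size in words; a real system uses the MMU page size

class IncrementalCheckpointer:
    """Toy model: after each chkpnt all pages are notionally R_ONLY, so the
    first write to a page records it as dirty; the next chkpnt then saves
    only the dirty pages (cost I, not M)."""

    def __init__(self, memory):
        self.memory = list(memory)
        npages = len(self.memory) // PAGE
        # initial full chkpnt of every page
        self.saved = {p: self.memory[p * PAGE:(p + 1) * PAGE] for p in range(npages)}
        self.dirty = set()

    def write(self, addr, value):
        # stands in for the write fault taken on an R_ONLY page
        self.dirty.add(addr // PAGE)
        self.memory[addr] = value

    def checkpoint(self):
        for p in self.dirty:
            self.saved[p] = self.memory[p * PAGE:(p + 1) * PAGE]
        self.dirty.clear()  # everything notionally back to R_ONLY

    def recover(self):
        # restore all pages written since the last chkpnt (the RW pages)
        for p in self.dirty:
            self.memory[p * PAGE:(p + 1) * PAGE] = self.saved[p]
        self.dirty.clear()
```

Only pages touched between checkpoints are copied, which is where the extra-I (rather than extra-M) space demand comes from.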
Forked Diskless Chkpnt
To chkpnt Application clones itself
To recover Overwrites state with clone’s Or clone assumes the role of
the application Space Demands
Extra 2I in memory
(I: the incremental chkpnt size)
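A minimal POSIX sketch of the forked variant, assuming a Unix os.fork(); pickle stands in for the real encode-and-store step, and all names here are illustrative:

```python
import os
import pickle
import tempfile

def forked_checkpoint(state, path):
    """Clone the process with fork(); the child's copy-on-write image IS the
    chkpnt. The parent resumes at once while the child serializes the frozen
    state and exits."""
    pid = os.fork()
    if pid == 0:                       # child: holds a snapshot frozen at fork time
        with open(path, "wb") as f:
            pickle.dump(state, f)
        os._exit(0)
    return pid                         # parent: keeps computing immediately

state = {"step": 42, "grid": [1, 2, 3]}
path = tempfile.NamedTemporaryFile(delete=False).name
pid = forked_checkpoint(state, path)
state["step"] = 43                     # parent mutates freely after the fork
os.waitpid(pid, 0)                     # wait here only so we can read the chkpnt back
with open(path, "rb") as f:
    ckpt = pickle.load(f)
os.unlink(path)
```

The parent's later writes do not disturb the chkpnt: the kernel's copy-on-write gives the child its own copy of each mutated page, which is why the slide charges 2I rather than M.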
Optimizations
Breaking the chkpnt into chunks: efficient use of memory
Sending diffs (incremental): bitwise XOR of the current copy and the chkpnt copy; unmodified pages XOR to zero and need not be sent
Compressing diffs: unmodified regions of memory yield zero runs that compress well
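The diff optimization can be sketched as follows; the toy PAGE size and the function names are assumptions for illustration, not the paper's code:

```python
PAGE = 4  # toy page size in bytes

def xor_diff(ckpt, current):
    """Bitwise XOR of the chkpnt copy and the current copy, page by page.
    Pages that XOR to all zeros were not modified and are not sent; zero
    runs inside modified pages also compress well."""
    diffs = {}
    for p in range(0, len(ckpt), PAGE):
        d = bytes(a ^ b for a, b in zip(ckpt[p:p + PAGE], current[p:p + PAGE]))
        if any(d):                     # unmodified page: skip entirely
            diffs[p // PAGE] = d
    return diffs

def apply_diff(base, diffs):
    """XOR the diff back onto the base copy to reconstruct the new state."""
    out = bytearray(base)
    for pno, d in diffs.items():
        for i, byte in enumerate(d):
            out[pno * PAGE + i] ^= byte
    return bytes(out)

ckpt = bytes([1, 2, 3, 4, 5, 6, 7, 8])
cur  = bytes([1, 2, 3, 4, 5, 9, 7, 8])   # only the second page changed
diffs = xor_diff(ckpt, cur)
```

Because XOR is its own inverse, the same operation both produces the diff and applies it.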
Application Processor Chkpnt (review)
Simple diskless chkpnt: extra M in memory
Incremental diskless chkpnt: extra I in memory
Forked diskless chkpnt: extra 2I in memory, less CPU activity
Optimizations: chkpnt into chunks, diffs, and compressed diffs
Encoding the chkpnts
Goal
Extra chkpnt processors should store enough information that the chkpnts of failed processors may be reconstructed.
Notation: m = number of chkpnt processors, n = number of application processors
Parity (RAID level 5, m=1)
Notation: b_i^j is the j-th byte of application processor i; b_ckp^j is the j-th byte of the chkpnt processor. Example: n=4, m=1.
To chkpnt: b_ckp^j = b_1^j XOR b_2^j XOR ... XOR b_n^j
On failure of the i-th proc: b_i^j = b_1^j XOR ... XOR b_{i-1}^j XOR b_{i+1}^j XOR ... XOR b_n^j XOR b_ckp^j
Can tolerate: only one processor failure
Remarks: the chkpnt processor is a bottleneck of communication and computation
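The scheme can be sketched in a few lines (function names invented for illustration), computing the parity bytewise across n checkpoints:

```python
from functools import reduce

def parity_encode(chkpnts):
    # b_ckp^j = b_1^j XOR ... XOR b_n^j, bytewise over equal-length chkpnts
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chkpnts))

def parity_recover(survivors, encoding):
    # XOR of the n-1 surviving chkpnts and the encoding rebuilds the lost one
    return parity_encode(survivors + [encoding])

chkpnts = [bytes([1, 2]), bytes([4, 5]), bytes([7, 8]), bytes([9, 0])]  # n = 4
enc = parity_encode(chkpnts)
rebuilt = parity_recover(chkpnts[:2] + chkpnts[3:], enc)  # processor 3 failed
```

Recovery is the same XOR as encoding, which is exactly the RAID level 5 reconstruction rule.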
Mirroring (m=n)
To chkpnt: chkpnt processor i stores a copy of application processor i's chkpnt, b_ckp_i^j = b_i^j. Example: n=m=4.
On failure of the i-th proc: b_i^j = b_ckp_i^j
Can tolerate: up to n processor failures, except the failure of both an application processor and its chkpnt processor
Remarks: fast, no calculation needed
1-Dimensional Parity (1<m<n)
To chkpnt: application processors are partitioned into m groups; the i-th chkpnt processor calculates the parity of the chkpnts in group i. Example: n=4, m=2.
On failure of the i-th proc: same as in parity encoding, within its group
Can tolerate: one processor failure per group
Remarks: more efficient in communication and computation
2-Dimensional Parity
To chkpnt: application processors are arranged logically in a two-dimensional grid; each chkpnt processor calculates the parity of one row or one column. Example: n=4, m=4.
On failure of the i-th proc: same as in parity encoding, along its row or column
Can tolerate: any two-processor failures
Remarks: each chkpnt must reach both a row and a column parity processor, so multicast helps with the communication overhead
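A sketch of row/column parity (names invented for illustration). The interesting case is two failures in the same row, where the row parity alone is useless but each lost chkpnt is still the only loss in its column:

```python
from functools import reduce

def xor_blocks(blocks):
    # bytewise XOR (parity) of equal-length chkpnt blocks
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def grid_parities(grid):
    # grid[r][c] = chkpnt of the application processor in row r, column c
    row_par = [xor_blocks(row) for row in grid]
    col_par = [xor_blocks(col) for col in zip(*grid)]
    return row_par, col_par

grid = [[bytes([1, 2]), bytes([3, 4])],
        [bytes([5, 6]), bytes([7, 8])]]
row_par, col_par = grid_parities(grid)

# lose grid[0][0] and grid[0][1] (same row): recover each via its column
rec00 = xor_blocks([grid[1][0], col_par[0]])
rec01 = xor_blocks([grid[1][1], col_par[1]])
```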
Reed-Solomon Coding (any m)
To chkpnt: take a Vandermonde matrix F with f(i,j) = j^(i-1); the chkpnt words are the matrix-vector product of F and the application chkpnt words
To recover: solve for the missing words with Gaussian elimination
Can tolerate: any m failures
Remarks: arithmetic is performed in a Galois field so that words stay word-sized; noticeable computation overhead
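A compact illustration of the Vandermonde encode / Gaussian-elimination recover cycle, with one deliberate substitution: arithmetic is done in the prime field GF(257) instead of the GF(2^8) a real implementation would use, because modular arithmetic over a prime keeps the sketch readable. All names are invented:

```python
P = 257  # prime-field stand-in; the real scheme uses GF(2^8) arithmetic

def vandermonde(m, n):
    # f(i, j) = j^(i-1) with 1-based i, j; 0-based here: F[i][j] = (j+1)^i mod P
    return [[pow(j + 1, i, P) for j in range(n)] for i in range(m)]

def encode(words, m):
    """Chkpnt words = F * application words (matrix-vector product mod P)."""
    n = len(words)
    F = vandermonde(m, n)
    return [sum(F[i][j] * words[j] for j in range(n)) % P for i in range(m)]

def recover(words, checksums):
    """Fill in up to m missing words (None) by Gaussian elimination mod P."""
    n, m = len(words), len(checksums)
    missing = [j for j, w in enumerate(words) if w is None]
    k = len(missing)
    assert k <= m, "more failures than chkpnt processors"
    F = vandermonde(m, n)
    # Move the known words to the right-hand side of the first k equations.
    aug = []
    for i in range(k):
        rhs = checksums[i] - sum(F[i][j] * words[j]
                                 for j in range(n) if words[j] is not None)
        aug.append([F[i][j] for j in missing] + [rhs % P])
    # Gaussian elimination; inverses via Fermat's little theorem.
    for c in range(k):
        piv = next(r for r in range(c, k) if aug[r][c])
        aug[c], aug[piv] = aug[piv], aug[c]
        inv = pow(aug[c][c], P - 2, P)
        aug[c] = [v * inv % P for v in aug[c]]
        for r in range(k):
            if r != c and aug[r][c]:
                f = aug[r][c]
                aug[r] = [(aug[r][col] - f * aug[c][col]) % P for col in range(k + 1)]
    out = list(words)
    for r, j in enumerate(missing):
        out[j] = aug[r][k]
    return out
```

Any k <= m columns of the Vandermonde matrix are linearly independent, which is what guarantees the system is solvable for any m failures.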
Optimizations
Sending and calculating the encoding in RAID level 5-based encodings (e.g. parity):
(a) DIRECT: all n processors send to the chkpnt processor, which becomes the bottleneck
(b) FAN-IN: pairwise reduction, log(n) steps
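The FAN-IN variant can be sketched as a pairwise reduction tree (names invented for illustration):

```python
def fanin_parity(chunks):
    """Pairwise-XOR reduction: the parity of n chunks in ceil(log2 n)
    communication steps, instead of n-1 sequential sends into one
    bottleneck processor (the DIRECT scheme)."""
    active = list(chunks)
    steps = 0
    while len(active) > 1:
        nxt = [bytes(a ^ b for a, b in zip(active[i], active[i + 1]))
               for i in range(0, len(active) - 1, 2)]
        if len(active) % 2:
            nxt.append(active[-1])   # odd chunk sits out this step
        active = nxt
        steps += 1
    return active[0], steps

parity, steps = fanin_parity([bytes([1]), bytes([2]), bytes([4]), bytes([8])])
```

Each level of the tree halves the number of live chunks, so the step count grows logarithmically in n.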
Encoding the Chkpnts (review)
Parity (RAID level 5, m=1) Only one failure, bottleneck
Mirroring (m=n) Up to n failures (unless both app and chkpnt fail), fast
1-Dimensional Parity One failure per group, more efficient than Parity
2-Dimensional Parity Any two failures, comm overhead w/o multicast
Reed-Solomon Coding Any m failures, computation overhead
DIRECT vs. FAN-IN
Testing Applications (1)
CPU-Intensive parallel programs Instances that took 1.5~2 hrs on 16 processors
NBODY: N-body interactions among particles in a system. Particles are partitioned among processors; the location field of each particle is updated.
Expectation: poor with incremental chkpnt, good with diff-based compression
MAT: FP matrix product of two square matrices (Cannon's alg.). All three matrices are partitioned into square blocks among processors; each step adds a product and passes the input submatrices.
Expectation: incremental chkpnt; very poor with diff-based compression
Testing Applications (2)
PSTSWM: nonlinear shallow water equations on a rotating sphere. The majority of pages are touched, but only a few bytes per page are modified.
Expectation: poor with incremental chkpnt, good with diff-based compression
CELL: parallel cellular automaton simulation program; two (sparse) grids of cellular automata (current/next).
Expectation: poor with incremental chkpnt, good with compression
PCG: solves Ax=b for a large, sparse matrix, first converted to a small, dense format.
Expectation: incremental chkpnt; very poor with diff-based compression
Diskless Checkpointing
20 Nov 2001
Disk-based vs. Diskless Chkpnt
                    Disk-based                     Diskless
Where to chkpnt?    In stable storage              In local memory
How to recover?     Restore from stable storage    Re-calculate from encodings
Remarks             Can tolerate whole-system      Cannot tolerate whole-system
                    failure; low BW to stable      failure; memory is much faster,
                    storage                        but encoding (+comm) overhead
Recalculate the lost chkpnt?
Error Detection & Correction in Digital Communication vs. Chkpnt Recovery in Diskless Chkpnt

1-bit Parity (m=1)
Digital comm: 11001011[1] (right)  11000011[1] (detectable)  11001011[0] (detectable)  11000011[0] (oops)
Chkpnt:       11001011[1] (chkpnt)  1100X011[1] (tolerable)  11001011[X] (tolerable)  1100X011[X] (intolerable)

Mirroring (m=n)
Digital comm: 11001011[11001011] (right)  11001011[11001010] (detectable)  11001011[00111100] (detectable)  11001010[11001010] (oops)
Chkpnt:       11001011[11001011] (right)  11001011[1100101X] (tolerable)  11001011[XXXXXXXX] (tolerable)  1100101X[1100101X] (intolerable)

Remarks:
- Difference: in a chkpnt system we easily know which node failed (erasures rather than errors).
- Some codings can be used to recover from errors in digital comm, too (e.g. Reed-Solomon).
Performance
Criteria Latency: time between chkpnt initiated and ready for recovery Overhead: increase in execution time with chkpnt
Applications
App     Description                                        Pattern
NBODY   N-body interactions                                Majority of pages touched,
PSTSWM  Simulation of the states on a 3-D system           but only a few bytes per
CELL    Parallel cellular automaton                        page are modified
MAT     FP matrix multiplication (Cannon's)                Only small parts are updated,
PCG     PCG for a sparse matrix                            but updated in their entirety
Implementation
BASE: no chkpnt    DISK-FORK: disk-based chkpnt w/ fork()
SIMP: simple diskless    INC: incremental diskless    FORK: forked diskless    INC-FORK: incremental, forked diskless
C-SIMP, C-INC, C-FORK, C-INC-FORK: the above w/ diff-based compression
Experiment Framework
Network of 24 Sun Sparc5 w/s connected to each other by a fast, switched Ethernet: ~ 5MB/s
Each w/s has 96MB of physical memory 38MB of local disk storage
Disks with bandwidth of 1.7MB/s are connected via Ethernet, and NFS on Ethernet achieved a bandwidth of 0.13 MB/s
Discussion
Latency: diskless has much lower latency than disk-based, which lowers the expected running time of the application in the presence of failures (small recovery time).
Overhead: comparable…
Recommendations
DISK-FORK: if chkpnts are small, or if the likelihood of wholesale system failure is high
C-FORK: if many pages are touched, but only a few bytes per page are modified
INC-FORK: if only a modest number of pages are modified
Reference
J. S. Plank, K. Li, and M. A. Puening. "Diskless Checkpointing." IEEE Transactions on Parallel and Distributed Systems, 9(10):972–986, Oct. 1998.