Advanced I/O Techniques for Efficient and Highly Available Process Crash Recovery Protocols Thesis Presentation Jason Cornwell 03/15/2011

Advanced I/O Techniques for Efficient and Highly Available Process Crash Recovery Protocols

Thesis Presentation

Jason Cornwell

03/15/2011

Agenda

• Introduction
• Challenges
• Pertinent Background
• Proposed Techniques
• Implementations
• Experimental Setup & Results
• Conclusions
• Future Work

Computing Intensive Applications

Network Centric Services

Recent Advances

Motivation & Goals

Demand for more computing power and high-bandwidth network connections

Advances in Microprocessors and Networks

Parallel Computing

Performance and Scalability

Reliability and Availability

Simplicity and Accessibility

Reliability Problems

Large numbers of CPUs, Memory Modules, Hard Disk Drives, Network Interfaces, Network Switches

Low Mean-Time-To-Failure (MTTF) and/or High Failure-In-Time (FIT)
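
As a rough worked estimate of why component count matters, assume N independent components with identical, constant failure rates. This is a simplifying assumption, and the numbers below are illustrative only; they do not come from the thesis. FIT counts failures per 10^9 device-hours.

```latex
\lambda_{\mathrm{sys}} = N \lambda_{c}, \qquad
\mathrm{MTTF}_{\mathrm{sys}} = \frac{1}{\lambda_{\mathrm{sys}}} = \frac{\mathrm{MTTF}_{c}}{N}, \qquad
\mathrm{FIT} = \frac{10^{9}}{\mathrm{MTTF}\ [\mathrm{hours}]}

% Illustrative numbers: N = 10^4 components, MTTF_c = 10^6 hours each
\mathrm{MTTF}_{\mathrm{sys}} = \frac{10^{6}\ \mathrm{h}}{10^{4}} = 100\ \mathrm{h}, \qquad
\mathrm{FIT}_{c} = \frac{10^{9}}{10^{6}} = 10^{3}, \qquad
\mathrm{FIT}_{\mathrm{sys}} = N \cdot \mathrm{FIT}_{c} = 10^{7}
```

Under these assumptions, parts that each fail about once per century still produce a system-level failure every few days.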

Classification of Failure

• Transient Failure
  – Power glitch
  – System patch and reboot
  – ECC trap
• Partial “Permanent” Failure
  – Disk failure
  – Partial network failure
• Wholesale “Permanent” Failure
  – Total hardware failure
  – Natural disaster

Availability Problems

Large numbers of Processes, Threads, Software Barriers, Busy Waiting

Temporarily Unresponsive and/or Unavailable

Possible Solutions

• Transient Failure
  – Restart/replay/resume on the same node
  – Task-migration is possible
• Permanent Partial Failure
  – Rebalance the workload on surviving nodes
  – Partial task-migration is needed
• Permanent Wholesale Failure
  – Reconfigure the applications and services
  – Massive task-migration to new platform

Checkpointing

• Common feature in high-performance computing (HPC) platforms

• Saves the execution state

• Application or system-level

• Mechanism for task migration

Application vs System Level

• Application-level Recovery Point
  – Developed specifically for each application
  – Generally smaller footprint
  – Data accessibility restrictions
• Kernel-level Recovery Point
  – Snapshot processes
  – Full resource restoration
  – Flexibility due to system-level preemption

Berkeley Lab Checkpoint/Restart (BLCR)

• System-level

• Kernel-module

• Checkpoint creation implemented

• Process recovery implemented

• Linked to BLCR libraries at execution

• Stores checkpoint data locally (stack, heap, registers, signals, etc.)

Contribution

• Enhanced BLCR performance through a latency-tolerant technique

• Increased BLCR availability through a novel checkpoint creation technique

I/O Optimization

• Avoided extreme modification to BLCR

• Reduced the disk latency of checkpoint creation

• Implemented a caching technique

• Improved I/O performance 4-fold or more

• System overhead less than 300 KB in experimental test results

Checkpoint Caching

• Buffer used as temporary storage

• Buffered data is flushed to storage in large blocks

• Trade-off between resource consumption and improved I/O efficiency

cr_copy(chkptData, count)
    if (chkptBuf is NULL)
        kmalloc count bytes for chkptBuf;
        copy chkptData into chkptBuf;
    else
        kmalloc (chkptBuf size + count) bytes for tempBuf;
        copy chkptBuf into tempBuf;
        append chkptData to tempBuf;
        krealloc chkptBuf to its expanded size;
        memmove tempBuf into chkptBuf;
        kfree tempBuf;
    end if
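
The accumulation step above can be modeled in user space. This is a minimal sketch, not the BLCR kernel code: the class name `CheckpointBuffer` and the `take` method are hypothetical, and a Python `bytearray` stands in for the `kmalloc`/`krealloc` bookkeeping.

```python
class CheckpointBuffer:
    """Hypothetical user-space stand-in for the kernel-side chkptBuf."""

    def __init__(self):
        self._buf = None  # mirrors "chkptBuf is NULL" before the first copy

    def cr_copy(self, chkpt_data):
        """Append one piece of checkpoint data to the in-memory buffer."""
        if self._buf is None:
            # First call: allocate the buffer and copy the data in.
            self._buf = bytearray(chkpt_data)
        else:
            # Later calls: grow the buffer and append the new data at the end.
            self._buf.extend(chkpt_data)

    def __len__(self):
        return 0 if self._buf is None else len(self._buf)

    def take(self):
        """Return the accumulated bytes and release the buffer."""
        data = bytes(self._buf or b"")
        self._buf = None
        return data
```

Growing the buffer in place avoids the temporary copy in the pseudocode; whether that also suits the kernel implementation is not something the slides settle.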

Optimized Write Operation

Remote Checkpoint

• BLCR is limited to local disk storage

• Remote checkpoint offers off-site storage option

• Uses sockets to transmit data

• Needs predefined destination

• Outperforms BLCR in some experimental tests

Remote Checkpoint Server

• Single-threaded daemon
• Compiled with GCC
• Stores the recovery point external to the client node
• Could be ported to a Microsoft derivative

while (true)
    create socket;
    bind to address;
    listen for incoming connections;
    wait for client to connect;
    create file descriptor;
    while (buffered data is received)
        write checkpoint data;
    close file descriptor;
    close socket;
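
The server loop above might look roughly like this in user space. It is a sketch under assumptions: the names `receive_checkpoint`, `serve` and `loopback_demo`, the 64 KB receive size and the single output path are all hypothetical. Unlike the pseudocode, it creates the listening socket once rather than on every iteration.

```python
import io
import socket

CHUNK = 64 * 1024  # assumed receive size; the slides do not specify one


def receive_checkpoint(conn, out):
    """Copy everything the client sends on conn into out; return the byte count."""
    total = 0
    while True:
        data = conn.recv(CHUNK)
        if not data:  # empty read: the client closed the connection
            break
        out.write(data)
        total += len(data)
    return total


def serve(host, port, path):
    """Single-threaded loop: accept one client at a time and store its checkpoint."""
    with socket.create_server((host, port)) as listener:
        while True:
            conn, _addr = listener.accept()
            # Each new checkpoint overwrites the file at path in this sketch.
            with conn, open(path, "wb") as out:
                receive_checkpoint(conn, out)


def loopback_demo(payload):
    """Push a small payload through a connected socket pair and return what was stored."""
    client, server_side = socket.socketpair()
    with client:
        client.sendall(payload)  # small payloads fit in the socket buffer
    sink = io.BytesIO()
    with server_side:
        receive_checkpoint(server_side, sink)
    return sink.getvalue()
```

`loopback_demo(b"registers+stack")` exercises the receive loop without opening a network port; `serve` is the long-running variant and never returns.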

Modified Write Operation

• TCP packets
• MTU must be reached before delivery
• Only modification is to the write operation of BLCR

if (remote chkpt)
    if (socket is NULL)
        create socket;
        establish connection; if the handshake fails, break and perform the original_chkpt;
    end if
    package checkpoint data;
    send data message;
end if
if (original_chkpt)
    original BLCR write operation;
end if
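
The fallback logic above can be sketched in user space as follows. This is an illustration only: `CheckpointWriter`, its `connect` parameter and the placeholder address `192.0.2.10:9000` are hypothetical, and a file-like object stands in for the original BLCR write operation.

```python
import io
import socket


class CheckpointWriter:
    """Hypothetical user-space model of the modified write path."""

    def __init__(self, local_file, remote_addr=None,
                 connect=socket.create_connection, timeout=2.0):
        self.local_file = local_file    # stands in for the original local write
        self.remote_addr = remote_addr  # predefined (host, port), or None for local only
        self.connect = connect          # injectable so the fallback can be exercised
        self.timeout = timeout
        self.sock = None

    def write(self, chkpt_data):
        """Send data to the remote server if possible, otherwise write locally."""
        if self.remote_addr is not None:
            if self.sock is None:
                try:
                    self.sock = self.connect(self.remote_addr, timeout=self.timeout)
                except OSError:
                    self.remote_addr = None  # handshake failed: stay local from now on
            if self.sock is not None:
                self.sock.sendall(chkpt_data)
                return "remote"
        self.local_file.write(chkpt_data)    # original write operation
        return "local"


def _refuse(address, timeout=None):
    """Simulates an unreachable checkpoint server."""
    raise OSError("no checkpoint server reachable")


local = io.BytesIO()
writer = CheckpointWriter(local, ("192.0.2.10", 9000), connect=_refuse)
mode = writer.write(b"registers")  # connection fails, so the data is stored locally
```

As in the pseudocode, the fallback is decided only at connection time; a failure in the middle of a transfer is not handled here.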

Design

I/O Optimization Write

write(chkptData, count)
    if (chkptBuf has space for the incoming chkptData)
        cr_copy(chkptData, count);
    else
        vfs_write(chkptBuf);
        vfs_write(chkptData);
        kfree(chkptBuf);
    end if
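
The buffered write above can be sketched in user space like this. It is a sketch under assumptions: `BufferedCheckpointWriter`, its `capacity` limit and the `flushes` counter are hypothetical names, and a file-like `sink` stands in for `vfs_write`. The final `close` flush is an addition the pseudocode does not show.

```python
import io


class BufferedCheckpointWriter:
    """Hypothetical model: collect small writes, then issue fewer, larger ones."""

    def __init__(self, sink, capacity):
        self.sink = sink          # file-like object standing in for vfs_write
        self.capacity = capacity  # assumed buffer limit in bytes
        self.buf = bytearray()
        self.flushes = 0          # number of times data was pushed to the sink

    def write(self, chkpt_data):
        if len(self.buf) + len(chkpt_data) <= self.capacity:
            # Buffer has space: cache the data instead of touching the disk.
            self.buf.extend(chkpt_data)
        else:
            # Buffer is full: write the cached data, then the new data, and reset.
            if self.buf:
                self.sink.write(bytes(self.buf))
            self.sink.write(chkpt_data)
            self.buf.clear()
            self.flushes += 1

    def close(self):
        """Write out whatever is still cached at the end of the checkpoint."""
        if self.buf:
            self.sink.write(bytes(self.buf))
            self.buf.clear()
            self.flushes += 1


sink = io.BytesIO()
writer = BufferedCheckpointWriter(sink, capacity=8)
writer.write(b"abc")   # cached
writer.write(b"def")   # cached (6 of 8 bytes used)
writer.write(b"ghij")  # would overflow: cached data and new data go to the sink
writer.close()         # nothing left to flush
```

A larger `capacity` means fewer writes to the sink at the cost of more memory, which is the trade-off named on the Checkpoint Caching slide.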

Remote Checkpoint Write

Experimental Setup

I/O Optimization

• Dell Workstation, 3.06 GHz Intel Pentium 4, 1 GB Memory, 5,400 RPM Hard Disk, Linux 2.6

• BLCR Implementation
• Optimized BLCR (O-BLCR) Implementation

Remote Checkpoint

• Dell PowerEdge 700, 2.80 GHz Dual-processor Intel Pentium 4, 3 GB Memory, 5,400 RPM Hard Disk, Linux 2.6

• Dell Workstation, 3.06 GHz Intel Pentium 4, 1 GB Memory, 5,400 RPM Hard Disk, Linux 2.6

• BLCR Implementation
• BLCR with NFS (BLCR+NFS)
• BLCR with our Remote Checkpoint Technique (BLCR+R)

Benchmarks

Program

• NP-Complete
• Data Encryption
• Linear Equation Solver
• File Compression

Resource Utilization

Benchmark   CPU      Memory   I/O
TSP         High     Low      Low
AES         High     Low      Medium
GE          Low      High     High
HC          Medium   Medium   Medium

I/O Optimization Results

Remote Checkpoint Results

Conclusion

• Minimal modification to BLCR

• The I/O optimization technique reduced the write latency of BLCR

• Remote checkpoint increases BLCR availability by adding a new feature

• These techniques should be integrated into the core of the BLCR source code

Future Work

• Server authentication protocol

• Data packet encryption

• Automated process load balancing

Questions