
Distributed Replication Mechanism for Building Fault Tolerant System with Distributed Checkpoint Mechanism

Hideaki Hirayama
Information & Communication Systems Laboratory, Toshiba Corporation, Ome, Japan 198-0025

Toshio Shirakihara
Communication and Information Systems Research Laboratories, Research and Development Center, Toshiba Corporation, Kawasaki, Japan 210-0901

Tatsunori Kanai
Information & Communication Systems Laboratory, Toshiba Corporation, Ome, Japan 198-0025

Systems and Computers in Japan, Vol. 31, No. 5, 2000. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J82-D-I, No. 3, March 1999, pp. 496–507. © 2000 Scripta Technica.

SUMMARY

A distributed fault tolerant middleware system called "ARTEMIS" (Advanced Reliable disTributed Environment MIddleware System) was developed for the purpose of building fault tolerant systems without modifying either the source code or the binary code of application programs in open systems. In ARTEMIS, we proposed the Distributed Replication mechanism for building fault tolerant systems with the Distributed Checkpoint mechanism. In the Distributed Replication mechanism, a server computer is configured with a primary server computer and a backup server computer. If a failure occurs, it recovers the system in a few seconds to tens of seconds. Its overhead is 10% to 20% in TPC-C benchmark tests. © 2000 Scripta Technica, Syst Comp Jpn, 31(5): 12–23, 2000

Key words: Checkpoint; rollback; reliability; availability; fault tolerance.

1. Introduction

When we build an enterprise information system with open systems, we must pay attention to its reliability. In particular, a failure in a server computer may take the whole system down. Therefore, in many mission-critical systems, a server computer is configured as a High-Availability system [1, 2] with a primary server and a backup server, so that if the primary server computer goes down, its functions are taken over by the backup server computer. In a database management system, however, journal recovery must be executed when the primary server computer goes down and the backup server computer takes over the processes which were being executed in the primary server computer. The processes cannot be taken over quickly because the journal recovery takes a few minutes to tens of minutes.

Fault tolerant computers [3, 4] have traditionally been used to solve these problems. But it is impossible to develop traditional fault tolerant computers in open systems because of cost and standardization considerations. We therefore proposed a middleware system "ARTEMIS" (Advanced Reliable disTributed Environment MIddleware System) [5–7] which builds a new type of fault tolerant system suited to open systems. ARTEMIS takes checkpoints of processes which need higher reliability. If a failure occurs, ARTEMIS restarts the processes from the checkpoints. This technique is known as the Distributed Checkpoint mechanism [8].

ARTEMIS assumes that a High-Availability system detects a failure, and proposes the following mechanisms for building fault tolerant systems in open systems:

• the Distributed Checkpoint mechanism, which supports not only message communications but also interprocess communications
• the Distributed Replication mechanism, which builds fault tolerant systems with duplicated server computers
• the Jacket Routine mechanism, which adds the above mechanisms without updating the source code or binary code of application programs

This paper presents both the mechanism and the evaluation of Distributed Replication, which builds fault tolerant systems with duplicated server computers.

2. System Architecture of ARTEMIS

2.1. Checkpoint and rollback mechanism

The basic mechanism of ARTEMIS is a per-process checkpoint and rollback mechanism, shown in Fig. 1. Processes which should be reliable are executed under the control of ARTEMIS, and checkpoints are taken for them about every second. If a failure occurs, all the processes which are executed under the control of ARTEMIS are restarted from the last checkpoint. The mechanism described in Refs. 9 and 10 also makes the system reliable by checkpoint and rollback, but since it is a per-system checkpoint and rollback mechanism, the checkpoint holds the whole system image, including the OS.

Fig. 1. Principle of fault recovery mechanism of ARTEMIS.

The checkpoints in ARTEMIS hold the following state:

• the address space image and processor context of the process
• the services provided by the OS to the process

The address space image is saved by the incremental checkpoint mechanism [11]. In this mechanism, all the pages of the address space, including writable pages, are write-protected. If a page fault occurs on one of these pages, ARTEMIS records the number of the modified page. When ARTEMIS takes a checkpoint, it saves only the modified pages.
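The following is a minimal sketch of this kind of page-level dirty tracking on a Unix-like system; the tracked region, its size, and the data structures are illustrative assumptions, not the actual ARTEMIS implementation.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_PAGES 4096

static char  *region;            /* address range being tracked (illustrative) */
static size_t region_len;
static long   page_size;
static int    dirty[MAX_PAGES];  /* 1 if the page was modified since the last checkpoint */

/* SIGSEGV handler: record the faulting page and make it writable again. */
static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)si->si_addr;
    if (addr >= region && addr < region + region_len) {
        size_t page = (size_t)(addr - region) / page_size;
        dirty[page] = 1;
        mprotect(region + page * page_size, page_size, PROT_READ | PROT_WRITE);
    }
}

/* Write-protect the whole region so that the next write to any page faults. */
static void start_tracking(void)
{
    memset(dirty, 0, sizeof(dirty));
    mprotect(region, region_len, PROT_READ);
}

/* At a checkpoint, save only the pages marked dirty, then re-protect them. */
static void take_checkpoint(FILE *out)
{
    for (size_t p = 0; p < region_len / page_size; p++) {
        if (dirty[p]) {
            fwrite(&p, sizeof(p), 1, out);                      /* page number   */
            fwrite(region + p * page_size, page_size, 1, out);  /* page contents */
        }
    }
    start_tracking();  /* begin the next incremental interval */
}

int main(void)
{
    page_size  = sysconf(_SC_PAGESIZE);
    region_len = 64 * page_size;
    region = mmap(NULL, region_len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = fault_handler;
    sigaction(SIGSEGV, &sa, NULL);

    start_tracking();
    region[0] = 'x';   /* faults once; page 0 is recorded as dirty */

    FILE *out = fopen("ckpt.bin", "wb");
    take_checkpoint(out);
    fclose(out);
    return 0;
}
```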

The services provided by the OS are saved as information by hooking the system calls issued by the processes. If a failure occurs, the services provided by the OS to the process are recovered by reissuing the system calls based on that information. As an example, for file operations, the files which are open when the checkpoint is taken are reopened with the same file descriptors, and the file pointers are set to their positions at the time of the checkpoint. How the data in the files are recovered is presented in Section 3.

Semaphores and shared memory segments cannot be reacquired with the same IDs. Therefore, ARTEMIS treats the IDs of semaphores and shared memory segments that application programs access as virtual IDs, and treats the actual IDs as physical IDs. The virtual IDs and the physical IDs have a one-to-one relation. If a failure occurs and ARTEMIS restarts the processes, ARTEMIS records the newly acquired IDs as the physical IDs. The application programs access semaphores and shared memory segments through the virtual IDs, which are the same as before the failure occurred. ARTEMIS hooks the system calls issued by the processes and converts the virtual IDs to the physical IDs. Therefore, although the IDs of semaphores and shared memory segments change when they are reacquired as ARTEMIS restarts the processes, the processes can keep using the same IDs as before the failure occurred.
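The mapping can be pictured as a small translation table consulted by the hooked system calls. The following sketch shows the idea for shared memory segments; the table layout and the function names are assumptions for illustration, not the ARTEMIS interfaces.

```c
#include <sys/ipc.h>
#include <sys/shm.h>

#define MAX_IDS 256

/* One-to-one mapping between the virtual ID seen by the application
 * and the physical ID currently assigned by the OS (illustrative). */
static int virt_to_phys[MAX_IDS];
static int next_virt = 0;

/* Hooked shmget(): acquire the segment, return a stable virtual ID
 * to the application, and remember the mapping. */
int jacket_shmget(key_t key, size_t size, int shmflg)
{
    int phys = shmget(key, size, shmflg);
    if (phys < 0)
        return phys;
    int virt = next_virt++;
    virt_to_phys[virt] = phys;
    return virt;
}

/* Hooked shmat(): translate the virtual ID before calling the OS. */
void *jacket_shmat(int virt, const void *addr, int shmflg)
{
    return shmat(virt_to_phys[virt], addr, shmflg);
}

/* After a rollback the segment is recreated with a new physical ID;
 * only the table entry changes, the application keeps its virtual ID. */
void rebind_after_restart(int virt, int new_phys)
{
    virt_to_phys[virt] = new_phys;
}
```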

2.2. Jacket routine

ARTEMIS provides fault recovery functions based on the checkpoint and rollback mechanism to application programs without modifying the source code or binary code of those programs. To provide such functions, ARTEMIS hooks the system calls which are called by the application processes. We call this mechanism the "Jacket Routine." ARTEMIS has library functions which have the same names as the hooked system calls. These functions collect the information needed for fault recovery and then call the actual system functions, and they are linked in place of the actual system functions. Thus, when application processes make system calls, the jacket routines are called; they collect the information for fault recovery and then call the actual system functions.

Figure 2(1) shows how the standard functions are called normally, and Fig. 2(2) shows how they are called under the control of ARTEMIS. As an example, the standard functions are stored in the library LIBC. From the viewpoint of accessing the kernel, the standard functions are subdivided into the following three types:

• SYS1
• LIB1
• LIB2

SYS1 includes system call functions which issue a trap instruction to access the kernel; open is an example of this type of function. LIB1 includes simple functions which execute internally and do not access the kernel; strcmp is an example. LIB2 includes indirect system call functions which call SYS1-type functions; fopen is an example.

Under the control of ARTEMIS, the jacket routine library functions are linked instead of the LIBC functions. Thus, SYS1-type functions are referenced from the jacket routine library, LIB1-type functions are referenced from the standard library, and LIB2-type functions are referenced from the standard library, but the functions they indirectly call are referenced from the jacket routine library.

Fig. 2. Mechanism of jacket routine.
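As an illustration of a jacket routine, the following sketch interposes on the open() system call using the common symbol-interposition idiom: it records information that would later allow the file to be reopened after a rollback, and then forwards the call to the real function. The recorder and its output format are assumptions; this is not the ARTEMIS jacket library itself.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical recorder: in ARTEMIS this kind of information lets the
 * file be reopened with the same descriptor after a rollback. */
static void record_open(const char *path, int flags, int fd)
{
    fprintf(stderr, "[jacket] open(\"%s\", %#x) -> fd %d\n", path, flags, fd);
}

/* Same name as the system call, so it is linked in place of the LIBC one. */
int open(const char *path, int flags, ...)
{
    mode_t mode = 0;
    if (flags & O_CREAT) {             /* the mode argument is only present with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }

    /* Locate the next definition of open(), i.e., the real one in LIBC. */
    int (*real_open)(const char *, int, ...) =
        (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    int fd = real_open(path, flags, mode);
    if (fd >= 0)
        record_open(path, flags, fd);  /* collect information for fault recovery */
    return fd;
}
```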

2.3. Distributed checkpoint

In ARTEMIS, we extend the per-process checkpoint and rollback mechanism to support a whole distributed system. Therefore, ARTEMIS takes checkpoints of multiple processes on multiple computers.

We now consider how to take checkpoints of multiple processes which cooperate with each other through communication functions such as the socket interface. If multiple processes cooperate via connection-oriented communication such as TCP, there is a problem in taking checkpoints. Consider the following situation.

(1) A sender process is checkpointed just after it sends data to a receiver process.
(2) The receiver process is checkpointed just before it receives the data.
(3) A failure occurs and the processes are restarted from the checkpoints.

In this case, the sender process had already sent the data, but the receiver process had not yet received it. Thus, the data are lost.

Next we consider the following situation.

(1) A sender process is checkpointed just before it sends data to a receiver process.
(2) The receiver process is checkpointed just after it receives the data.
(3) A failure occurs and the processes are restarted from the checkpoints.

In this case, the sender process had not yet sent the data, but the receiver process had already received it. The sender process sends the data again and the receiver process receives the same data again. Thus, the data are duplicated.

To solve this problem, ARTEMIS takes checkpoints of processes which cooperate via connection-oriented communication such as TCP only when the receiver processes have received all the data sent by the sender processes. Figure 3 shows the protocol. In the first phase, the processes are prohibited from sending data and report the amount of data they have already sent to the receiver processes. In the second phase, the receiver processes receive that amount of data, and then all the processes take checkpoints simultaneously. To arbitrate the processes, a coordination daemon runs in each computer.

On the other hand, in connectionless communication such as UDP, loss of transmission data and duplicated transmission are permitted, so it is not necessary to coordinate the processes with the above protocol. Distributed Checkpoint protocols are subdivided into coordinated checkpoint and autonomous checkpoint types [8]. In ARTEMIS, we have proposed a two-phase protocol, which is one of the coordinated checkpoint protocols.
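The per-connection bookkeeping behind the two phases can be sketched as follows. The counters are assumed to be maintained by the jacket routines around send() and recv(), and the sent-byte counts are assumed to be exchanged through the coordination daemons; the names and interfaces are illustrative, not the ARTEMIS implementation.

```c
#include <sys/socket.h>
#include <sys/types.h>

typedef struct {
    int  sock;            /* TCP connection to the peer process          */
    long bytes_sent;      /* maintained by the jacket routine for send() */
    long bytes_received;  /* maintained by the jacket routine for recv() */
    int  sending_blocked; /* set during the first phase                  */
} channel_t;

/* Phase 1, sender side: stop sending new data and report to the local
 * coordination daemon how many bytes were sent on this connection. */
long phase1_report(channel_t *ch)
{
    ch->sending_blocked = 1;  /* the send() jacket refuses further sends */
    return ch->bytes_sent;    /* forwarded to the receiver's daemon      */
}

/* Phase 2, receiver side: 'reported' is the sender's count delivered by
 * the coordination daemons; drain the connection until nothing is in
 * flight, after which all processes can checkpoint simultaneously. */
int phase2_drain(channel_t *ch, long reported)
{
    char buf[4096];
    while (ch->bytes_received < reported) {
        ssize_t n = recv(ch->sock, buf, sizeof(buf), 0);
        if (n <= 0)
            return -1;        /* connection failed during the protocol */
        ch->bytes_received += n;
        /* a real system would buffer this data for the application */
    }
    return 0;                 /* channel is empty: safe to checkpoint */
}
```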

3. Distributed Replication Mechanism

3.1. Building fault tolerant system

The Distributed Checkpoint mechanism provides the basic functions for recovering processes from a failure, but by itself it cannot increase availability in real-world systems. The following failures exist in real-world computer systems:

• process abort
• OS panic
• hardware failure

The traditional Distributed Checkpoint mechanism handles only process aborts. Of course, if the checkpoints are saved in nonvolatile memory, then when the system crashes owing to an OS panic or hardware failure, the processes can be restarted after the system is rebooted. But the system cannot be recovered immediately, and thus it is not possible to build fault-tolerant systems in this way.

In ARTEMIS, we propose the Distributed Replication mechanism for the purpose of building fault-tolerant systems with the Distributed Checkpoint mechanism. In this mechanism, a server computer is configured with a primary server computer and a backup server computer. ARTEMIS sends the checkpoints taken in the primary server computer to the backup server computer, and updates the files in the backup server computer as well as the files in the primary server computer. Thus, the processes can be restarted in both the primary server computer and the backup server computer.

This Distributed Replication mechanism is not provided by other systems.

3.2. Mechanism of distributed replication

Figure 4 shows the basic mechanism of Distributed Replication. In the Distributed Replication mechanism, ARTEMIS hooks the file update operations executed by the processes in the primary server computer and collects the file update information (1). For a write operation, the file update information includes the following:

• file descriptor
• file seek pointer
• data length
• write data

The file update information is buffered in the primary server computer, and the data in the buffer are sent to the backup server computer when the buffer is filled (2). The file update information sent to the backup server computer is simply linked to the "nondeterministic queue"; the corresponding file operations are not executed in the backup server computer at this point. When a checkpoint is taken, the file update information still buffered in the primary server computer is flushed to the backup server computer. In the backup server computer, the file update information linked to the nondeterministic queue is then moved to the "deterministic queue" (3). The file operations based on the file update information linked to the deterministic queue are executed in the backup server computer after the next checkpoint is taken (4). Therefore, under the control of the Distributed Replication mechanism, the file update operations executed in the primary server computer are executed in the backup server computer after the checkpoint is taken.

Fig. 3. Protocol of distributed checkpoint.

Fig. 4. Mechanism of distributed replication.

Figure 5 shows how the files in the backup server computer are updated identically to those in the primary server computer. First, file update operations A, B, and C are executed in the primary server computer. Then checkpoint CP1 is taken, and file update information A, B, and C is sent to the backup server computer and held there. After checkpoint CP1, the file operations based on file update information A, B, and C are executed in the backup server computer. Then file update operations D and E are executed in the primary server computer. Then checkpoint CP2 is taken, and file update information D and E is sent to the backup server computer and held there. After checkpoint CP2, the file operations based on file update information D and E are executed in the backup server computer.
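A minimal sketch of the backup-side data structures is given below: a record for one file update and the two queues, with the promotion that happens when a checkpoint is taken and the deferred application of the promoted updates. Field names, the queue layout, and the helper functions are assumptions for illustration.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct update {
    int            fd;       /* descriptor of the backup's copy of the file */
    off_t          offset;   /* file seek pointer                           */
    size_t         len;      /* data length                                 */
    char          *data;     /* write data                                  */
    struct update *next;
} update_t;

typedef struct { update_t *head, *tail; } queue_t;

static queue_t nondeterministic;  /* updates made after the last checkpoint  */
static queue_t deterministic;     /* updates made before the last checkpoint */

static void enqueue(queue_t *q, update_t *u)
{
    u->next = NULL;
    if (q->tail) q->tail->next = u; else q->head = u;
    q->tail = u;
}

/* Backup side: file update information arriving from the primary is only
 * queued, not applied, until the next checkpoint completes. */
void on_update_received(int fd, off_t off, const char *buf, size_t len)
{
    update_t *u = malloc(sizeof(*u));
    u->fd = fd; u->offset = off; u->len = len;
    u->data = malloc(len);
    memcpy(u->data, buf, len);
    enqueue(&nondeterministic, u);
}

/* (3) When a checkpoint is taken (after the primary has flushed its buffer),
 * promote everything received so far to the deterministic queue. */
void on_checkpoint_taken(void)
{
    if (!nondeterministic.head)
        return;
    if (deterministic.tail)
        deterministic.tail->next = nondeterministic.head;
    else
        deterministic.head = nondeterministic.head;
    deterministic.tail = nondeterministic.tail;
    nondeterministic.head = nondeterministic.tail = NULL;
}

/* (4) After the checkpoint completes, apply the promoted updates to the
 * backup copies of the files. */
void apply_deterministic(void)
{
    for (update_t *u = deterministic.head; u; ) {
        pwrite(u->fd, u->data, u->len, u->offset);
        update_t *next = u->next;
        free(u->data); free(u);
        u = next;
    }
    deterministic.head = deterministic.tail = NULL;
}
```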

3.3. Recovery from process abort

In ARTEMIS, if a process running in the primary server computer aborts, all processes are restarted from the checkpoint on the same computers. That is, processes which were running in the primary server computer are restarted on the same primary server computer. Thus, before restarting the processes, it is necessary to roll back the files to their states at the checkpoint in both the primary server computer and the backup server computer.

Figure 6 shows the mechanism used to roll back the files to the checkpoint in both the primary server computer and the backup server computer when a process aborts. When a process running in the primary server computer aborts, ARTEMIS terminates all processes and flushes the file update information remaining in the buffer of the primary server computer to the backup server computer (1). In the backup server computer, ARTEMIS executes the file operations based on the file update information linked to the deterministic queue and updates the files in the backup server computer (2), because that information describes operations executed before the checkpoint. On the other hand, ARTEMIS does not execute the file operations based on the file update information linked to the nondeterministic queue, because that information describes operations executed after the checkpoint. By these operations, the files in the backup server computer are rolled back to the checkpoint.

But the files in the primary server computer were updated after the checkpoint. The corresponding file update information is linked to the nondeterministic queue in the backup server computer, and the "before" image data still exist in the files in the backup server computer. Thus, ARTEMIS builds file recovery information from the file update information linked to the nondeterministic queue and the "before" image data in the files in the backup server computer (3). As an example, if the operation is a file write, ARTEMIS reads the "before" image data from the files in the backup server computer instead of executing the file update operation. ARTEMIS then sends the file recovery information to the primary server computer and executes the file operations based on the file recovery information there (4). Thus, the files in the primary server computer are restored to their states at the checkpoint.

Fig. 5. File update under control of distributed replication mechanism.

Fig. 6. Recovery mechanism from process abort.

Figure 7 shows how the files in the primary and backup server computers are restored to the states at the checkpoint by the Distributed Replication mechanism when a process aborts. First, file update operations A, B, and C are executed in the primary server computer. Then checkpoint CP1 is taken; up to this time, the file update information A, B, and C has been sent to the backup server computer. After the checkpoint, file update operations A, B, and C are executed in the backup server computer, while file update operations D and E are executed in the primary server computer. After that, a process which was being executed in the primary server computer aborts. In this case, all processes are restarted from the checkpoint on the same server computers. At this time, if file update information D and E remains in the primary server computer, it is flushed to the backup server computer. Then the file update operations A, B, and C from before the checkpoint are executed in the backup server computer, and the "before" image data are read from the files in the backup server computer based on the file update information linked to the nondeterministic queue. Finally, the files in the primary server computer are recovered using this file recovery information.
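The construction of the file recovery information from a queued write record can be sketched as follows; the record layout mirrors the file update record sketched in Section 3.2, and the function names are assumptions.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct {      /* a not-yet-applied write received from the primary */
    int    fd;        /* descriptor of the backup's copy of the file       */
    off_t  offset;
    size_t len;
} write_record_t;

typedef struct {      /* file recovery information sent back to the primary */
    off_t  offset;
    size_t len;
    char  *before;    /* data the file held at the checkpoint               */
} recovery_t;

/* Backup side (3): the backup file was never updated with this record, so it
 * still holds the "before" image of the region the primary overwrote. */
recovery_t *build_recovery(const write_record_t *w)
{
    recovery_t *r = malloc(sizeof(*r));
    r->offset = w->offset;
    r->len = w->len;
    r->before = malloc(w->len);
    pread(w->fd, r->before, w->len, w->offset);
    return r;
}

/* Primary side (4): undo the post-checkpoint write by writing the "before"
 * image back over the same region of the primary's file. */
void apply_recovery(int primary_fd, const recovery_t *r)
{
    pwrite(primary_fd, r->before, r->len, r->offset);
}
```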

3.4. Recovery from system crash

If the primary server computer goes down as a result of an OS panic or hardware failure, ARTEMIS restarts the processes which were running in the primary server computer from the checkpoint in the backup server computer. At this time, before the processes are restarted, the files in the backup server computer are restored to their states at the checkpoint.

Figure 8 shows how ARTEMIS restores the files in the backup server computer to the states at the checkpoint when the primary server computer goes down. When the primary server computer goes down, the processes which were running in the primary server computer are restarted in the backup server computer. Before the processes are restarted, the files in the backup server computer are restored to the states at the checkpoint. ARTEMIS executes the file operations based on the file update information linked to the deterministic queue in the backup server computer (1), because that information describes file operations which were executed before the checkpoint. On the other hand, the file update information linked to the nondeterministic queue is deleted instead of being executed (2), because it describes file operations which were executed after the checkpoint. Thus, the files in the backup server computer are restored to the states of the checkpoint when the primary server computer goes down.

Fig. 7. File recovery from process abort.

Fig. 8. Recovery mechanism from system abort.

Fig. 9. File recovery from system down.

Figure 9 shows how the files in the backup server computer are restored to the states at the checkpoint by the Distributed Replication mechanism when the primary server goes down. First, the file update operations A, B, and C are executed in the primary server computer. Then checkpoint CP1 is taken, and file update information A, B, and C is sent to the backup server computer. After checkpoint CP1, the file update operations based on file update information A, B, and C are executed in the backup server computer, while file update operations D and E are executed in the primary server computer. After that, the primary server computer goes down. In this case, the processes which were running in the primary server computer are restarted in the backup server computer. Before restarting the processes, the file update operations based on file update information A, B, and C linked to the deterministic queue, which were executed before checkpoint CP1 in the primary server computer, are executed in the backup server computer, and file update information D and E linked to the nondeterministic queue is deleted instead of being executed.

In this way, the Distributed Replication mechanism hooks the file update operations executed in the primary server computer and obtains the file update information. Based on this information, it updates the files in the backup server computer identically to those in the primary server computer, and it restores the files in the backup server computer to their states at the checkpoint if the primary server computer goes down. So far we have discussed only write operations, but other file update operations, such as file creation, deletion, opening, and closing, are treated in the same way.

3.5. Reduplication

When the primary server computer goes down and the backup server computer takes over processing, the server computer is no longer duplicated. In this nonduplicated configuration, ARTEMIS cannot recover the system from a further failure. The Distributed Replication mechanism therefore provides a function to bring a new backup server computer into the system after the failed computer is rebooted. To configure the system as a duplicated system again, ARTEMIS saves the information about the OS services after the primary server computer goes down and the backup server computer takes over processing, but the file update information is not saved, because it is too large to hold. When ARTEMIS starts reduplication, it resumes saving the file update information and then copies the following files from the primary server to the backup server:

• files which are currently open in write mode
• files which were opened but are now closed

When the copying is completed, ARTEMIS takes a checkpoint. If the files are updated while they are being copied, there is no problem, because that file update information is saved and transferred to the backup server computer at the first checkpoint after reduplication, and the file operations based on the file update information are executed in the backup server computer after the next checkpoint. Thus, it is possible to reduplicate the server computer without stopping the system.
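The reduplication sequence can be outlined as follows; the helper functions are stubs standing in for ARTEMIS internals and are purely illustrative.

```c
#include <stdio.h>

/* Stub helpers standing in for ARTEMIS internals (assumptions). */
static void start_saving_updates(void)          { printf("resume collecting file update information\n"); }
static void copy_file_to_backup(const char *p)  { printf("copy %s to backup\n", p); }
static void take_checkpoint(void)               { printf("take checkpoint\n"); }

/* Illustrative reduplication sequence once the repaired computer rejoins. */
void reduplicate(const char **open_write_files, int n_open,
                 const char **closed_files, int n_closed)
{
    start_saving_updates();                        /* step 1: log updates again      */

    for (int i = 0; i < n_open; i++)               /* step 2: copy files open in     */
        copy_file_to_backup(open_write_files[i]);  /*         write mode ...         */
    for (int i = 0; i < n_closed; i++)             /*         ... and files that     */
        copy_file_to_backup(closed_files[i]);      /*         were opened earlier    */

    take_checkpoint();                             /* step 3: first checkpoint makes */
                                                   /* the new backup consistent      */
}
```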

4. Implementation and Evaluation

4.1. Implementation

ARTEMIS has been implemented on Solaris, and many application programs, such as the Oracle DBMS, have been run under the control of ARTEMIS. We confirmed that the recovery functions were executed correctly. When the checkpoint interval is set to 500 ms, it takes two or three seconds to restart the processes from the checkpoint. Considering the mechanism, recovering from a failure caused by a process abort should take longer than recovering from a failure caused by a system down, but there is no visible difference; the difference is too small to measure.

In practice, it takes longer to detect a system down than to recover from it: a process abort is detected quickly, but detecting a system down takes a few seconds to a few tens of seconds. Thus, ARTEMIS can recover the Oracle DBMS in a few seconds, but the total recovery time, including the time needed to detect the system down, ranges from a few seconds to a few tens of seconds.

Figure 10 shows a three-tier application program executed under the control of ARTEMIS. In each computer, a coordination daemon is running to coordinate the taking of checkpoints with the other computers. In the server computers, the Distributed Replication daemons are running to replicate the files between the primary server computer and the backup server computer and to recover the files when a failure occurs.

Fig. 10. Three-tier client-server-type application program under control of ARTEMIS.


4.2. Measurement

ARTEMIS provides fault recovery without modifying application programs, but there is an overhead when application programs are executed under the control of ARTEMIS. We evaluated this overhead with the TPC-C benchmark [12].

We ran the Oracle DBMS on server computers with 40-MHz SuperSPARC processors running Solaris 2.4. The primary server computer and the backup server computer were connected via a 100-Mbps Ethernet, and the checkpoint interval was 500 ms. Table 1 shows the checkpoint time and the data size.

The first checkpoint phase takes 38 ms, the second checkpoint phase takes 212 ms, and the checkpoint transfer takes 343 ms. The first checkpoint phase and the checkpoint transfer are executed concurrently with normal processing, but the second checkpoint phase cannot be executed concurrently with normal processing. The next checkpoint starts 500 ms after the completion of the checkpoint transfer.

In the checkpoint transfer, ARTEMIS transfers the checkpoint and the file update information from the primary server computer to the backup server computer. The transfer data size of the data and stack is 325 KB, that of the shared memory is 24 KB, and that of the file update information is less than 1 KB, so the total data size is 349 KB.

Table 2 shows the checkpoint time and data size per tpmC (TPC-C transactions per minute) when executing the TPC-C benchmark.

In the TPC-C benchmark, when the load increases by 1 tpmC, the first checkpoint phase time increases by 0.2 ms, the second checkpoint phase time by 3.2 ms, and the checkpoint transfer time by 1.3 ms, so the total checkpoint time increases by 4.7 ms. The transfer data size of the data and stack increases by 2.3 KB, that of the shared memory by 10.6 KB, and that of the file update information by 1.8 KB, so the total transfer data size increases by 14.7 KB.

The overhead of ARTEMIS is divided into two parts. The first is the CPU time for the second checkpoint phase, and the second is the CPU time needed to record the updated pages and to collect the file update information; the latter is spent concurrently with normal processing. Table 3 shows the ratio of the time spent in checkpoint processing to the time spent in normal processing. Table 4 shows the ratio of the time used by the ARTEMIS daemons in the second checkpoint phase to the time used by the jacket routines.

Tables 5, 6, 7, and 8 show the checkpoint time and the data size in the following environments, respectively:

• 40-MHz SuperSPARC, Solaris 2.4, 100-Mbps Ethernet
• 40-MHz SuperSPARC, Solaris 2.4, 10-Mbps Ethernet
• 50-MHz SuperSPARC, Solaris 2.4, 10-Mbps Ethernet
• UltraSPARC, Solaris 2.5.1, 10-Mbps Ethernet

In each environment, the following sets of processes are executed:

• No Proc: no processes are executed
• 1 Sleep Proc: one sleep process is executed
• 7 Sleep Proc: seven sleep processes are executed

Table 1. Checkpoint time and data size of Oracle

  CP#1           38 ms
  CP#2           212 ms
  CP transfer    343 ms
  CP interval    500 ms
  Data + stack   325 KB
  Shared memory  24 KB
  Files          0 KB

Table 2. Checkpoint time and data size necessary for 1 tpmC

  CP#1           0.2 ms
  CP#2           3.2 ms
  CP transfer    1.3 ms
  Data + stack   2.3 KB
  Shared memory  10.6 KB
  Files          1.8 KB

Table 3. Ratio of time used in checkpoint processing and in normal processing

                    in CP processing   in normal processing
  ARTEMIS daemon         95%                  5%
  Jacket routines        84%                 16%

Table 4. Ratio of checkpoint time used by ARTEMIS daemons and jacket routines

  ARTEMIS daemons in CP   22%
  Jacket routines in CP   78%

4.3. Evaluation

The overhead of ARTEMIS in the TPC-C benchmark is divided into the following two parts:

• Fixed Overhead (FO): overhead that does not depend on the number of transactions.
• Variable Overhead (VO): overhead that depends on the number of transactions. VO is the total of VO1, VO2, VO3, and VO4 shown in Table 9.

From Tables 1, 2, 3, and 4, FO and VO for the 40-MHz SuperSPARC are FO = 194 ms per checkpoint and VO = VO1 + VO2 + VO3 + VO4 = 3.72 ms per transaction.

Table 5. Checkpoint time and data size for the 40-MHz SuperSPARC, Solaris 2.4, 100-Mbit/s Ethernet environment

                 No Proc  1 Sleep Proc  7 Sleep Proc
  CP#1           12 ms    17 ms         32 ms
  CP#2           17 ms    49 ms         137 ms
  CP transfer    263 ms   266 ms        327 ms
  CP interval    500 ms   500 ms        500 ms
  Data + stack   0 KB     37 KB         253 KB
  Shared memory  0 KB     0 KB          0 KB
  Files          0 KB     0 KB          0 KB

Table 6. Checkpoint time and data size for the 40-MHz SuperSPARC, Solaris 2.4, 10-Mbit/s Ethernet environment

                 No Proc  1 Sleep Proc  7 Sleep Proc
  CP#1           13 ms    17 ms         32 ms
  CP#2           17 ms    52 ms         152 ms
  CP transfer    267 ms   268 ms        468 ms
  CP interval    500 ms   500 ms        500 ms
  Data + stack   0 KB     37 KB         252 KB
  Shared memory  0 KB     0 KB          0 KB
  Files          0 KB     0 KB          0 KB

Table 7. Checkpoint time and data size for the 50-MHz SuperSPARC, Solaris 2.4, 10-Mbit/s Ethernet environment

                 No Proc  1 Sleep Proc  7 Sleep Proc
  CP#1           8 ms     10 ms         12 ms
  CP#2           12 ms    41 ms         128 ms
  CP transfer    264 ms   272 ms        464 ms
  CP interval    500 ms   500 ms        500 ms
  Data + stack   0 KB     37 KB         256 KB
  Shared memory  0 KB     0 KB          0 KB
  Files          0 KB     0 KB          0 KB

Table 8. Checkpoint time and data size for the UltraSPARC1, Solaris 2.5.1, 10-Mbit/s Ethernet environment

                 No Proc  1 Sleep Proc  7 Sleep Proc
  CP#1           4 ms     6 ms          13 ms
  CP#2           13 ms    20 ms         71 ms
  CP transfer    317 ms   364 ms        1056 ms
  CP interval    500 ms   500 ms        500 ms
  Data + stack   0 KB     58 KB         391 KB
  Shared memory  0 KB     0 KB          0 KB
  Files          0 KB     0 KB          0 KB

Table 9. Overhead incurred in checkpoint processing and in normal processing

                    in CP processing   in normal processing
  ARTEMIS daemons        VO1                 VO3
  Jacket routines        VO2                 VO4

We assume that the CPU usage is 100% when the maximum throughput on a 40-MHz SuperSPARC is N tpmC, and we calculate the maximum throughput when TPC-C is executed under the control of ARTEMIS. A minute is 60 × 1000 ms. If a checkpoint is taken every second, FO amounts to 194 × 60 ms per minute. Normally a transaction takes 60 × 1000/N ms of CPU time, but under ARTEMIS we must add 3.72 ms of VO, so a transaction takes 60 × 1000/N + 3.72 ms under the control of ARTEMIS. Thus, the maximum throughput under the control of ARTEMIS is

  (60 × 1000 - 194 × 60) / (60 × 1000/N + 3.72)  tpmC

Thus, if the maximum throughput on a 40-MHz SuperSPARC is 200, 300, or 400 tpmC, the overhead of ARTEMIS is as shown in Table 10.
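The model can be checked numerically. The short program below, an illustration rather than part of the paper's tooling, evaluates the expression above for the three native throughput values and prints the throughput under ARTEMIS and the resulting overhead, which can be compared with Table 10.

```c
#include <stdio.h>

int main(void)
{
    const double FO = 194.0;   /* fixed overhead per checkpoint [ms]     */
    const double VO = 3.72;    /* variable overhead per transaction [ms] */
    const double native[] = {200.0, 300.0, 400.0};   /* native tpmC      */

    for (int i = 0; i < 3; i++) {
        double n = native[i];
        /* one checkpoint per second -> 60 checkpoints per minute */
        double artemis = (60.0 * 1000.0 - FO * 60.0) / (60.0 * 1000.0 / n + VO);
        printf("%4.0f tpmC -> %7.2f tpmC under ARTEMIS (overhead %.1f%%)\n",
               n, artemis, 100.0 * (n - artemis) / n);
    }
    return 0;
}
```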

According to Tables 5, 6, 7, and 8, the ratios of the checkpoint time on the other processors to that on the 40-MHz SuperSPARC are

• 50-MHz SuperSPARC: 0.84
• UltraSPARC1: 0.47

According to these ratios, FO and VO for the 50-MHz SuperSPARC are FO = 194 × 0.84 ms per checkpoint and VO = 3.72 × 0.84 ms per transaction, and the overhead of ARTEMIS is as shown in Table 11 if the maximum throughput is 300, 400, or 500 tpmC.

Similarly, FO and VO for the UltraSPARC1 are FO = 194 × 0.47 ms per checkpoint and VO = 3.72 × 0.47 ms per transaction, and the overhead of ARTEMIS is as shown in Table 12 if the maximum throughput is 400, 600, or 800 tpmC.

The data size transferred from the primary server computer to the backup server computer is divided into the following two parts:

• Fixed Transfer Data Size (FD): transfer data size that does not depend on the number of transactions.
• Variable Transfer Data Size (VD): transfer data size that depends on the number of transactions.

According to Tables 1 and 2, FD and VD transferred from the primary server computer to the backup server computer are FD = 349 KB per checkpoint and VD = 14.7 KB per transaction.

Thus, if the throughput is N tpmC and a checkpoint is taken every second, the transfer data size when the TPC-C benchmark is executed under the control of ARTEMIS is

  349 + 14.7 × N/60  KB per second

According to the above, if the throughput is 1000, 5000, or 10,000 tpmC, the transfer data size is as shown in Table 13.
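As with the overhead model, the transfer rates can be checked with a short illustrative program; it assumes one checkpoint per second and the FD and VD values above, and its output can be compared with Table 13.

```c
#include <stdio.h>

int main(void)
{
    const double FD = 349.0;   /* KB per checkpoint (one checkpoint per second) */
    const double VD = 14.7;    /* KB per transaction                            */
    const double load[] = {1000.0, 5000.0, 10000.0};   /* throughput in tpmC    */

    for (int i = 0; i < 3; i++) {
        double kb_per_s = FD + VD * load[i] / 60.0;    /* transactions per second */
        printf("%6.0f tpmC -> %.1f MB/s\n", load[i], kb_per_s / 1000.0);
    }
    return 0;
}
```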

Table 10. Overhead of ARTEMIS on the 40-MHz SuperSPARC

  Normal     ARTEMIS       Overhead
  200 tpmC   159.23 tpmC   20.4%
  300 tpmC   238.38 tpmC   20.9%
  400 tpmC   314.60 tpmC   21.0%

Table 11. Overhead of ARTEMIS on the 50-MHz SuperSPARC

  Normal     ARTEMIS       Overhead
  300 tpmC   247.24 tpmC   17.6%
  400 tpmC   327.98 tpmC   18.0%
  500 tpmC   407.89 tpmC   18.0%

Table 12. Overhead of ARTEMIS on the UltraSPARC1

  Normal     ARTEMIS       Overhead
  400 tpmC   359.41 tpmC   10.0%
  600 tpmC   536.02 tpmC   11.0%
  800 tpmC   710.62 tpmC   11.0%

Table 13. Transfer data size from the primary server computer to the backup server computer

  Throughput    Transfer data size
  1000 tpmC     0.6 MB/s
  5000 tpmC     1.6 MB/s
  10,000 tpmC   2.8 MB/s

5. Conclusions

The basic mechanism of fault recovery in ARTEMIS is the Distributed Checkpoint mechanism. Only processes which must be reliable are executed under the control of ARTEMIS, and these processes are checkpointed every second. When a failure occurs, all the processes which are executed under the control of ARTEMIS are restarted from the checkpoint.

In ARTEMIS, we propose the Distributed Replication mechanism for building fault tolerant systems with the Distributed Checkpoint mechanism. In the Distributed Replication mechanism, a server computer is configured with a primary server computer and a backup server computer, and the checkpoints taken in the primary server computer are transferred to the backup server computer. The file operations which are executed in the primary server computer are also executed in the backup server computer, so the processes which execute system calls, including file operations, can be restarted both in the primary server computer and in the backup server computer.

This paper has presented the mechanism and the evaluation of Distributed Replication. When application programs are executed under the control of ARTEMIS, the system can be recovered from a failure without modifying the application programs. The overhead of ARTEMIS is between 10% and 20% in TPC-C benchmark tests.

REFERENCES

1. Eguchi K, Mori R. Overview of and trends in high-availability system technologies. Toshiba Rev 1997;52:6–9. (in Japanese)
2. Mori R, Kobayashi S, Kaneko T, Hara S. PC server cluster: toward higher availability. IPSJ 1998;39:49–54. (in Japanese)
3. Nelson VP. Fault-tolerant computing: Fundamental concepts. IEEE Comput 1990;23:19–25.
4. Siewiorek DP. Fault tolerance in commercial computers. IEEE Comput 1990;23:26–37.
5. Shirakihara T, Hirayama H, Kanai T. Design and implementation of ARTEMIS. Tech Rep IPSJ, 97-OS-32, p 183–188, 1997. (in Japanese)
6. Hirayama H, Shirakihara T, Kanai T, Sato K. Fault tolerant system with ARTEMIS. Tech Rep IEICE 1997;FTS97-19. (in Japanese)
7. Shirakihara T, Hirayama H, Kanai T, Sato K. ARTEMIS: Advanced reliable distributed environment middleware system. Proc Int Conf PDPTA'97, Vol. 1, p 97–106.
8. Manabe Y, Aoyagi S. Distributed checkpoint and rollback algorithm. IPSJ 1993;34:1366–1374. (in Japanese)
9. Hirayama H, Masubuchi Y, Hoshina S, Shimada T, Kato N, Nozaki M. Design and evaluation of highly reliable server with QRM. J IEICE 1997;J80-D-I:916–927. (in Japanese)
10. Masubuchi Y, Hoshina S, Shimada T, Hirayama H, Kato N. Fault recovery module for multiprocessor servers. Proc Int Conf FTCS-27, p 184–193, 1997.
11. Plank JS, Beck M, Kingsley G, Li K. Libckpt: Transparent checkpointing under UNIX. Proc USENIX Winter 1995 Tech Conf, p 213–224.
12. Transaction Processing Performance Council. TPC-C benchmark specification. http://www.tpc.org/cspec.html/

AUTHORS (from left to right)

Hideaki Hirayama (member) received his B.S. degree from Keio University in 1981. Since 1981 he has been a research engineer at Toshiba Corporation. His research interests include operating systems and fault tolerance.

Toshio Shirakihara received his B.S. and M.S. degrees from Kyushu University in 1987 and 1989. Since 1989 he has been a research engineer at Toshiba Corporation. His research interests include operating systems and distributed processing.

Tatsunori Kanai (member) received his B.S. and M.S. degrees from Kyoto University in 1984 and 1986. Since 1989 he has been a research engineer at Toshiba Corporation. His research interests include operating systems, databases, and programming languages.