Distributed Replication Mechanism for Building Fault Tolerant
System with Distributed Checkpoint Mechanism
Hideaki Hirayama
Information & Communication Systems Laboratory, Toshiba Corporation, Ome, Japan 198-0025
Toshio Shirakihara
Communication and Information Systems Research Laboratories, Research and Development Center, Toshiba Corporation,
Kawasaki, Japan 210-0901
Tatsunori Kanai
Information & Communication Systems Laboratory, Toshiba Corporation, Ome, Japan 198-0025
SUMMARY
A distributed fault tolerant middleware system called
"ARTEMIS" (Advanced Reliable disTributed Environment
MIddleware System) was developed for the purpose of
building fault tolerant systems without modifying either the
source code or the binary code of application programs in
open systems. In ARTEMIS, we proposed the Distributed
Replication mechanism for building fault tolerant systems
with the Distributed Checkpoint mechanism. In the Distrib-
uted Replication mechanism, a server computer is config-
ured with a primary server computer and a backup server
computer. If a failure occurs, it recovers the system in a few
seconds to tens of seconds. Its overhead is 10% to 20% in
TPC-C Benchmark tests. © 2000 Scripta Technica, Syst
Comp Jpn, 31(5): 12–23, 2000
Key words: Checkpoint; rollback; reliability;
availability; fault tolerance.
1. Introduction
When we build an enterprise information system with
open systems, we must pay attention to its reliability. In
particular, a failure in a server computer may take a whole
system down. Therefore, in many mission-critical systems,
a server computer is configured as a High-Availability
system [1, 2] with a primary server and a backup server.
Thus, if the primary server computer goes down, its func-
tions must be taken over by the backup server computer. In
a database management system, journal recovery must
be executed if the primary server computer goes down
and the backup server computer
takes over the processes which were being executed in the
primary server computer. The processes cannot be taken
over quickly because the journal recovery process takes a
few minutes to tens of minutes.
Fault tolerant computers [3, 4] have traditionally
been used for solving those problems. But it is impossible
to develop traditional fault tolerant computers in open
systems, because of cost and standardization considerations.
(Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J82-D-I, No. 3, March 1999, pp. 496–507.)
We therefore proposed a middleware system
"ARTEMIS" (Advanced Reliable disTributed Environment
MIddleware System) [5–7] which builds a new type of fault
tolerant system suited to open systems. ARTEMIS takes
checkpoints of processes which need higher reliability. If a
failure occurs, ARTEMIS restarts the processes from the
checkpoints. This technique is known as the Distributed
Checkpoint mechanism [8].
ARTEMIS assumes that a High-Availability system
detects a failure and proposes the following mechanism for
building fault tolerant systems in open systems:
- the Distributed Checkpoint mechanism, which
supports not only message communications but
also interprocess communications
- the Distributed Replication mechanism, which
builds fault tolerant systems with duplicated serv-
er computers
- the Jacket Routine mechanism, which adds the
above mechanisms without updating the source
code or binary code of application programs
This paper presents both the mechanism and the
evaluation of Distributed Replication, which builds fault
tolerant systems with duplicated server computers.
2. System Architecture of ARTEMIS
2.1. Checkpoint and rollback mechanism
The basic mechanism of ARTEMIS is a per-process
checkpoint and rollback mechanism, shown in Fig. 1. Proc-
esses which should be reliable are executed under the
control of ARTEMIS and checkpoints are taken for them
about every second. If a failure occurs, all the processes
which are executed under the control of ARTEMIS are
restarted from the last checkpoint. The mechanism shown
in Refs. 9 and 10 also makes the system reliable based on
checkpoint and rollback. But since it uses a per-system
checkpoint and rollback mechanism, each checkpoint
holds the whole system image, including the OS.
The checkpoints in ARTEMIS contain the following
state:
- the address space image and processor context of
the process
- the services provided by the OS to the process
The address space image is saved by the incremental
checkpoint mechanism [11]. In this mechanism, all the
pages of the address space, including writable pages, are
write-protected. If a protection fault occurs on one of these
pages, ARTEMIS records the modified page number. When
ARTEMIS takes a checkpoint, it saves only the modified
pages.
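The incremental mechanism can be pictured with a small simulation (our own sketch, not ARTEMIS code; the page-fault trap is replaced by explicit bookkeeping in the write path, and all names are hypothetical):

```python
PAGE_SIZE = 4096

class IncrementalCheckpointer:
    def __init__(self, num_pages):
        # Simulated address space: one bytes object per page.
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = set()   # page numbers modified since the last checkpoint
        self.saved = {}      # page number -> last checkpointed image

    def write(self, page_no, data):
        # A real system write-protects pages and records the page number
        # in the fault handler; here the write path records it directly.
        self.pages[page_no] = data.ljust(PAGE_SIZE, b"\0")[:PAGE_SIZE]
        self.dirty.add(page_no)

    def checkpoint(self):
        # Save only the pages modified since the previous checkpoint.
        for page_no in self.dirty:
            self.saved[page_no] = self.pages[page_no]
        n = len(self.dirty)
        self.dirty.clear()
        return n             # number of pages actually saved

cp = IncrementalCheckpointer(num_pages=1024)
cp.write(3, b"hello")
cp.write(7, b"world")
print(cp.checkpoint())   # 2: only the two modified pages are saved
print(cp.checkpoint())   # 0: nothing changed since the last checkpoint
```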
The services provided by the OS are saved as infor-
mation by hooking the system calls which are called by the
processes. If a failure occurs, the services provided by the
OS to the process are recovered by reissuing the system
calls based on the information. For example, for file opera-
tions, the files which are open when the checkpoint is
taken are reopened with the same file descriptors and
the file pointers are set to their positions at the time of the
checkpoint. How to recover the data in the files is presented
in Section 3.
Semaphores and shared memory segments cannot be
reacquired with the same IDs after a restart. Therefore, ARTEMIS treats the
IDs of semaphores and shared memory segments that ap-
plication programs access as virtual IDs. ARTEMIS treats
the usual IDs as physical IDs. The virtual IDs and the
physical IDs have a one-to-one relation. If a failure occurs
and ARTEMIS restarts the processes, ARTEMIS sets the
newly acquired IDs as the physical IDs. The application
programs access semaphores and shared memory segments
by the virtual IDs, which are the same as before the failure
occurred. ARTEMIS hooks the system calls which are called
by processes and converts the virtual IDs to the physical
IDs. Therefore, although the IDs of semaphores and shared memory
segments are changed by reacquisition when ARTEMIS
restarts the processes, the processes can use the same
IDs as were used before the failure occurred.
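The virtual/physical ID indirection can be sketched as follows (class and method names are hypothetical, not from ARTEMIS):

```python
class IdTable:
    """Maps stable virtual IDs, seen by the application, to the
    physical IDs the OS actually assigned."""

    def __init__(self):
        self.v2p = {}            # virtual ID -> physical ID
        self.next_virtual = 1

    def register(self, physical_id):
        # Called when the process first acquires a semaphore/shm segment.
        vid = self.next_virtual
        self.next_virtual += 1
        self.v2p[vid] = physical_id
        return vid               # the application only ever sees vid

    def translate(self, vid):
        # Called by the jacket routine on every hooked system call.
        return self.v2p[vid]

    def rebind(self, vid, new_physical_id):
        # Called during restart: the OS assigned a new physical ID,
        # but the application keeps using the old virtual ID.
        self.v2p[vid] = new_physical_id

table = IdTable()
vid = table.register(physical_id=4711)   # before the failure
assert table.translate(vid) == 4711
table.rebind(vid, new_physical_id=9042)  # after restart, new OS ID
assert table.translate(vid) == 9042      # same virtual ID still works
```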
2.2. Jacket routine
ARTEMIS provides fault recovery functions based
on the checkpoint and rollback mechanism to application
programs without modifying the source code or binary code
of those programs. To provide such functions, ARTEMIS
hooks the system calls which are called by the application
processes. We call this mechanism the "Jacket Routine."
Fig. 1. Principle of fault recovery mechanism of ARTEMIS.
ARTEMIS has library functions which have the same
names as the system calls which are hooked. Those func-
tions collect the information for fault recovery and then call
the actual system functions. Those functions are linked
instead of the actual system functions. Thus, when applica-
tion processes make system calls, the jacket routines are
called; they collect information for fault recovery and then
call the actual system functions.
Figure 2(1) shows how the standard functions are
called normally. On the other hand, Fig. 2(2) shows how the
standard functions are called under the control of
ARTEMIS. As an example, standard functions are stored in
the library LIBC. The standard functions are subdivided
into the following three types from the viewpoint of access-
ing the kernel:
- SYS1
- LIB1
- LIB2
SYS1 includes system call functions which issue a
trap instruction to access the kernel; open is an example of
this type of function. LIB1 includes simple functions which
execute internally and do not access the kernel; strcmp is
an example of this type of function. LIB2 includes indirect
system call functions which call SYS1-type functions;
fopen is an example of this type of function.
Under the control of ARTEMIS, the jacket routine
library functions are linked instead of LIBC functions.
Thus, SYS1-type functions are referenced from the jacket
routine library, LIB1-type functions are referenced from the
standard library, and LIB2-type functions are referenced
from the standard library, but the functions indirectly called
are referenced from the jacket routine library.
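In Python terms, the jacket idea looks roughly like this (ARTEMIS does the substitution at link time against LIBC; the function names and log format here are hypothetical). The wrapper records the information needed for fault recovery, then performs the real call:

```python
import os
import tempfile

recovery_log = []   # information needed to reissue system calls on rollback

def jacket_open(path, flags, mode=0o644):
    # Jacket routine: record what is needed to reopen the file with the
    # same descriptor after a rollback, then perform the actual call.
    fd = os.open(path, flags, mode)
    recovery_log.append(("open", path, flags, mode, fd))
    return fd

# Usage: the application calls jacket_open exactly as it would call open.
tmp_fd, path = tempfile.mkstemp()
os.close(tmp_fd)
fd = jacket_open(path, os.O_RDWR)
print(recovery_log[0][0])   # open
os.close(fd)
os.remove(path)
```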
2.3. Distributed checkpoint
In ARTEMIS, we extend the per-process checkpoint
and rollback mechanism to support the whole distributed
system. Therefore, ARTEMIS takes checkpoints of multi-
ple processes in multiple computers.
We now consider how to take checkpoints of multiple
processes which cooperate with each other with communi-
cation functions such as the socket interface. If multiple
processes cooperate with connection-oriented communica-
tion such as TCP, there is a problem in taking checkpoints.
We consider the following situations.
(1) A sender process is checkpointed just after it sends
data to a receiver process.
(2) The receiver process is checkpointed just before
it receives the data.
(3) A failure occurs and the processes are restarted
from the checkpoints.
In this case, the sender process had already sent the
data, but the receiver process had not received the data yet.
Thus, the data are lost.
Next we consider the following situation.
(1) A sender process is checkpointed just before it
sends data to a receiver process.
(2) The receiver process is checkpointed just after it
receives the data.
(3) A failure occurs and the processes are restarted
from the checkpoints.
In this case, the sender process had not sent the data,
but the receiver process had already received the data. The
sender process sends the data again and the receiver process
receives the same data again. Thus, the data are duplicated.
To solve the above problem, ARTEMIS takes check-
points of processes which cooperate with connection-ori-
ented communication such as TCP only when each receiver
process has received all the data which were sent by the
sender processes. Figure 3 shows the protocol. In the first
phase, the processes are inhibited from sending data, and
each tells the receiver processes the amount of data it has
already sent. In the second phase, the receiver processes
receive that amount of data, and then all the processes take
checkpoints simultaneously. To arbitrate the processes,
there is a coordination daemon in each computer.
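A toy model of the two-phase protocol, tracking in-flight bytes per connection (our own names, not the ARTEMIS API):

```python
from collections import deque

class Connection:
    def __init__(self):
        self.in_flight = deque()   # data sent but not yet received
        self.bytes_sent = 0
        self.bytes_received = 0

    def send(self, data):
        self.in_flight.append(data)
        self.bytes_sent += len(data)

    def receive_one(self):
        data = self.in_flight.popleft()
        self.bytes_received += len(data)
        return data

def two_phase_checkpoint(connections):
    # Phase 1: sending is inhibited; each sender announces bytes_sent.
    announced = {c: c.bytes_sent for c in connections}
    # Phase 2: receivers drain until they have received that amount.
    for c in connections:
        while c.bytes_received < announced[c]:
            c.receive_one()
    # Now all processes take checkpoints simultaneously; every
    # connection is quiescent, so no message is lost or duplicated.
    return all(c.bytes_received == c.bytes_sent for c in connections)

conn = Connection()
conn.send(b"abc")
conn.send(b"de")
print(two_phase_checkpoint([conn]))   # True: channel drained before checkpoint
```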
On the other hand, in connectionless communication
such as UDP, loss and duplication of transmitted data are
permitted, so it is not necessary to coordinate the
processes with the above protocol. The Dis-
tributed Checkpoint protocol is subdivided into coordinated
checkpoint and autonomous checkpoint types [8]. In
ARTEMIS, we have proposed a two-phase protocol, which is
one of the coordinated checkpoint protocols.
Fig. 2. Mechanism of jacket routine.
3. Distributed Replication Mechanism
3.1. Building fault tolerant system
The Distributed Checkpoint mechanism provides the
basic functions to recover processes from a failure. But by
itself it cannot assure availability against all the failures
that occur in real-world computer systems:
- process abort
- OS panic
- hardware failure
The traditional Distributed Checkpoint mechanism
treats only "process abort." Of course, if checkpoints are
saved in nonvolatile memory, when the system crashes
owing to an OS panic or hardware failure, it can restart
processes after the system is rebooted. But it cannot recover
the system immediately and thus it is not possible to build
fault-tolerant systems.
In ARTEMIS, we propose the Distributed Replica-
tion mechanism for the purpose of building fault-tolerant
systems with the Distributed Checkpoint mechanism. In
that mechanism, a server computer is configured with a
primary server computer and a backup server computer.
ARTEMIS sends checkpoints taken in the primary server
computer to the backup server computer and updates files
in the primary server computer as well as files in the backup
server computer. Thus, processes can be restarted in both
the primary server computer and the backup server com-
puter.
This Distributed Replication mechanism is not pro-
vided by other systems.
3.2. Mechanism of distributed replication
Figure 4 shows the basic mechanism of Distributed
Replication. In the Distributed Replication mechanism,
ARTEMIS hooks file update operations executed by the
processes in primary servers and collects the file update
information (1). For the write operation, the file update
information includes the following:
- file descriptor
- file seek pointer
- data length
- write data
The file update information is buffered in the primary
server computer and the data in the buffer are sent to the
backup server computer when the buffer is filled with data
(2). The file update information sent to the backup server
computer is just linked to the "nondeterministic queue"
without executing the file operations based on the file
update information in the backup server computer. When a
checkpoint is taken, the file update information buffered in
the primary server computer is flushed to the backup server
computer. In the backup server computer, the file update
information linked to the nondeterministic queue is moved
to the "deterministic queue" (3). The file operations based
on the file update information linked to the deterministic
queue are executed in the backup server computer after the
next checkpoint is taken (4). Therefore, under the control
of the Distributed Replication mechanism, the file update
operations executed in the primary server computers are
executed in the backup server computers after the check-
point is taken.
Fig. 3. Protocol of distributed checkpoint.
Fig. 4. Mechanism of distributed replication.
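The queue discipline of Fig. 4 can be sketched as follows (class and method names are hypothetical, not from ARTEMIS):

```python
class BackupReplica:
    def __init__(self):
        self.nondeterministic = []   # updates received after the last checkpoint
        self.deterministic = []      # updates confirmed by a checkpoint
        self.applied = []            # updates already executed on the files

    def receive_update(self, record):
        # (2) file update records arrive from the primary's buffer
        self.nondeterministic.append(record)

    def on_checkpoint(self):
        # (3) a checkpoint confirms the buffered updates: promote them
        self.deterministic.extend(self.nondeterministic)
        self.nondeterministic = []

    def apply_deterministic(self):
        # (4) after the checkpoint, replay confirmed updates on the files
        self.applied.extend(self.deterministic)
        self.deterministic = []

backup = BackupReplica()
for op in ["A", "B", "C"]:
    backup.receive_update(op)
backup.on_checkpoint()          # CP1 confirms A, B, C
backup.apply_deterministic()    # A, B, C executed on the backup's files
for op in ["D", "E"]:
    backup.receive_update(op)
backup.on_checkpoint()          # CP2 confirms D, E
print(backup.applied)           # ['A', 'B', 'C']
print(backup.deterministic)     # ['D', 'E']
```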
15
Figure 5 shows how files in the backup server com-
puter are updated identically to those in the primary server
computer. First, file update operations "A," "B," and "C"
are executed in the primary server computer. Then check-
point "CP1" is taken and file update information "A," "B,"
and "C" is sent to the backup server computer to be held.
After checkpoint "CP1," the file operations based on file
update information "A," "B," and "C" are executed in the
backup server computer. Then file update operations "D"
and "E" are executed in the primary server computer. Then
checkpoint "CP2" is taken and file update information "D"
and "E" is sent to the backup server computer to be held.
After the checkpoint "CP2," the file operations based on
file update information "D" and "E" are executed in the
backup server computer.
3.3. Recovery from process abort
In ARTEMIS, if a process running in the primary
server computer aborts, all processes are restarted from the
checkpoint in the same computers. That is, processes which
are running in the primary server computer are restarted in
the same primary server computer. Thus, before restarting
the processes, it is necessary to roll back the files to their
states at the checkpoint in both the primary server
computer and the backup server computer.
Figure 6 shows the mechanism used to roll back files
to the checkpoint in both the primary server computer and
the backup server computer when a process aborts. When
a process running in the primary server computer aborts,
ARTEMIS terminates all processes and flushes the file
update information remaining in the buffer of the primary
server computer to the backup server computer (1). In the
backup server computer, ARTEMIS executes the file opera-
tions based on the file update information linked to the
deterministic queue and updates the files in the backup
server computer (2), because the file update information
linked to the deterministic queue is based on the operations
executed before the checkpoint. On the other hand,
ARTEMIS does not execute file operations based on the file
update information linked to the nondeterministic queue,
because the file update information linked to the nondeter-
ministic queue is based on operations executed after the
checkpoint. By executing the above operations, the files in
the backup server computer are rolled back to the check-
point.
But the files in the primary server computer were
updated after the checkpoint. The file update information is
linked to the nondeterministic queue in the backup server
computer and the "before" image data exist in the files in
the backup server computer. Thus, ARTEMIS builds file
recovery information from the file update information
linked to the nondeterministic queue in the backup server
computer and the "before" image data in the files in the
backup server computer (3). As an example, if the operation
is a file write, ARTEMIS reads the "before" image data
from the files in the backup server instead of executing the
file update operations, and then sends the file recovery
information to the primary server computer, where the file
operations based on that information are executed (4). Thus,
the files in the primary server computer are restored to the
states at the checkpoint.
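The before-image construction can be sketched like this (a simplified model with files as byte strings; all names are ours, not ARTEMIS's):

```python
def build_recovery_info(nondeterministic_queue, backup_files):
    # For each post-checkpoint write still in the nondeterministic
    # queue, read the old contents at the same offset from the backup's
    # files, which have not yet been updated past the checkpoint.
    recovery = []
    for op, path, offset, length, _data in nondeterministic_queue:
        if op == "write":
            before = backup_files[path][offset:offset + length]
            recovery.append((path, offset, before))
    return recovery

def roll_back_primary(primary_files, recovery):
    # Reissue the "before" images so the primary's files return to
    # their state at the checkpoint.
    for path, offset, before in recovery:
        buf = bytearray(primary_files[path])
        buf[offset:offset + len(before)] = before
        primary_files[path] = bytes(buf)

backup_files = {"db": b"AAAA"}            # backup state at the checkpoint
primary_files = {"db": b"AXXA"}           # primary updated after the checkpoint
queue = [("write", "db", 1, 2, b"XX")]    # the post-checkpoint write
roll_back_primary(primary_files, build_recovery_info(queue, backup_files))
print(primary_files["db"])                # b'AAAA'
```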
Figure 7 shows how the files in the primary and
backup server computers are restored to the states of the
checkpoint by the Distributed Replication mechanism
when a process aborts. First, file update operations such as
A, B, and C are executed in the primary server computer.
Then, checkpoint CP1 is taken. Up to this time, the file
update information A, B, and C is sent to the backup server
computer.
Fig. 5. File update under control of distributed replication mechanism.
Fig. 6. Recovery mechanism from process abort.
After the checkpoint, file update operations A, B,
and C are executed in the backup server computer. On the
other hand, file update operations D and E are executed in
the primary server computer. After that, a process which
was being executed in the primary server computer aborts.
In this case, all processes are restarted from the checkpoint
in the same server computers. At this time, if file update
information D and E remains in the primary server com-
puter, it is flushed to the backup server computer. Then, file
update operations A, B, and C before the checkpoint are
executed in the backup server computer. And the "before"
image data are read from the files in the backup server
computer based on the file update information linked to the
nondeterministic queue. Then the files in the primary server
computer are recovered based on that file recovery information.
3.4. Recovery from system crash
If the primary server computer goes down as a result
of an OS panic or hardware failure, ARTEMIS restarts the
processes which were running in the primary server com-
puter from the checkpoint in the backup server computer.
At this time, before the processes are restarted, the files in
the backup server computer are restored to the states at the
checkpoint.
Figure 8 shows how ARTEMIS restores the files in
the backup server computer to the states at the checkpoint
when the primary server computer goes down. When the
primary server computer goes down, the processes which
were running in the primary server computer are restarted
in the backup server computer. Before the processes are
restarted, the files in the backup server computer are re-
stored to the states at the checkpoint. ARTEMIS executes
the file operations based on the file update information
linked to the deterministic queue in the backup server
computer (1), because the file update information is based
on file operations which were executed before the check-
point. On the other hand, the file update information linked
to the nondeterministic queue is deleted instead of execut-
ing it (2), because the file update information is based on
file operations which were executed after the checkpoint.
Thus, the files in the backup server computer are restored
to the states of the checkpoint, if the primary server com-
puter goes down.
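A minimal sketch of this takeover rule (hypothetical names): apply the deterministic queue, discard the nondeterministic one, and the backup's files land exactly on the checkpoint state:

```python
def restore_backup_to_checkpoint(deterministic, nondeterministic, apply_op):
    for record in deterministic:    # (1) replay pre-checkpoint updates
        apply_op(record)
    nondeterministic.clear()        # (2) delete post-checkpoint updates
    deterministic.clear()

applied = []
det, nondet = ["A", "B", "C"], ["D", "E"]
restore_backup_to_checkpoint(det, nondet, applied.append)
print(applied, nondet)              # ['A', 'B', 'C'] []
```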
Figure 9 shows how the files in the backup server
computer are restored to the states at the checkpoint by the
Distributed Replication mechanism when the primary serv-
er goes down. First, the file update operations A, B, and C
are executed in the primary server.
Then, checkpoint CP1 is taken and file update infor-
mation A, B, and C is sent to the backup server. After
checkpoint CP1, the file update operations based on file
update information A, B, and C are executed in the backup
server. On the other hand, file update operations D and E
are executed in the primary server computer. After that, the
primary server goes down. In this case, the processes which
were running in the primary server computer are restarted
in the backup server computer.
Fig. 7. File recovery from process abort.
Fig. 8. Recovery mechanism from system abort.
Fig. 9. File recovery from system down.
Before restarting the proc-
esses, the file update operations based on file update infor-
mation A, B, and C linked to the deterministic queue which
were executed before checkpoint CP1 in the primary server
computer are executed in the backup server computer, and
file update information D and E linked to the nondetermin-
istic queue is deleted instead of being executed. In this way,
the Distributed Replication mechanism hooks the file up-
date operations executed in the primary server computer
and gets the file update information. And based on the file
update information, it updates files in the backup server
computer identically to those in the primary server com-
puter, and restores the files in the backup server computer
to the states at the checkpoint if the primary server computer
goes down. Until now, we have discussed only write opera-
tions among the file update operations, but other file
update operations, such as file creation, deletion, opening,
and closing, are treated in the same way.
3.5. Reduplication
When the primary server computer goes down and
the backup server computer takes over processing, the
server computer is configured as nonduplicated. In the
nonduplicated configuration, ARTEMIS cannot recover the
system from a further failure. The Distributed Replication
mechanism provides a function to build a new backup
server computer into the system after it is rebooted. For
configuring the system as a reduplicated system again,
ARTEMIS saves the information about the OS services
after the primary server computer goes down and the
backup server computer takes over the processing. But the
file update information is not saved because it is too large
to hold. When ARTEMIS starts reduplication, it
resumes saving the file update information and then copies the
following files from the primary server to the backup serv-
er:
- files which are open in write mode
- files which were opened but are now closed
When the copying process is completed, ARTEMIS
takes a checkpoint. If the files are updated while they are
being copied, there is no problem, because that file update
information is saved and transferred to the backup server
computer at the first checkpoint after the reduplication, and
the file operations based on the file update information are
executed in the backup server computer after the next
checkpoint. Thus, it is possible to reduplicate the server
computer without stopping the system.
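The reduplication sequence can be sketched with toy stand-in objects (all names hypothetical; the real system copies files and takes checkpoints through ARTEMIS itself):

```python
class ToyPrimary:
    def __init__(self):
        self.files = {"a.dat": b"1", "b.dat": b"2"}   # files to replicate
        self.saving_updates = False
        self.checkpoints = 0

    def start_saving_update_info(self):
        # Resume hooking and buffering file update information.
        self.saving_updates = True

    def take_checkpoint(self):
        self.checkpoints += 1

def reduplicate(primary, backup_files):
    primary.start_saving_update_info()   # updates during the copy are captured
    for path, data in primary.files.items():
        backup_files[path] = data        # bulk copy without stopping the system
    primary.take_checkpoint()            # first checkpoint flushes the updates
                                         # buffered while copying

primary, backup_files = ToyPrimary(), {}
reduplicate(primary, backup_files)
print(sorted(backup_files))              # ['a.dat', 'b.dat']
```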
4. Implementation and Evaluation
4.1. Implementation
ARTEMIS has been implemented on Solaris, and
many application programs, such as Oracle DBMS, have
been run under the control of ARTEMIS. We determined
that the recovery functions were correctly executed. When
we set the checkpoint interval to 500 ms, it takes two or
three seconds to restart the processes from the checkpoint.
Considering the mechanism, recovery from a failure
caused by a process abort should take longer than recovery
from a failure caused by a system down, but in practice
there is no visible difference; the difference is too small
to measure.
Actually, it takes longer to detect a system down
than to recover from it,
because a process abort is detected quickly, but it takes a
few seconds to a few tens of seconds to detect a system
down. Thus, ARTEMIS can recover Oracle DBMS in a few
seconds, but the total recovery time including the time to
detect the system down ranges from a few seconds to a few
tens of seconds.
Figure 10 shows a three-tier application program
executed under the control of ARTEMIS. In each computer,
a coordination daemon is running to coordinate the taking of
checkpoints with other computers. In the server computers,
the Distributed Replication daemons are running to repli-
cate the files between the primary server computer and the
backup server computer and to recover the files when a
failure occurs.
Fig. 10. Three-tier client–server-type application
program under control of ARTEMIS.
4.2. Measurement
ARTEMIS provides fault recovery without
modifying application programs. But there is an over-
head when application programs are executed under
the control of ARTEMIS. We evaluated the overhead
with the TPC-C benchmark [12].
We ran Oracle DBMS in server computers con-
sisting of 40-MHz SuperSPARCs and Solaris 2.4. The
primary server computer and the backup server com-
puter were connected via a 100-Mbps Ethernet. The
checkpoint interval was 500 ms. Table 1 shows the
checkpoint time and the data size.
The first checkpoint phase takes 38 ms, the sec-
ond checkpoint phase takes 212 ms, and the check-
point transfer time is 343 ms. The first checkpoint
phase and the checkpoint transfer are executed concur-
rently with normal processing. But the second check-
point phase cannot be executed simultaneously with
normal processing. It takes 500 ms after the comple-
tion of checkpoint transfer before the start of the next
checkpoint.
In the checkpoint transfer, ARTEMIS transfers the
checkpoints and the file update information from the pri-
mary server computer to the backup server computer. The
transfer data size of the data and stack is 325 KB, the
transfer data size of the shared memory is 24 KB, and the
transfer data size of the file update information is less than
1 KB, so that the total data size is 349 KB.
Table 2 shows the checkpoint time and data size per
tpmC (TPC-C Transactions Per Minute) for executing TPC-
C benchmark.
In the TPC-C benchmark, when the load increases by
1 tpmC, the first checkpoint phase time increases by 0.2 ms,
the second checkpoint phase time increases by 3.2 ms, the
checkpoint transfer time increases by 1.3 ms, the total
checkpoint time increasing by 4.7 ms; the transfer data size
of data and stack increases by 2.3 KB, the transfer data size
of shared memory increases by 10.6 KB, the transfer data
size of file update information increases by 1.8 KB, the total
transfer data size increasing by 14.7 KB.
The overhead of ARTEMIS is divided into two
parts. The first is the CPU time for the second checkpoint
phase, and the second is the CPU time to record the
update pages and to get the file update information. They
are executed concurrently with normal processing. Table
3 shows the ratio of the used time in checkpoint process-
ing and in normal processing.
Table 4 shows the ratio of the time used by the
ARTEMIS daemons in the second checkpoint phase and
the time used by the jacket routines.
Tables 5, 6, 7, and 8 show the checkpoint time and
the data size in the following environments, respectively:
- 40-MHz SuperSPARC, Solaris 2.4, 100-Mbps Ethernet
- 40-MHz SuperSPARC, Solaris 2.4, 10-Mbps Ethernet
- 50-MHz SuperSPARC, Solaris 2.4, 10-Mbps Ethernet
- UltraSPARC, Solaris 2.5.1, 10-Mbps Ethernet
The following processes are executed, respectively:
Table 1. Checkpoint time and data size of Oracle
Process/data size    Time/size
CP#1 38 ms
CP#2 212 ms
CP transfer 343 ms
CP interval 500 ms
Data + stack 325 KB
Shared memory 24 KB
Files 0 KB
Table 2. Checkpoint time and data size
necessary for 1 tpmC
Process/data size    Time/size
CP#1 0.2 ms
CP#2 3.2 ms
CP transfer 1.3 ms
Data + stack 2.3 KB
Shared memory 10.6 KB
Files 1.8 KB
Table 3. Ratio of used time in checkpoint processing
and normal processing
Process            in CP process   in normal process
ARTEMIS daemon     95%             5%
Jacket routines    84%             16%
Table 4. Ratio of checkpoint time by ARTEMIS
daemons and jacket routines
Process Ratio
Ratio of ARTEMIS daemon in CP 22%
Ratio of Jacket Routine in CP 78%
- No Proc: no processes are executed
- 1 Sleep Proc: one sleep process is executed
- 7 Sleep Proc: seven sleep processes are executed
4.3. Evaluation
The overhead of ARTEMIS in the TPC-C benchmark
is divided into the following two parts:
- Fixed Overhead (FO): the overhead that does not
depend on the number of transactions.
- Variable Overhead (VO): the overhead that depends
on the number of transactions. VO is the total of VO1,
VO2, VO3, and VO4 shown in Table 9.
According to Tables 1, 2, 3, and 4, FO and VO in the
40-MHz SuperSPARC are
    FO = 194 ms per checkpoint,  VO = 3.72 ms per transaction
Table 6. Checkpoint time and data size by 40-MHz
SuperSPARC, Solaris 2.4, and 10 Mbit/s Ethernet
environment
Process/data size   No Proc   1 Sleep Proc   7 Sleep Proc
CP#1                13 ms     17 ms          32 ms
CP#2                17 ms     52 ms          152 ms
CP transfer         267 ms    268 ms         468 ms
CP interval         500 ms    500 ms         500 ms
Data + stack        0 KB      37 KB          252 KB
Shared memory       0 KB      0 KB           0 KB
Files               0 KB      0 KB           0 KB
Table 5. Checkpoint time and data size by 40-MHz
SuperSPARC, Solaris 2.4, and 100 Mbit/s Ethernet
environment
Process/data size   No Proc   1 Sleep Proc   7 Sleep Proc
CP#1                12 ms     17 ms          32 ms
CP#2                17 ms     49 ms          137 ms
CP transfer         263 ms    266 ms         327 ms
CP interval         500 ms    500 ms         500 ms
Data + stack        0 KB      37 KB          253 KB
Shared memory       0 KB      0 KB           0 KB
Files               0 KB      0 KB           0 KB
Table 7. Checkpoint time and data size by 50-MHz
SuperSPARC, Solaris 2.4, and 10 Mbit/s Ethernet
environment
Process/data size   No Proc   1 Sleep Proc   7 Sleep Proc
CP#1                8 ms      10 ms          12 ms
CP#2                12 ms     41 ms          128 ms
CP transfer         264 ms    272 ms         464 ms
CP interval         500 ms    500 ms         500 ms
Data + stack        0 KB      37 KB          256 KB
Shared memory       0 KB      0 KB           0 KB
Files               0 KB      0 KB           0 KB
Table 8. Checkpoint time and data size by UltraSPARC1,
Solaris 2.5.1, and 10 Mbit/s Ethernet environment
Process/data size   No Proc   1 Sleep Proc   7 Sleep Proc
CP#1                4 ms      6 ms           13 ms
CP#2                13 ms     20 ms          71 ms
CP transfer         317 ms    364 ms         1056 ms
CP interval         500 ms    500 ms         500 ms
Data + stack        0 KB      58 KB          391 KB
Shared memory       0 KB      0 KB           0 KB
Files               0 KB      0 KB           0 KB
Table 9. Overhead issued in checkpoint processing and
normal processing
Process            in CP process   in normal process
ARTEMIS daemons    VO1             VO3
Jacket Routines    VO2             VO4
We assume that the CPU usage is 100% when the
maximum throughput in a 40-MHz SuperSPARC is N
tpmC, and we calculate the maximum throughput when we
execute TPC-C under the control of ARTEMIS. A minute
is 60 × 1000 ms. If a checkpoint is taken every second, FO is
194 × 60 ms per minute. Usually, a transaction takes 60 × 1000/N ms
of CPU time. But under ARTEMIS, we must add 3.72 ms of VO,
so a transaction takes 60 × 1000/N + 3.72 ms under the
control of ARTEMIS. Thus, the maximum throughput under
the control of ARTEMIS is
    (60 × 1000 − 194 × 60) / (60 × 1000/N + 3.72) tpmC
Thus, if the maximum throughput in a 40-MHz Super-
SPARC is 200, 300, or 400 tpmC, the overhead of
ARTEMIS is as shown in Table 10.
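The throughput expression can be checked numerically against Table 10 (a sketch; the function name is ours, while FO = 194 ms per checkpoint and VO = 3.72 ms per transaction are taken from the text):

```python
def artemis_throughput(n_tpmc, fo_ms=194.0, vo_ms=3.72):
    # Maximum TPC-C throughput under ARTEMIS, given the uncontrolled
    # maximum throughput n_tpmc, one checkpoint per second.
    cpu_ms_per_minute = 60 * 1000
    available = cpu_ms_per_minute - fo_ms * 60       # minus checkpoint cost
    per_transaction = cpu_ms_per_minute / n_tpmc + vo_ms
    return available / per_transaction

for n in (200, 300, 400):
    t = artemis_throughput(n)
    print(n, round(t, 2), f"{(1 - t / n) * 100:.1f}%")
    # e.g. N = 200 gives about 159.23 tpmC, i.e. 20.4% overhead
```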
According to Tables 5, 6, 7, and 8, the ratios of the
checkpoint time in the other processors to that in the
40-MHz SuperSPARC are
- 50-MHz SuperSPARC: 0.84
- UltraSPARC1: 0.47
According to the above values, FO and VO in the
50-MHz SuperSPARC are
    FO = 194 × 0.84 = 163 ms,  VO = 3.72 × 0.84 = 3.12 ms
and the overhead of ARTEMIS is shown in Table 11
if the maximum throughput is 300, 400, or 500 tpmC.
FO and VO in the UltraSPARC1 are
    FO = 194 × 0.47 = 91.2 ms,  VO = 3.72 × 0.47 = 1.75 ms
and the overhead of ARTEMIS is shown in Table 12
if the maximum throughput is 400, 600, or 800 tpmC.
The data size transferred from the primary server
computer to the backup server computer is divided into the
following two parts.
- Fixed Transfer Data Size (FD): transfer data size
which does not depend on the number of transactions.
- Variable Transfer Data Size (VD): transfer data
size which depends on the number of transactions.
According to Tables 1 and 2, FD and VD transferred
from the primary server computer to the backup server
computer are
    FD = 349 KB per checkpoint,  VD = 14.7 KB per tpmC
Thus, if the throughput is N tpmC and a checkpoint is
taken every second, the transfer data rate when the TPC-C
benchmark is executed under the control of ARTEMIS is
    349 + 14.7 × N/60 KB/s
According to the above, if the throughput is 1000, 5000, or 10,000 tpmC, the transfer data size is as shown in Table 13.
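The transfer-rate calculation can be sketched as follows. The FD and VD constants here are approximate values consistent with Table 13 (roughly 0.35 MB per one-second checkpoint and 15 KB per transaction), not exact figures taken from Tables 1 and 2:

```python
# Sketch of the transfer-rate model; FD and VD are approximate values
# consistent with Table 13, not exact figures from Tables 1 and 2.
FD = 0.35    # MB per checkpoint (one checkpoint taken per second)
VD = 0.015   # MB per transaction (about 15 KB)

def transfer_rate(n_tpmc):
    """Data rate (MB/s) from the primary to the backup server at n_tpmc."""
    tx_per_sec = n_tpmc / 60.0
    return FD + VD * tx_per_sec

for n in (1000, 5000, 10000):
    print(f"{n} tpmC -> {transfer_rate(n):.1f} MB/s")
```

This yields rates that round to 0.6, 1.6, and 2.8 MB/s, matching Table 13.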
Table 10. Overhead of ARTEMIS in the 40-MHz SuperSPARC

Normal      ARTEMIS        Overhead
200 tpmC    159.23 tpmC    20.4%
300 tpmC    238.38 tpmC    20.9%
400 tpmC    314.60 tpmC    21.0%

Table 11. Overhead of ARTEMIS in the 50-MHz SuperSPARC

Normal      ARTEMIS        Overhead
300 tpmC    247.24 tpmC    17.6%
400 tpmC    327.98 tpmC    18.0%
500 tpmC    407.89 tpmC    18.0%

Table 12. Overhead of ARTEMIS in the UltraSPARC1

Normal      ARTEMIS        Overhead
400 tpmC    359.41 tpmC    10.0%
600 tpmC    536.02 tpmC    11.0%
800 tpmC    710.62 tpmC    11.0%

Table 13. Transfer data size from primary server computer to backup server computer

Throughput     Transfer data size
1000 tpmC      0.6 MB/s
5000 tpmC      1.6 MB/s
10,000 tpmC    2.8 MB/s
5. Conclusions
The basic fault recovery mechanism of ARTEMIS is the Distributed Checkpoint mechanism. Only processes which must be reliable are executed under the control of ARTEMIS; these processes are checkpointed every second. When a failure occurs, all the processes executed under the control of ARTEMIS are restarted from the most recent checkpoint.
In ARTEMIS, we proposed the Distributed Replication mechanism for building fault tolerant systems with the Distributed Checkpoint mechanism. In the Distributed
Replication mechanism, a server computer is configured
with a primary server computer and a backup server com-
puter, and checkpoints taken in the primary server computer
are transferred to the backup server computer. The file
operations which are executed in the primary server com-
puter are also executed in the backup server computer, and
the processes which execute system calls, including file
operations, can be restarted both in the primary server
computer and in the backup server computer.
This paper has presented the mechanism and an evaluation of Distributed Replication. When application programs are executed under the control of ARTEMIS, the system can be recovered from a failure without any modification to the application programs. The overhead of ARTEMIS is between 10% and 20% in TPC-C benchmark tests.
REFERENCES

1. Eguchi K, Mori R. Overview of and trends in high-availability system technologies. Toshiba Rev 1997;52:6–9. (in Japanese)
2. Mori R, Kobayashi S, Kaneko T, Hara S. PC server cluster: toward higher availability. IPSJ 1998;39:49–54. (in Japanese)
3. Nelson VP. Fault-tolerant computing: Fundamental concepts. IEEE Comput 1990;23:19–25.
4. Siewiorek DP. Fault tolerance in commercial computers. IEEE Comput 1990;23:26–37.
5. Shirakihara T, Hirayama H, Kanai T. Design and implementation of ARTEMIS. Tech Rep IPSJ, 97-OS-32, p 183–188, 1997. (in Japanese)
6. Hirayama H, Shirakihara T, Kanai T, Sato K. Fault tolerant system with ARTEMIS. Tech Rep IEICE 1997;FTS97-19. (in Japanese)
7. Shirakihara T, Hirayama H, Kanai T, Sato K. ARTEMIS: Advanced reliable distributed environment middleware system. Proc Int Conf PDPTA'97, Vol. 1, p 97–106.
8. Manabe Y, Aoyagi S. Distributed checkpoint and rollback algorithm. IPSJ 1993;34:1366–1374. (in Japanese)
9. Hirayama H, Masubuchi Y, Hoshina S, Shimada T, Kato N, Nozaki M. Design and evaluation of highly reliable server with QRM. J IEICE 1997;J80-D-I:916–927. (in Japanese)
10. Masubuchi Y, Hoshina S, Shimada T, Hirayama H, Kato N. Fault recovery module for multiprocessor servers. Proc Int Conf FTCS-27, p 184–193, 1997.
11. Plank JS, Beck M, Kingsley G, Li K. Libckpt: Transparent checkpointing under UNIX. Proc USENIX Winter 1995 Tech Conf, p 213–224.
12. Transaction Processing Performance Council. TPC-C benchmark specification. http://www.tpc.org/cspec.html/.
AUTHORS (from left to right)
Hideaki Hirayama (member) received his B.S. degree from Keio University in 1981. Since 1981 he has been a research engineer at Toshiba Corporation. His research interests include operating systems and fault tolerance.

Toshio Shirakihara received his B.S. and M.S. degrees from Kyushu University in 1987 and 1989. Since 1989 he has been a research engineer at Toshiba Corporation. His research interests include operating systems and distributed processing.