
Neurocomputing 122 (2013) 50–57


A dual process redundancy approach to transient fault tolerance for ccNUMA architecture

Xingjun Zhang a, Endong Wang b, Feilong Tang c, Meishun Yang a, Hengyi Wei a, Xiaoshe Dong a

a Department of Computer Science & Technology, Xi'an Jiaotong University, Xi'an, China
b Inspur (Beijing) Electronic Information Industry Co. Ltd, Beijing, China
c Department of Computer Science & Engineering, Shanghai Jiaotong University, Shanghai, China

Article info

Article history:
Received 30 September 2012
Received in revised form 4 January 2013
Accepted 6 January 2013
Available online 14 June 2013

Keywords:
Transient fault
ccNUMA
Dual-process
Redundancy


Abstract

Transient faults are a critical concern for the reliability of microprocessor systems. Software fault tolerance is more flexible and lower in cost than hardware fault tolerance, and as architectural trends point toward multicore designs, there is substantial interest in using parallel and redundant hardware resources for transient fault tolerance. This paper proposes a process-level fault tolerance technique, a software-centric approach that efficiently schedules and synchronizes redundant processes over the redundant processors of a ccNUMA system, improving the efficiency of the redundant processes and reducing time and space overhead. The paper focuses on error detection and handling for the redundant processes. A real prototype is implemented that is designed to be transparent to the application. The test results show that the system can promptly detect soft errors of CPU and memory that cause redundant process exceptions, while ensuring that the services of the application are uninterrupted and only briefly delayed.

© 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.neucom.2013.01.043

1. Introduction

A transient fault occurs when an event (e.g., a cosmic particle strike, power supply noise, or device coupling) causes the deposit or removal of enough charge to invert the state of a transistor [1]. The inverted value may propagate and cause a soft error in the computer system. Existing research [2,3] indicates that the future error rate of a single transistor will remain relatively constant. As the number of available transistors per chip continues to grow exponentially, the error rate for an entire chip is expected to increase dramatically [1]. These trends indicate that, to ensure correct operation of systems, microprocessors and memories must employ reliability techniques.

Fault tolerance is a key technology for improving the reliability and availability of computer systems. In computing environments with specific reliability requirements, hardware fault tolerance techniques have traditionally been used to tackle transient faults. Hardware implementations have added 20–30% additional logic to add redundancy to mainframe processors and cover upward of 200,000 latches [4,5]. Moreover, the design and verification of new redundant hardware are costly and may not be feasible under cost-sensitive requirements.


In recent years, soft errors have emerged as a critical concern and as one of the main factors affecting the OS and critical applications. Software-based fault tolerance techniques are an attractive solution for improving system reliability. While software techniques cannot provide a level of reliability comparable to hardware techniques, they significantly lower costs (zero hardware design cost) and are very flexible in deployment [1].

The Inspur K2 server (see Fig. 1) is based on the ccNUMA (Cache Coherent Non-Uniform Memory Access) architecture and runs Inspur K-UX, whose kernel is Linux 2.6.28.10 and which has obtained UNIX03 certification. The server contains four nodes, each of which contains four Itanium CPUs, 256 GB of memory, an NC (Node Controller) and an I/O Controller. The NC connects a node to the NR (Node Router) using QPI (QuickPath Interconnect), and nodes communicate with each other through the NR. Furthermore, two monitor boards, based on Linux and ARM, connect to all nodes to perform out-of-band monitoring and control. Building on software transient fault tolerance techniques, using the redundant ccNUMA resources is a promising way to obtain transient fault tolerance support.

Existing software transient fault tolerance approaches use the compiler to insert redundant instructions for checking computation [6], control flow [7], or both [8]. The execution of the inserted instructions and assertions decreases performance. A compiler approach also requires recompilation of all applications. It is inconvenient to recompile all applications and libraries, and the source code for legacy programs is often unavailable.

Fig. 1. K2 server architecture.


Shye et al. presented process-level redundancy (PLR) [1], a software-implemented technique for transient fault tolerance that is closely related to our work. However, PLR is based on SMP and cannot be applied to the K2 server directly, and there has been no validation of PLR with a real application.

This paper presents a novel software transient fault tolerance method, a dual process redundancy approach. The method builds spatial redundancy on a practical, relatively scalable ccNUMA architecture with redundant hardware by modifying a general Linux kernel to embed fault tolerance. The system creates a redundant copy of the application process and makes the pair of fault-tolerant processes run on different processors; it ensures that the fault-tolerant processes follow the same execution logic by means of synchronization. The system then detects errors by comparing the consistency of state and data while the fault-tolerant processes run, handles errors in time, and finishes recovery rapidly, so that the system remains available when a key process fails. Compared with the traditional fault-tolerant mode, the proposed system-level software fault-tolerant approach mainly places the fault-tolerant functions in the operating system kernel; it avoids the complexity of hardware customization while remaining transparent to the application.

This paper builds on our previous work [9]; the differences are: (i) the software-based transient fault tolerance system for the ccNUMA server is described in detail; (ii) combined with an analysis of related work, the methodology and advantages of the paper are highlighted; (iii) a synchronization mechanism for shared memory access in the dual process redundancy system is proposed; (iv) a CBR-based fault detection and error recovery mechanism is presented; and (v) more experiments are given to validate the effectiveness of the system.

This paper makes the following contributions:

• It presents a dual process redundancy approach to transient fault tolerance for the ccNUMA K2 server. It is a software-based paradigm that uses the redundant hardware resources to obtain fault tolerance.

• It constructs a practical prototype system that operates transparently to the specified applications. The system is validated using a bank transaction processing application.

• An error detection and recovery mechanism is proposed that is based on dual modular process redundancy rather than triple modular redundancy.

• A CBR-based fault detection and error recovery mechanism is given to ease the overhead of software-based fault tolerance.

The rest of this paper is organized as follows. Section 2 reviews related work, and Section 3 gives background on faults, errors, and fault tolerance computing. Section 4 presents the dual process redundancy fault tolerance approach. Section 5 describes the synchronization mechanism of the redundant processes, Section 6 describes the design and implementation of error detection and handling, and Section 7 presents the CBR-based fault diagnosis. Section 8 shows results from the experiments and the real applications. Section 9 concludes the paper.

2. Related works

Previous work on software-based fault tolerance includes control flow checking [10,7], algorithm-based fault tolerance [11], heuristic rules [12,13], assertions [14], process and instruction duplication [1,6,8], and source-to-source translation [15]. These methods range from specialized to general. A major obstacle to software-based redundancy is its high overhead in terms of code size and execution time; these methods typically require around 100% performance overhead. Software-based redundancy techniques such as PLR [1], SWIFT [8], or SRMT [16] typically increase processor utilization. The problem is further complicated if triple modular redundancy is used to diagnose the error position. This paper addresses these problems as follows: (i) the system does not need triple modular redundancy to aid error handling; the original protected process and its peer redundant process are deployed on different CPU nodes, and when the comparison technique detects an error, the transaction processing of the application layer guarantees that the application keeps running properly; and (ii) a CBR-based fault diagnosis is presented to assist the system in locating CPU and memory faults.

3. Background

3.1. Failure, fault, and error

When developing a concrete fault tolerance system, the terms failure, fault, and error have different meanings. A failure denotes an element's inability to perform its designed function because of errors in the element or its environment, which in turn are caused by various faults [17]. A fault is an anomalous physical condition. Causes include design errors, such as mistakes in system specification or implementation; manufacturing problems; damage, fatigue, or other deterioration; and external disturbances, such as harsh environmental conditions, electromagnetic interference, ionizing radiation, unanticipated inputs, or system misuse. Faults resulting from design errors and external factors are especially difficult to model and protect against, because their occurrences and effects are hard to predict. An error is a manifestation of a fault in a system, in which the logical state of an element differs from its intended value. A fault in a system does not necessarily result in an error. An error occurs only when a fault is "sensitized"; in other words, for a particular system state and input excitation, an incorrect next state and/or output results. A fault is referred to as latent if it has not yet been sensitized in the system. The term soft is often applied to errors that persist after the originating fault disappears. Once corrected, soft errors usually leave no damage in the system.

Fig. 2. System architecture.

3.2. Fault tolerance computing

Fault-tolerant computing is the art and science of building computing systems that continue to operate satisfactorily in the presence of faults [18]. A fault-tolerant system may be able to tolerate one or more fault types, including: (i) transient, intermittent or permanent hardware faults; (ii) software and hardware design errors; (iii) operator errors; or (iv) externally induced upsets or physical damage.

The majority of fault-tolerant designs [19,20] have been directed toward building computers that automatically recover from random faults occurring in hardware components. The techniques employed to do this generally involve partitioning a computing system into modules that act as fault-containment regions. Each module is backed up with protective redundancy so that, if the module fails, others can assume its function. Special mechanisms are added to detect errors and implement recovery. Two general approaches to hardware fault recovery have been used: fault masking and dynamic recovery.

Efforts to attain software that can tolerate software design faults have made use of static and dynamic redundancy approaches similar to those used for hardware faults. One such approach, N-version programming, uses static redundancy in the form of independently written programs (versions) that perform the same functions, and their outputs are voted on at special checkpoints. Here, of course, the data being voted on may not be exactly the same, and a criterion must be used to identify and reject faulty versions and to determine a consistent value that all good versions can use. An alternative dynamic approach is based on the concept of recovery blocks: programs are partitioned into blocks, and acceptance tests are executed after each block. If an acceptance test fails, a redundant code block is executed.
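To make the recovery-block idea concrete, the following minimal C sketch runs a primary block, applies an acceptance test, and falls back to an independently written alternate when the test fails; the square-root example and all names are illustrative, not taken from any system discussed here.

#include <stdio.h>

/* Independently written primary and alternate blocks computing a square root.
   Both implementations are placeholders for this illustration. */
static double sqrt_primary(double x)
{
    return x > 0 ? x / 2.0 : 0.0;          /* deliberately crude: may fail the test */
}

static double sqrt_alternate(double x)
{
    double r = x;                           /* Newton iteration as the backup version */
    for (int i = 0; i < 60; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Acceptance test executed after each block. */
static int acceptable(double x, double r)
{
    double e = r * r - x;
    return e > -1e-6 && e < 1e-6;
}

int main(void)
{
    double x = 2.0;
    double r = sqrt_primary(x);             /* run the primary block */
    if (!acceptable(x, r))                  /* acceptance test failed ... */
        r = sqrt_alternate(x);              /* ... so run the redundant block */
    printf("sqrt(%g) = %g\n", x, r);
    return 0;
}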

Process-level redundancy is mainly used for tolerating transient faults [21]. A transient fault will eventually disappear without any apparent intervention. Transient faults are less severe but harder to diagnose and handle. They are caused by the temporary malfunction of some system component or by environmental interference. Transient faults are emerging as a critical concern in the reliability of distributed systems. Hardware-based fault tolerance is very costly, hence software-based fault tolerance is used to handle transient faults. PLR [1] is a software-based technique for transient fault tolerance that leverages multiple cores for low overhead. It systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to schedule the processes freely across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. After all processes have completed, the output files are compared. If all files match, the redundant output files are removed and the output file is stored under its original name. If the files do not match, a majority rule is applied to determine the correct output, as in triple modular redundancy.

4. Dual process redundancy fault tolerance

4.1. Application specified fault tolerance

The concept of the sphere of replication (SoR) [22] was originally used for defining the boundary of reliability in redundant hardware design. The SoR describes a technique's logical domain of redundancy and specifies the boundary for fault detection and containment. Any data that enters the SoR is replicated, and all execution within the SoR is redundant in some form. Before leaving the SoR, all output data are compared to ensure correctness. Execution outside of the SoR is not covered by the particular transient fault technique and must be protected by other means. Faults are contained within the SoR boundaries and detected in any data leaving the SoR.

The PLR SoR [1] is placed around the user-space application and libraries. It acts exactly like the hardware SoR except that it acts on the software instead of the hardware. Again, all inputs are replicated, execution within the SoR is redundant, and data leaving the SoR is compared. PLR aims at a software fault tolerance approach for a general-purpose computing domain; it is much more an idealized theoretical model than a practical implementation, and to date there has been no real application implementation of it.

The K2 ccNUMA server is built for bank transaction processing. The design principle of dual process redundancy fault tolerance is to make sure that the protected process does not generate an erroneous result due to a transient fault. The original protected process and its peer redundant process are deployed on different CPU nodes; when the comparison technique detects an error, the transaction processing guarantees that the application keeps running properly.

4.2. System overview

The system architecture, shown in Fig. 2, contains the ccNUMA computer system hardware layer, the fault-tolerant Linux-kernel-based operating system layer, and the application layer. The hardware layer, the ccNUMA processor architecture, provides the basic redundant hardware for the fault-tolerant system and is the operating platform for the operating system and the redundant processes; the system layer provides an efficient redundancy process management module and an error detection and processing module; the application layer mainly provides the interface between the fault-tolerant system and the user.

The error control of the fault tolerance container includes fault-tolerant redundancy process management and control, and error detection and the appropriate treatment. Redundancy process management includes redundant process forking, redundant process scheduling, redundant process I/O fault tolerance and synchronization, and the error detection and handling mechanism.

Fig. 3. Synchronization mechanism.

The system provides the following features:

Transparency: Dual process redundancy is transparent to the user and the upper application. The original application creates a redundant process at start-up, but the user-mode semantics of the process are preserved (the same user code runs); the application does not need to know the specific redundancy mechanisms and does not need any code changes or recompilation.

System-level implementation: Dual process redundancy is achieved at the system level by extending the Linux operating system kernel with the corresponding embedded fault tolerance features, including redundant process forking, synchronization, scheduling and communication, as well as the error detection and handling mechanisms.

Software-based: Fault tolerance is achieved entirely through software. The two redundant processes have different user address spaces and separate execution contexts, run on the same C library and operating system, and execute all user-level instructions redundantly. All error detection is done in kernel mode, which also makes it easier to determine whether an error will actually affect the process, i.e., to identify benign errors.

Redundant process fork mechanism: Because dual process redundancy is implemented in kernel mode, the application does not need to create the redundant process by calling fork explicitly. The mechanism only needs to extend the kernel function do_fork; combined with the corresponding redundancy flag, each fork generates two equal redundant processes while maintaining the original semantics of fork.
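The paper implements this inside do_fork in the kernel; as a rough user-space analogue only (not the authors' code), the following C sketch forks a redundant twin and pins the two copies to different CPUs with sched_setaffinity, mirroring the placement of the process pair on different processors:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Pin the calling process to one CPU, as the kernel scheduler extension
   does for each member of a redundant pair. */
static void bind_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    pid_t twin = fork();                    /* create the redundant twin */
    if (twin < 0) {
        perror("fork");
        return 1;
    }

    /* Original copy on CPU 0, redundant copy on CPU 1, so the pair never
       shares a processor. */
    bind_to_cpu(twin == 0 ? 1 : 0);

    printf("pid %d runs as the %s copy\n", getpid(),
           twin == 0 ? "redundant" : "original");
    /* ... both copies now execute the same application code ... */
    return 0;
}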

5. Synchronization mechanism

The synchronization mechanism of the redundant processes guarantees that the redundant processes execute the same functions and the same procedure, and it is a necessary part of the system. Synchronization points are set at system call positions. A system call is the interface between the application layer and the hardware devices, provided by the operating system kernel; the key operations in the execution of a process, such as I/O-related operations (reading and writing files, screen printing, sending and receiving network data, etc.), all require system calls. Redundant processes loading the same application code go through the same system calls.

Although system calls provide a relatively coarse synchronization granularity, all critical operations in the execution of a redundant process go through kernel functions via system calls. If the redundant processes reach a synchronization point at roughly the same time (time synchronization), synchronization is ensured under a relatively loose condition. Therefore, if synchronization of the redundant processes can be ensured at the critical system calls, the two executions remain synchronized, and the synchronization points set at system calls also provide good support for data comparison and error detection.

The synchronization is shown in Fig. 3, where redundant processes P1 and P2 execute through synchronization points. The execution speeds of P1 and P2 can be unpredictable, e.g., P1 may be faster than P2, but when they enter a system call through a synchronization point, P1 and P2 are synchronized, and the two processes leave the system call at the same position. The redundant processes are thus effectively rate-limited and maintain basically the same speed.
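The kernel-level synchronization point is not given as code in the paper; the following user-space C sketch merely illustrates the idea with a two-process rendezvous placed in front of a write() call, using a pair of process-shared POSIX semaphores (all structure and function names here are assumptions for the example):

#define _GNU_SOURCE
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* A synchronization point shared by the two copies: each side posts its own
   semaphore and waits on the peer's, so neither copy passes the point alone. */
struct sync_point {
    sem_t original_here;
    sem_t redundant_here;
};

static void rendezvous(sem_t *announce, sem_t *wait_for_peer)
{
    sem_post(announce);
    sem_wait(wait_for_peer);
}

int main(void)
{
    struct sync_point *sp = mmap(NULL, sizeof(*sp), PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    sem_init(&sp->original_here, 1, 0);     /* pshared = 1: visible to both copies */
    sem_init(&sp->redundant_here, 1, 0);

    pid_t twin = fork();
    const char msg[] = "synchronized write\n";

    /* Synchronization point placed in front of the write() system call. */
    if (twin == 0)
        rendezvous(&sp->redundant_here, &sp->original_here);
    else
        rendezvous(&sp->original_here, &sp->redundant_here);

    write(STDOUT_FILENO, msg, sizeof(msg) - 1);  /* both copies reach the call together */
    return 0;
}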

A synchronization mechanism for the communication needs of dual process redundancy based on shared memory and signals is presented. Based on an analysis of the shared memory and signal system calls, strategies are designed for system call layer synchronization, kernel layer synchronization, and dynamic process execution. The system call layer synchronization strategy synchronizes the states of the dual processes at the shared memory and signal system calls; according to an analysis of the nature of each system call's operation, four different synchronizations are designed. The kernel layer synchronization strategy targets the way processes access memory from the kernel and ensures that shared memory access operations of the processes are executed correctly. The dynamic execution of the dual processes is designed to ease imbalances in processor workload between the redundant processes. Combined with the design of the synchronous execution of system calls, the synchronization operations for signals and shared memory are implemented at the system call level; at the same time, according to the fault-tolerant processing of shared memory access in the kernel, the shared memory kernel layer synchronization and the dual process dynamic execution are achieved.

6. Error detection and handling

Error detection and handling are designed and implemented in the Linux kernel layer. Error handling in kernel mode can mask the impact of errors on user-level program code and thus ensure timely error handling and the correctness of application execution. Specifically, detection points are implanted in the system calls to complete error detection and the transmission of error information, and the corresponding kernel functions parse the error messages and handle the errors.

Error detection: During the execution of the redundant processes, the system detects an abnormal execution flow or state of the peer process to find transient CPU and memory errors, using time-out detection, peer process state testing, and data comparison mechanisms. The basic idea is that, within redundant process execution, the error detection module injected at the system call synchronization points finds time-out errors, polling detects process state exceptions, and comparison of the user data buffers passed into system calls by the redundant processes completes the intermediate data detection.
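As a sketch of the two checks performed at such a synchronization point, the following C fragment, which is an assumption about the shape of the check rather than the actual kernel code, waits for the peer with a deadline and then compares the two user buffers handed to the system call:

#define _GNU_SOURCE
#include <errno.h>
#include <semaphore.h>
#include <string.h>
#include <time.h>

enum check_result { CHECK_OK, CHECK_TIMEOUT, CHECK_MISMATCH };

/* Hypothetical check run at a write() synchronization point: wait for the
   peer with a deadline, then compare the two user buffers passed to the call. */
enum check_result check_peer(sem_t *peer_arrived,
                             const void *my_buf, const void *peer_buf,
                             size_t len, int timeout_sec)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    /* Time-out detection: the peer never reached the synchronization point. */
    if (sem_timedwait(peer_arrived, &deadline) != 0 && errno == ETIMEDOUT)
        return CHECK_TIMEOUT;

    /* Intermediate data detection: the buffers handed to write() must match. */
    if (memcmp(my_buf, peer_buf, len) != 0)
        return CHECK_MISMATCH;

    return CHECK_OK;
}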

Error type diagnosis: Error type diagnosis includes the determination of the process type and the error type. Process type diagnosis is used in the multi-process environment, combined with the characteristics of applications under Linux. By analyzing the relationships between processes, processes are divided into two kinds, monitoring processes and child processes, which belong to the same process group (the same task); the monitoring process is the head of the group, and a child process is a process without children. Process type determination provides process-level granularity for error handling. Error type diagnosis divides errors into permanent and transient errors. Accurate diagnosis of permanent faults needs to build on the corresponding hardware failure detection mechanism, i.e., the analysis of redundant process execution errors is combined with the corresponding fault management mechanism. Error type determination makes the error handling mechanisms more complete and reasonable; for example, when a permanent CPU fault is detected, the system can use a CPU-level drop-down instead of a process-level drop-down, and it helps to judge which process is in error when the intermediate data of the dual processes are inconsistent.

Fig. 4. CBR-based fault diagnosis.

Error handling: Error handling adopts different approaches according to the error flag, the type of the erroneous process, and the error type, in order to complete error handling and recovery of the redundant processes; the approaches include process rebuilding and process drop-down. Redundant process rebuilding refers to restarting the entire application service automatically; it mainly handles errors that are difficult to judge or to recover from, and errors for which a process drop-down would affect the state of the application services. Data inconsistency is a typical error that is difficult to judge: because dual process redundancy has no voting mechanism, it cannot determine which process is in error. In addition, because the monitoring process affects the state of the whole application service, dropping the monitoring process would leave the application no longer running in a redundant state, so an anomaly of the monitoring process is defined as an error that is difficult to recover from. Redundant process drop-down refers to stopping or terminating the erroneous redundant process and continuing the service with the normal process; the drop-down operation ensures service continuity when an error occurs.
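This policy can be condensed into a small dispatch; the following C sketch is our own summary of it (the enum and function names are invented), not code from the system:

#include <stdio.h>

/* Condensed decision policy: errors that cannot be attributed or safely
   recovered trigger a service rebuild; otherwise only the faulty copy is
   dropped and its healthy peer continues the service. */
enum err_type  { ERR_TIMEOUT, ERR_STATE_EXCEPTION, ERR_DATA_MISMATCH };
enum proc_type { PROC_MONITOR, PROC_CHILD };
enum ft_action { ACT_DROP_DOWN, ACT_REBUILD };

static enum ft_action decide(enum err_type err, enum proc_type proc)
{
    if (err == ERR_DATA_MISMATCH)   /* no voter: cannot tell which copy is wrong */
        return ACT_REBUILD;
    if (proc == PROC_MONITOR)       /* losing the monitor loses the whole service */
        return ACT_REBUILD;
    return ACT_DROP_DOWN;           /* drop the faulty child, keep its peer */
}

int main(void)
{
    printf("child time-out     -> %s\n",
           decide(ERR_TIMEOUT, PROC_CHILD) == ACT_REBUILD ? "rebuild" : "drop down");
    printf("monitor exception  -> %s\n",
           decide(ERR_STATE_EXCEPTION, PROC_MONITOR) == ACT_REBUILD ? "rebuild" : "drop down");
    printf("data mismatch      -> %s\n",
           decide(ERR_DATA_MISMATCH, PROC_CHILD) == ACT_REBUILD ? "rebuild" : "drop down");
    return 0;
}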

7. CBR-based fault diagnosis

Based on artificial intelligence error diagnosis methods, a fault diagnosis method applied to the dual process redundancy system is designed and implemented as a supplement to the fault-tolerant system's error handling. Fault diagnosis includes three sub-modules: the compiler, the diagnostic engine, and the reasoning machine. The compiler sub-module compiles the fault propagation rules, which are designed based on the hardware manuals, into a fault tree; the diagnostic engine sub-module converts fault tree files into the program's source-code structures, manages cases, and records the system log; the reasoning machine module implements case matching, failure logical reasoning, fault location, and logging.

7.1. Case-based reasoning

Case-based reasoning (CBR) is an approach to learning and problem solving based on past experience [23,24]. Past experience is stored in the form of solved problems ("cases") in the so-called case base. A new problem is solved by adapting the solutions of known similar problems to the new problem. To solve a new problem, a query is submitted to a CBR system to retrieve the solutions of the most similar problems/cases in the case base. The query is considered a potential new case. In the classical view, a case consists of a problem description and a described solution to this problem.
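The retrieval step can be as simple as nearest-neighbour matching over a feature vector describing the problem; the following self-contained C sketch, with an invented toy case base rather than the system's real cases, shows the idea:

#include <stdio.h>

#define NFEAT 3

/* A case pairs a problem description (feature vector) with a known solution. */
struct fault_case {
    double feat[NFEAT];
    const char *diagnosis;
};

/* Retrieval: return the stored case closest to the query (squared Euclidean
   distance), i.e., the most similar past problem. */
static const struct fault_case *retrieve(const struct fault_case *base, int n,
                                         const double query[NFEAT])
{
    const struct fault_case *best = &base[0];
    double best_d = 1e300;
    for (int i = 0; i < n; i++) {
        double d = 0.0;
        for (int j = 0; j < NFEAT; j++) {
            double diff = base[i].feat[j] - query[j];
            d += diff * diff;
        }
        if (d < best_d) {
            best_d = d;
            best = &base[i];
        }
    }
    return best;
}

int main(void)
{
    /* Toy case base; the real features and solutions would come from the
       error events and fault tree, which are not reproduced here. */
    struct fault_case base[] = {
        { {1.0, 0.0, 0.0}, "transient CPU register error" },
        { {0.0, 1.0, 0.0}, "transient single-bit memory error" },
        { {0.0, 1.0, 1.0}, "permanent multi-bit memory error" },
    };
    double query[NFEAT] = {0.1, 0.9, 0.2};   /* new, unseen error event */

    printf("closest case: %s\n", retrieve(base, 3, query)->diagnosis);
    return 0;
}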

7.2. CBR-based fault diagnosis

In order to assist the error detection module in diagnosing faults, the paper presents a CBR-based fault diagnosis subsystem (Fig. 4). The subsystem borrows the fault management idea from the Solaris FMA (Fault Management Architecture): (i) the error probe gathers CPU and memory states using the Intel Machine Check Architecture (MCA) and Memory Controller Hub (MCH) mechanisms; (ii) the error event generated by the error probe is passed to the diagnosis engine; (iii) based on the fault tree (.eft files), the diagnosis engine converts the error event into a format that can be recognized by the CBR reasoning machine; (iv) the reasoning machine conducts reasoning according to the cases; and (v) the diagnosis results are passed to the dual process redundancy system.

8. Experimental results

The experiment environment is an Intel Itanium2 CPU at 1.3 GHz (32 KB L1 cache, 256 KB L2 cache, 1536 KB L3 cache) and 1 GB of memory. The typical network applications Apache Web server 2.2.14 and Tomcat application server 6.0.20 are used to validate the system; keeping these servers running safely is very important [25]. The following uses the Apache Web server test as an example.

8.1. Fault injection

The testing mainly aims to discover and handle transient CPU or memory errors in software; whether a transient error will affect a specific process, and whether it will affect the process's execution flow and results, is unknown in advance. In this paper, the fault injection method targets a specific process [26]: specific data in the process's address space are selected and changed to simulate transient CPU or memory errors.

Simulating a transient memory error: Select data passed to the write system call from the user data buffer and randomly change some of its bits. Changing the user-state data has the same effect as directly changing the page buffer cache, i.e., the page frame marked as a dirty data page that needs to be written to the I/O device; since the physical memory page frame is the basic management unit, changing the frame's content simulates a transient memory error.
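The paper performs the bit flip inside the kernel on the buffered page; the following user-space C illustration of the same single-bit upset on a write buffer (the buffer contents are invented) conveys the effect:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Flip one randomly chosen bit in a buffer, emulating a single-event upset in
   the data a process is about to hand to the write() system call. */
static void inject_bit_flip(unsigned char *buf, size_t len)
{
    size_t byte = (size_t)rand() % len;
    int bit = rand() % 8;
    buf[byte] ^= (unsigned char)(1u << bit);
}

int main(void)
{
    char buf[] = "transfer 100.00 to account 42";   /* invented example payload */

    srand((unsigned)time(NULL));
    inject_bit_flip((unsigned char *)buf, strlen(buf));

    /* The peer process still holds the original data, so the comparison at the
       next synchronization point will flag the mismatch. */
    printf("corrupted buffer: %s\n", buf);
    return 0;
}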

Simulating a transient CPU register error: A SIGSTOP signal is sent to stop a running process; the thread_struct or pt_regs structure is then changed (the value of eip and other fields), and a SIGCONT signal is sent to resume execution. When the process resumes, the changed data are loaded into the CPU registers, thereby simulating a transient register error.
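A rough user-space analogue on x86-64, rather than the in-kernel Itanium pt_regs manipulation the paper uses, can be written with ptrace, which likewise stops the victim, rewrites its saved instruction pointer, and resumes it:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* Attach to a victim process, flip one bit of its saved instruction pointer,
   and resume it. Error checking is omitted for brevity. */
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);
    struct user_regs_struct regs;

    ptrace(PTRACE_ATTACH, pid, NULL, NULL);     /* stops the victim */
    waitpid(pid, NULL, 0);

    ptrace(PTRACE_GETREGS, pid, NULL, &regs);   /* read the saved registers */
    regs.rip ^= 1ULL << 3;                      /* inject a single-bit error */
    ptrace(PTRACE_SETREGS, pid, NULL, &regs);

    ptrace(PTRACE_DETACH, pid, NULL, NULL);     /* resume with corrupted state */
    return 0;
}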

Fig. 5. Fault tolerance process information. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)


8.2. System function testing

The test starts the Apache web service in fault-tolerant mode, and the homepage of the National High Performance Computing Center (Xi'an) is put on the server for external access. Fig. 5 displays the system's monitoring and management of process information. According to the inter-process hierarchy, processes 9870, 9873, 9875, 9877, 9878, and 9881 form one set of httpd service processes, and processes 9871, 9872, 9874, 9876, 9879 and 9880 form another set. The pairing relationship between peer httpd server processes can be obtained by viewing the monitor; to distinguish master and slave processes, the main process is marked in red.

From the above, the 9871 process set is the master httpd service process set, bound to CPU 0, and the 9870 process set is the slave httpd service process set, bound to CPU 1. Processes 9871 and 9870 are the Apache service monitoring processes, and the remaining processes are their service children. By contrasting Pid and Twin_pid, it can be seen that, for both the service monitoring processes and the service child processes, the system ensures that processes appear as redundant process pairs in the fault-tolerant system, running consistently and bound to different CPUs, which shows that fault-tolerant process forking and scheduling are successful. When the Apache web service is started in fault-tolerant mode, the client can access the homepage properly.

In summary, the Apache server can start and run in fault-tolerant mode, the fault-tolerant processes run consistently, and a good external web service is provided. Since the Apache Web server involves a great deal of network traffic, starting it in fault-tolerant mode and providing its service exercises every aspect of fault-tolerant process forking, scheduling, synchronization and communication, so its normal execution is strong evidence that the fault-tolerant process management achieves the system's functional requirements.

8.3. CPU fault tolerance testing

The CPU transient error injection kernel module cpu_tf.ko is installed on the server running Apache. The module selects a running redundant process and, while suspending and waking up the process, changes its saved register information, triggering an abnormality in the process's execution flow; this tests the detection and handling mechanisms for time-outs and abnormal peer states. Because the error injection is random, mainly in the choice of injection point, and the simulated CPU transient error randomly changes a bit of the corresponding register, the injection does not necessarily disturb the execution flow or change the process state, and may not trigger the peer process's detection of an abnormal state or a time-out error. Therefore, the following test results show the behaviour when the injected transient CPU error does produce an abnormality in the process execution flow.

Fig. 6 shows the time-out error detection and handling process for redundant child process 3895. First, the kernel module cpu_tf.ko chooses child process 3895 and injects a transient CPU register bit error by flipping the third bit of the ip register, making the execution flow of process 3895 abnormal. Its peer process 3896 then times out waiting at the fcntl64 system call synchronization point and is handled by the drop-down mechanism. During this processing, process 3896 write-protects the shared memory, then generates a page fault exception when writing to the address 0xb7fc6034, calls the exception handling function do_page_fault, and allocates a new page frame by copy-on-write. Because process 3896 is the slave process, it needs to call ft_copy_files to inherit the open-file information of the main process and, by changing the flag ft_mark, redo sys_fcntl64 in single-process mode. Finally, process 3896 completes the service and re-enters the accept system call to accept new connections, and the erroneous redundant processes 3895 and 3896 are killed.

Fig. 6. Kernel log parsing of child process drop-down.

Fig. 7. Performance comparison of process with FT and process without FT.

Although the system's response speed is lower than that of the non-fault-tolerant system, the maximum performance loss is about 19% and the average performance loss is about 15% (Fig. 7). The performance loss mainly comes from the master–slave synchronization process, i.e., from the allocation of buffers, data replication, and comparison operations during synchronization. For the current system, the additional time taken by the fault-tolerant operations themselves is very small, and the performance loss does not amount to an order-of-magnitude difference. Taking the functionality and performance testing of the Apache Web server and the Tomcat application server together, the fault tolerance system provides good support and meets the functional and performance requirements of these two applications.

9. Conclusions

In this paper, the 863 major project "Inspur Tiansuo high-end fault-tolerant server" is used as the background to study the availability of critical applications, with fault tolerance achieved through process-level redundancy, focusing on the redundant process synchronization mechanism and the redundant process error detection and handling techniques for the ccNUMA architecture. The dual process redundancy system and the process error detection and handling are designed and implemented in the Linux kernel. With error detection and error processing as the main goals, time-outs and related methods are used to detect the abnormal execution flow or state of the redundant processes caused by transient errors in memory and CPU; combined with the appropriate error type diagnosis and error handling mechanisms, this ensures the service continuity of the redundant tasks and the stability of subsequent runs, in order to improve application reliability and availability.

Through the fault-tolerant deployment of the Apache Web server and the Tomcat application server, the effectiveness of the dual process fault tolerance system is validated. Compared with the traditional fault tolerance approach, this method not only effectively avoids the complexity of a hardware implementation, but also remains transparent to the application. The proposed fault-tolerant system and processes enrich system-level software fault tolerance technology, and also provide a technical reference and support for the development of software fault-tolerant systems in high-end fault-tolerant computer projects.

Acknowledgments

This work was supported by the project of the National Key Technology R&D Program (Grant no. 2011BAH04B03), the NSFC project (Grant no. 61173039) and the State 863 project of China (Grant no. 2008AA01A202). The authors wish to thank Qi Dong, Peng Yi and Tao Wu for the useful discussions and experimental work.

References

[1] A. Shye, J. Blomstedt, T. Moseley, V.J. Reddi, D.A. Connors, PLR: a software approach to transient fault tolerance for multi-core architectures, IEEE Trans. Dependable Secure Comput. (TDSC) 6 (2009) 135–148.

[2] Scott Hareland, Jose Maiz, Mohsen Alavi, Kaizad Mistry, Steve Walsta, Changhong Dai, Impact of CMOS scaling and SOI on soft error rates of logic processes, VLSI Technology Digest of Technical Papers, Technical Report, 2001.

[3] Tanay Karnik, Bradley Bloechel, K. Soumyanath, Vivek De, Shekhar Borkar, Scaling trends of cosmic rays induced soft errors in static latches beyond 0.18, VLSI Circuit Digest of Technical Papers, Technical Report, 2001.

[4] T.S., et al., IBM's S/390 G5 microprocessor design, IEEE Micro 19 (1999) 12–23.

[5] H.A., et al., A 1.3 GHz fifth generation SPARC64 microprocessor, in: Proceedings of the 40th Conference on Design Automation (DAC '03), 2003.

[6] Nahmsuk Oh, Philip P. Shirvani, Edward J. McCluskey, Error detection by duplicated instructions in super-scalar processors, IEEE Trans. Reliab. 51 (1) (2002) 63–75, http://dx.doi.org/10.1109/24.994913.

[7] Nahmsuk Oh, Philip P. Shirvani, Edward J. McCluskey, Control-flow checking by software signatures, IEEE Trans. Reliab. 51 (2) (2002) 111–122, http://dx.doi.org/10.1109/24.994926.

[8] G.A. Reis, J. Chang, N. Vachharajani, R. Rangan, D.I. August, SWIFT: software implemented fault tolerance, in: Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2005.

[9] X. Zhang, E. Wang, F. Tang, M. Yang, H. Wei, X. Dong, Transient fault tolerance for ccNUMA architecture, in: Proceedings of the 6th International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS-2012), Palermo, Italy, 2012.

[10] S. Yau, F.-C. Chen, An approach to concurrent control flow checking, IEEE Trans. Softw. Eng. SE-6 (1980) 126–137.


[11] K.-H. Huang, J. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput. C-33 (1984) 518–528.

[12] A. Benso, S. Chiusano, P. Prinetto, L. Tagliaferri, A C/C++ source-to-source compiler for dependable applications, in: International Conference on Dependable Systems and Networks (DSN), 2002.

[13] M. Rebaudengo, M.S. Reorda, M. Violante, M. Torchiano, A source to source compiler for generating dependable software, in: First IEEE International Workshop on Source Code Analysis and Manipulation, 2001.

[14] M. Rela, H. Madeira, J. Silva, Experimental evaluation of the fail-silent behaviour in programs with consistency checks, in: Proceedings of the Annual Symposium on Fault Tolerant Computing, 1996.

[15] J. Lidman, D.J. Quinlan, C. Liao, S.A. McKee, ROSE::FTTransform—a source-to-source translation framework for exascale fault-tolerance research, in: IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), 2012.

[16] C. Wang, H.-s. Kim, Y. Wu, V. Ying, Compiler-managed software-based redundant multi-threading for transient fault detection, in: International Symposium on Code Generation and Optimization (CGO '07), 2007.

[17] V.P. Nelson, Fault-tolerant computing: fundamental concepts, IEEE Comput. 23 (1990) 19–25.

[18] D.A. Rennels, Fault-tolerant computing, Encycl. Comput. Sci. 4 (1999) 698–702.

[19] S.K. Mak, P.-F. Sum, C.-S. Leung, Regularizers for fault tolerant multilayer feedforward networks, Neurocomputing 74 (2011) 2028–2040.

[20] J. Pajarinen, J. Peltonen, M. Uusitalo, Fault tolerant machine learning for nanoscale cognitive radio, Neurocomputing 74 (2011) 753–764.

[21] I.T. Sanjay Bansal, Sanjeev Sharma, A detailed review of fault-tolerance techniques in distributed system, Int. J. Internet Distrib. Comput. Syst. 1 (2011) 33–39.

[22] S.K. Reinhardt, S.S. Mukherjee, Transient fault detection via simultaneous multithreading, in: Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA), 2000.

[23] K.D. Althoff, Case-based reasoning, Handbook of Software Engineering and Knowledge Engineering, vol. 1, 2002, pp. 549–588.

[24] L. Ogiela, M.R. Ogiela, Advances in Cognitive Information Systems, Cognitive Systems Monographs, vol. 17, Springer-Verlag, Berlin-Heidelberg, 2012.

[25] B. Choi, K. Cho, Detection of insider attacks to the web server, JoWUA 3 (4) (2012) 35–45.

[26] S. Kim, J. Park, K. Lee, I. You, K. Yim, A brief survey on rootkit techniques in malicious codes, JISIS 2 (3/4) (2012) 134–147.

Xingjun Zhang received his Ph.D. degree in Computer Architecture from Xi'an Jiaotong University (XJTU), Xi'an, China. From 1999 to 2005, he was a Lecturer and then Associate Professor in the Department of Computer Science & Technology of Xi'an Jiaotong University. From February 2006 to January 2009, he was a Research Fellow in the Department of Electronic Engineering of Aston University, United Kingdom. He is currently an Associate Professor in the Department of Computer Science & Technology of Xi'an Jiaotong University. His interests include high performance computer architecture, new technologies of computer networks, and high performance computing.

Endong Wang received his B.S. and M.S. from Tsinghua University in 1988 and 1991, respectively. He is now the Vice President of Inspur Group. His research interests include high performance computing, computer architecture, parallel and distributed processing, and microprocessor architecture.

Feilong Tang received his Ph.D. degree in Computer Science and Technology from Shanghai Jiao Tong University (SJTU) in 2005. He was a JSPS (Japan Society for the Promotion of Science) Postdoctoral Research Fellow. Currently, he works with the School of Software of SJTU, China. His research interests include cognitive and sensor networks, protocol design and performance analysis for communication networks, and pervasive and cloud computing. He works as Principal Investigator of many projects, such as National Natural Science Foundation of China (NSFC) and National High-Tech R&D Program (863 Program) of China projects. He has served as program co-chair, co-editor and track chair for more than 15 international conferences (workshops).

Meishun Yang received his B.S. from Xi'an Jiaotong University, China, in 1981. He is an associate professor in the Department of Computer Science and Technology of Xi'an Jiaotong University. His current research interests are in high performance computing and system-on-chip technology.

Hengyi Wei received his B.S. from Xi'an Jiaotong University, China, in 1981. He is an associate professor in the Department of Computer Science and Technology of Xi'an Jiaotong University. His current research interests are in high performance computer architecture and software engineering.

Xiaoshe Dong received his B.S. and M.S. from Xi'an Jiaotong University, China, in 1987 and 1990, respectively, and his Ph.D. in Computer Science from Keio University, Japan, in 1999. He is a professor in the Department of Computer Science and Technology of Xi'an Jiaotong University, and is also the director of the National High Performance Computing Center (Xi'an) in China. His current research interests are in high performance cluster computing, grid computing, and system-on-chip technology.