
A Solution for Fault-Tolerance Based on Adaptive Replication in MonALISA

Alexandru Costan, Mugurel Ionut Andreica, Valentin Cristea

Computer Science Department, University Politehnica of Bucharest
Bucharest, Romania
{alexandru.costan, mugurel.andreica, valentin.cristea}@cs.pub.ro

Costin Grigoras
ALICE Offline, CERN – European Organization for Nuclear Research
Geneva, Switzerland

[email protected]

Abstract— The usage domains of large-scale distributed systems have been extending during the past years from scientific to commercial applications. Together with this extension, new requirements have emerged, among which fault tolerance is needed by more and more modern distributed applications, not only by the critical ones. In this paper we present a solution for fault-tolerant monitoring of distributed systems within the MonALISA framework. Our approach uses replication and guarantees that all processing replicas achieve state consistency, both in the absence of failures and after failure recovery. We achieve consistency in the former case by implementing a module that ensures that the order of monitoring tuples is the same at all replicas. To achieve consistency after failure recovery, we rely on checkpointing techniques. We address the optimization of the replication architecture by dynamically monitoring and estimating inter-replica link throughputs and real-time replica status. We demonstrate the strengths of our solution using the MonALISA monitoring application in a distributed environment. Our tests show that the proposed approach outperforms previous solutions in terms of latency and that it uses system resources efficiently by carefully updating replicas, while keeping overhead very low.

Keywords: fault tolerance, replication, monitoring, distributed systems, Grid computing

I. INTRODUCTION

Large-scale distributed systems are hardly ever “perfect”. Due to their complexity, it is extremely difficult to produce flawlessly designed distributed systems. Fault tolerance is the ability of a large-scale distributed system to perform its function correctly even in the presence of faults occurring in various components. Traditional approaches for high availability (high resilience to fault occurrences) are based on the combination of redundancy and 24/7 operations support, which often proves prohibitively expensive. The characteristics of large-scale distributed systems make fault tolerance a difficult problem from several points of view. A first aspect is the geographical distribution of resources and users, which implies frequent remote operations and data transfers. These reduce the system's capability to detect faults and to manage correct group communication and consensus. Another problem is the volatility of the resources, which are usually available only for limited periods of time. The system must ensure the correct and complete execution of the applications even when resources are introduced and removed dynamically, or when they are damaged. Solving all these issues is still an active research domain.

One widely used technique to guarantee the availability and dependability of large-scale distributed systems in the presence of faults is replication. Data replication is also a good technique to support load balancing: by having several distributed nodes, one can spread the incoming requests equally across all the participating nodes that store replicas of the targeted data. In a typical distributed environment that collects or monitors data, useful data may be spread across multiple distributed nodes, but users or applications may wish to access that data from a central location (a data repository). A common way to ensure centralized access to distributed data is to maintain replicas of the data objects of interest at a central location. However, when data collections are large or volatile, keeping replicas consistent with remote master copies poses a significant challenge due to the large communication costs incurred. In this paper we address these challenges by proposing a data replication architecture for the repository backend of the MonALISA [1] distributed monitoring framework.

MonALISA is used in several large-scale collaborations to collect monitoring information from the computing nodes, storage systems, data-transfer applications, and software running in the local clusters. This yields more than 2 million parameters published in MonALISA, with an average update frequency of one minute. Moreover, specific filters aggregate the raw parameters to produce system-overview parameters in real time. These higher-level values are usually collected, stored, and presented using web interfaces at central locations, the MonALISA Repositories. Hence, the MonALISA Repository is the most stressed component of the entire monitoring architecture, dealing with hundreds of client queries per minute while having to meet high availability and reliability standards. We therefore designed a fault-tolerant, asynchronous multi-master replication architecture for the repository database.


With our approach, read and write database queries are numbered, distributed among replicas, handled by means of interval trees, and further used for synchronization when failures occur. We additionally optimized the execution of queries by implementing prepared statements. Thus, we achieved a high degree of fault tolerance and load balancing for every user interaction with the interface, being able to dynamically generate charts and serve user data requests quickly.

The remainder of this paper is organized as follows. In Section 2, we survey the related work and argue for the originality of our approach. In Section 3, we present the context of our work, the MonALISA monitoring framework, and detail the targeted component – the monitoring data repository. Section 4 gives an overview of our proposed replication architecture, while Section 5 describes the implementation details. In Section 6, we evaluate our solution using several testing scenarios and interpret the results. Finally, Section 7 concludes this paper.

II. RELATED WORK

Current replication techniques can be classified into two classes: active and passive replication. In the case of active replication, each request is processed by all replicas. This technique ensures a fast reaction to failures; however, it uses processing resources heavily and requires the processing of requests to be deterministic. With passive replication (also called primary-backup), only one replica (the primary) processes the request and sends update messages to the other replicas (the backups). This technique uses fewer resources than active replication, without requiring operation determinism; on the other hand, the replicated service usually reacts slowly to failures. An alternative approach combines the strengths of these two classes: semi-passive replication [2] retains the essential characteristics of passive replication while avoiding the necessity to force the crash of suspected processes.

Zyzzyva [3] is a Byzantine fault tolerance protocol that uses speculation to reduce the cost and simplify the design of state machine replication. In Zyzzyva, replicas respond to a client’s request by optimistically adopting the order proposed by the primary. This approach has the advantage of reducing the time needed for the client to receive a response. Replicas can become temporarily inconsistent with one another, but clients detect inconsistencies, help correct replicas converge on a single total ordering of requests, and only rely on responses that are consistent with this total order. This allows Zyzzyva to reduce replication overheads to near their theoretical minima.

Due to the overhead incurred, in many real-world environments exact replica consistency is not maintained; some form of inexact, or approximate, replication is typically used instead. Approximate replication [4] is often performed by refreshing replicas periodically. Periodic refreshing allows communication cost to be controlled, but it does not always make good use of communication resources. Two natural and complementary methods for working with the precision-performance tradeoff are proposed to achieve efficient communication resource utilization for replica synchronization: maximizing replica precision under constraints on communication cost, and minimizing communication cost under constraints on replica precision. In [5,6], other optimal replica placement strategies for different types of distributed systems (e.g. data grids) are presented. A replication approach for clusters was presented in [7]. Practical fault tolerance and replication solutions for WANs were described in [8,9].

Several replication solutions exist for commodity hardware. C-JDBC (Clustered-JDBC) [10] is an open-source middleware solution for database clustering. C-JDBC offers various load balancers according to the degree of replication the user wants. Database updates need to be sent to all nodes, and performance suffers from the need to broadcast updates when the number of backends increases. To address this problem, C-JDBC provides partial replication in which the user can define database replication on a per-table basis. Load balancers supporting partial replication must parse the incoming queries and need to know the database schema of each backend. Postgres-R [11] is designed to run on shared-nothing clusters with a low latency interconnect. Slony-I [12] is an asynchronous replicator of a single master database to multiple replicas, which in turn may have cascaded replicas.

We notice, however, that these systems do not fit our specific monitoring requirements. Some of them are designed only for the PostgreSQL database management system (DBMS), whereas one of the MonALISA repository’s purposes is not to impose any constraints on the choice of DBMS: each client may use its own DBMS (Oracle, PostgreSQL, MySQL, etc.) to run the repository. Furthermore, some of the presented replication models introduce a relatively large overhead while processing the queries. For example, Postgres-R uses changesets to replicate the databases, introducing a relatively large delay, even though the replicas are completely synchronous. Moreover, Slony-I follows a master-slave replication model; because one of our goals was a multi-master database replication model, we decided not to use it. Finally, some of these models are experimental software (e.g. Postgres-R) or see little industrial use (C-JDBC). All these observations motivated our research into a new replication model able to cope efficiently with the high requirements of a large-scale monitoring system.

III. THE MONALISA FRAMEWORK

MonALISA is a distributed monitoring system, relying on JINI for the discovery of running instances and the services they offer, able to provide complete monitoring, control and global optimization services for complex systems. Within this framework we developed a repository system able to present global views from the dynamic set of services running in a distributed environment to higher-level services. The system subscribes to a set of parameters or filter agents to receive selected information from the farms, and stores this monitoring information locally, using space and time optimizations.

Data is collected in order to enable retrospective analysis, thus dealing with large amounts of information and introducing reduction and aggregation mechanisms.


A servlet engine is used to dynamically present customized views of the jobs, resource usage, and key system and network metrics in a flexible way. More than 350 MonALISA services are running around the clock throughout the world. These services monitor more than 20,000 compute servers, hundreds of WAN links, and tens of thousands of concurrent jobs. More than 2 million parameters are monitored in near real time, with an aggregate update rate of approximately 35,000 parameters per second. Some of these parameters (usually global, aggregated values, but also details when needed) are collected and archived over long periods of time in the Repository component.

The MonALISA Repository system offers both a global view of the entire Grid and plots for several major aspects: components, services, jobs, nodes, and network transfer traffic. Being a central component that collects global data and serves as a user interface, the repository runs under heavy concurrency constraints, dealing with large numbers of queries per second for aggregating, processing and visualizing data. Global MonALISA repositories are used by many communities to aggregate information from many sites, organize it properly for the users, and keep long-term histories. During the past year, the repository system served more than 8 million user requests.

Therefore, an important concern of the MonALISA repository is discontinuity of service. This means not only disrupting users’ and software clients’ access to data, but also the absence of data for some time intervals. Multiple factors can lead to a failure of a central repository: hardware failures, operating system crashes, non-responsive applications, heavy traffic, and network failures. Replication is a proven way to enhance the availability of the system. Our approach deals with the replication of the database backend, while the replication of the repository server and user interface was discussed in [15]. We detail our adopted solution in the following section.

IV. THE REPLICATION ARCHITECTURE

Our replication architecture, depicted in Figure 1, is built around three components, which we detail in the following.

Figure 1. Overview of the replication architecture

A. The Replication Manager

The main module of our replication architecture implements the Replication Manager (RM). Its role is to intercept all the queries coming from the monitored farms and clients and to pass them to the replicas whenever it is queried. The RM saves the queries that modify data (update, insert or delete SQL commands) until all the replicas have received them. The queries are tagged with sequence numbers, which are used by the replicas when they want to update their databases, as well as in the answers given by the RM to the queries coming from the replicas. The Replication Manager consists of 3 sub-modules: the file remover module, the replica listener module and the init module.

• The File Remover Module removes files with queries from the RM’s local storage. It receives confirmations from replicas that certain intervals of queries have been executed, inserts those intervals into an interval tree, and queries this tree to check whether any interval has been completely received by all replicas. If there is one, the RM deletes the queries with sequence numbers in that interval and also removes the interval from the tree.

• The Replica Listener Module asynchronously listens for connections from replicas and serves their requests. Because the RM holds the queries until all the replicas have received them, every request to the RM should be successful.

• The Init Module receives the queries and, based on the type of each query (select or non-select), either looks through the online backends and selects the most updated replica, or logs the query, respectively. The most updated replica is the one that has executed the most queries (see the sketch below).
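To make this dispatch logic concrete, the following Java sketch shows one way the RM could tag, log, and route queries as described above. It is a minimal sketch under stated assumptions: the Replica and QueryLog interfaces and all names are illustrative, not the actual MonALISA classes.

    import java.util.List;
    import java.util.concurrent.atomic.AtomicLong;

    // Minimal sketch of the RM's init/dispatch path (illustrative names only).
    class ReplicationManagerSketch {
        private final AtomicLong sequence = new AtomicLong(0);
        private final QueryLog log;           // holds queries until all replicas confirm
        private final List<Replica> replicas;

        ReplicationManagerSketch(QueryLog log, List<Replica> replicas) {
            this.log = log;
            this.replicas = replicas;
        }

        // Selects go to the most updated replica; data-modifying queries are
        // tagged with a sequence number, logged, and pushed to every replica.
        String handle(String sql) {
            if (sql.trim().toLowerCase().startsWith("select")) {
                return mostUpdatedReplica().execute(sql);
            }
            long seq = sequence.incrementAndGet();
            log.save(seq, sql);               // kept until all replicas confirm reception
            for (Replica r : replicas) {
                r.send(seq, sql);             // asynchronous push
            }
            return "OK";
        }

        // The most updated replica is the one that has executed the most queries
        // (the sketch assumes at least one online replica).
        private Replica mostUpdatedReplica() {
            Replica best = replicas.get(0);
            for (Replica r : replicas) {
                if (r.isOnline() && r.lastExecutedSequence() > best.lastExecutedSequence()) {
                    best = r;
                }
            }
            return best;
        }
    }

    interface Replica {
        boolean isOnline();
        long lastExecutedSequence();
        void send(long seq, String sql);
        String execute(String sql);
    }

    interface QueryLog {
        void save(long seq, String sql);
    }

In this sketch, log.save makes the query durable before it is pushed, which is what later allows the RM to serve any replica that missed it.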

B. Replica

This module receives queries from the RM and saves them into files. At predefined intervals of time, it reads the queries from these files and executes them. It also sends messages to the RM requesting the sequence number of the last received query. If this differs from the last sequence number received locally, the replica tries to get the lost queries from the other replicas or, if the attempts to take files from the other peers fail, it gets them from the RM. A Replica consists of 3 modules: the File Remover Module, the Query Demander Module and the Query Sender Module.

• The File Remover Module periodically deletes files whose queries have already been read and executed. The period at which this removal occurs influences the overall performance of the architecture: in an unstable architecture, if the period is long enough, the chances increase that a replica finds files with queries at a peer, thus relieving the RM, which is the core of our architecture.

• The Query Demander Module connects periodically to the Replication Manager and requests the sequence number of the last received query. If it is greater than the sequence number of the last query received locally, the module tries to get the lost queries, retrieving them from the RM but also from the other replicas.

• The Query Sender Module accepts connections from the other peers and serves their requests for files with queries. It receives messages asking it to send certain files; it sends the requested files if they are found in its local storage, or sends null instead (as sketched below).
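The contract of the query sender can be illustrated with the hedged Java sketch below. The FileStore abstraction and the use of object streams are assumptions made for the example, since the wire protocol is not specified here.

    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Sketch of the Query Sender: serve peer requests for query files,
    // answering with the file content or null when the file is not held.
    class QuerySenderSketch implements Runnable {
        private final ServerSocket server;
        private final FileStore store;    // local files containing logged queries

        QuerySenderSketch(int port, FileStore store) throws IOException {
            this.server = new ServerSocket(port);
            this.store = store;
        }

        public void run() {
            while (!server.isClosed()) {
                try (Socket peer = server.accept();
                     ObjectOutputStream out = new ObjectOutputStream(peer.getOutputStream());
                     ObjectInputStream in = new ObjectInputStream(peer.getInputStream())) {
                    long fileId = in.readLong();
                    out.writeObject(store.read(fileId)); // null when absent
                } catch (IOException e) {
                    // a failed peer connection is not fatal; keep serving others
                }
            }
        }
    }

    interface FileStore {
        byte[] read(long fileId); // returns null if the file is not held locally
    }

Serving each request on the accepting thread keeps the sketch short; the real module accepts connections from several peers concurrently.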

C. The Monitoring Module

The third module monitors the link states between the replicas and between the replication manager and the replicas. Whenever a replica wants to update its database, it first queries the monitoring module to find out from which replicas it should transfer the missing files. The monitoring module finds the target replicas using heuristics over parameters such as the load on each replica, the CPU usage, the physical memory usage, the availability, and the available bandwidth between the demanding replica and the other replicas. After computing a score for each replica (see the sketch below), this module returns to the demanding replica a list of the least loaded and most available replicas from which it should begin copying the files.
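As an illustration, the score could be a weighted mean like the one below; the weights and the normalization of the parameters to [0, 1] are placeholder assumptions, not the values used in the production system.

    // Illustrative replica scoring by weighted mean. All metrics are assumed
    // normalized to [0, 1]; higher scores indicate better copy sources. The
    // weights are placeholders, not the production values.
    final class ReplicaScoreSketch {
        private ReplicaScoreSketch() {}

        static double score(double load, double cpuUsage, double memUsage,
                            double availability, double bandwidth) {
            // lower load/CPU/memory is better; higher availability/bandwidth is better
            return 0.25 * (1.0 - load)
                 + 0.15 * (1.0 - cpuUsage)
                 + 0.10 * (1.0 - memUsage)
                 + 0.25 * availability
                 + 0.25 * bandwidth;
        }
    }

The monitoring module would then return the peers sorted by this score in decreasing order.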

V. IMPLEMENTATION DETAILS

The MonALISA repository is configured to accept monitoring data with a certain frequency from farms by means of predicates. A predicate has the following pattern: Farm / Cluster / Node / start time / end time / function list. Consequently, the number of tables and the number of fields per table are constant (the great majority of tables have 4 fields: recvtime, mval, mmin, mmax), which is the main reason that determined us to use Prepared Statements (PS) for the execution of queries on the replicas. To implement this optimization, a special data structure, the prepared statement pool (PSP), was used. The PSP holds a certain number of PSs, and each PS is represented by a special data structure, described below.

First, we need the query and a signature for it. The signature may be the list of fields from the query or, in order to minimize the data structure’s size and the time spent verifying whether a query is generated from a certain PS, a hash function applied to the query’s fields. The data structure also contains a field named usage_count, the number of queries received by the RM since a certain point in time that matched that PS, and a field called timestamp, the moment in time when the PS was last used. These two fields are used when the PSP cache is full and an entry has to be removed before adding a new one (thus a least-used cache, as sketched below). When we have to execute a query that doesn’t match any PS from the PSP, a new PS is created for that query and introduced into the PSP. If the PSP is full, the PS with the oldest timestamp and lowest usage_count is removed. The maximum number of entries in the PSP was chosen based on the amount of memory allocated by the DBMS for processing incoming queries.
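One plausible shape of the PSP, following the description above, is sketched below in Java; the class layout and the exact eviction tie-breaking are our assumptions, as is sizing the pool externally.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the prepared statement pool (PSP): entries are keyed by a
    // hash of the query's field list; eviction removes the entry with the
    // oldest timestamp, breaking ties by lowest usage_count.
    class PreparedStatementPool {
        private static final class Entry {
            final PreparedStatement ps;
            long usageCount;
            long timestamp;
            Entry(PreparedStatement ps) { this.ps = ps; touch(); }
            void touch() { usageCount++; timestamp = System.currentTimeMillis(); }
        }

        private final Connection conn;
        private final int maxEntries; // sized from the DBMS query-processing memory
        private final Map<Integer, Entry> pool = new HashMap<>();

        PreparedStatementPool(Connection conn, int maxEntries) {
            this.conn = conn;
            this.maxEntries = maxEntries;
        }

        // signature: a hash of the query's field list, computed by the caller
        PreparedStatement get(String template, int signature) throws SQLException {
            Entry e = pool.get(signature);
            if (e == null) {
                if (pool.size() >= maxEntries) {
                    evictLeastUsed();
                }
                e = new Entry(conn.prepareStatement(template));
                pool.put(signature, e);
            } else {
                e.touch();
            }
            return e.ps;
        }

        private void evictLeastUsed() {
            Integer victim = null;
            long oldestTs = Long.MAX_VALUE;
            long lowestUse = Long.MAX_VALUE;
            for (Map.Entry<Integer, Entry> me : pool.entrySet()) {
                Entry e = me.getValue();
                if (e.timestamp < oldestTs
                        || (e.timestamp == oldestTs && e.usageCount < lowestUse)) {
                    victim = me.getKey();
                    oldestTs = e.timestamp;
                    lowestUse = e.usageCount;
                }
            }
            if (victim != null) {
                pool.remove(victim);
            }
        }
    }

A caller would compute the signature as a hash of the query’s field list and then bind the per-tuple values (recvtime, mval, mmin, mmax) on the returned statement.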

The file remover module was implemented by means of an interval tree [14]. The algorithm executed by this module is described in Fig. 2. The complexity of this algorithm is O(log N * log M), where N is defined together with the algorithm in Fig. 2 and M represents the number of files with queries held by the RM.

Figure 2. The file removing algorithm
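The contents of Fig. 2 are not reproduced here, so the Java sketch below shows one plausible realization of the bookkeeping it describes: confirmations are merged per replica, and files can be deleted up to the highest sequence number confirmed by every replica. A TreeMap stands in for the interval tree of [14], trading the stated O(log N * log M) bounds for brevity; all names are illustrative.

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.TreeMap;

    // Stand-in for the interval bookkeeping of the file remover: per-replica
    // confirmed intervals are merged, and query files can be deleted up to the
    // highest sequence number confirmed by every replica (sequences start at 1).
    class FileRemoverBookkeepingSketch {
        private final Map<String, TreeMap<Long, Long>> confirmed = new HashMap<>();

        // Record that replicaId executed all queries in [start, end], merging
        // the interval with any overlapping or adjacent ones.
        void confirm(String replicaId, long start, long end) {
            TreeMap<Long, Long> iv =
                confirmed.computeIfAbsent(replicaId, k -> new TreeMap<>());
            Map.Entry<Long, Long> left = iv.floorEntry(start);
            if (left != null && left.getValue() >= start - 1) {
                start = left.getKey();
                end = Math.max(end, left.getValue());
            }
            Iterator<Map.Entry<Long, Long>> it =
                iv.tailMap(start).entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Long, Long> e = it.next();
                if (e.getKey() > end + 1) break;
                end = Math.max(end, e.getValue());
                it.remove();
            }
            iv.put(start, end);
        }

        // Highest sequence number confirmed by all replicas; files holding
        // queries up to this number can be deleted from the RM's storage.
        long deletableUpTo(Collection<String> allReplicas) {
            long upTo = Long.MAX_VALUE;
            for (String r : allReplicas) {
                TreeMap<Long, Long> iv = confirmed.get(r);
                if (iv == null || iv.firstKey() > 1) return 0;
                upTo = Math.min(upTo, iv.firstEntry().getValue());
            }
            return upTo == Long.MAX_VALUE ? 0 : upTo;
        }
    }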

Regarding the Query Demander Module, we assumed that the oldest missing queries must be retrieved from the RM, because they may already have been deleted at the other replicas by their file remover modules; the latest queries must also be retrieved from the RM, because they may not yet have been received by the other replicas. The remaining queries can be taken from the other peers (replicas). The algorithm used to take the lost queries from the other replicas is the following: first, a replica queries the Replication Manager and gets a list of the replicas that are available and least loaded. A number of threads equal to the number of replicas found is started, and every thread connects to a specific replica. A thread takes a value from a queue containing the ids of the files that must be taken from replicas and tries to get that file from its peer. If a request to a peer fails (the replica cannot connect to the peer, or it receives nothing because the peer does not hold the file), the file number is put into another queue and, at the next iteration, it is requested from the RM, which is known to hold the queries until it is sure that all the replicas have received them.

The implementation of this mechanism is fault tolerant. Because a replica can fail at any time while simultaneously accepting data from several peers, we developed a mechanism that avoids re-requesting files that have already been completely copied. When a complete file arrives at the replica (from a peer or from the RM), its name is recorded both in a data structure and in a file on disk.


Consequently, if the replica fails before all transfers complete, the fully copied files will not be copied again at restart, whereas any partially copied file is deleted and transferred anew. Our copy algorithm is distributed: the replica copies the missing files from multiple replicas at the same time using worker threads (as sketched below). A worker thread picks up jobs from a queue and executes them: it tries to connect to a replica and request a file. If that replica is not online, the worker picks up another job. If it can connect to the peer but does not receive the requested file, it puts the file id into a second queue, from which the main thread picks up tasks and solves them by querying the RM.
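A worker thread in this distributed copy step could look like the sketch below. The FilePeer interface and the queue handling are illustrative assumptions; in particular, requeueing the job when a peer is offline is a simplification of the behavior described above.

    import java.util.concurrent.BlockingQueue;

    // Sketch of a copy worker: pull file ids from a shared queue, request them
    // from the peer this worker is bound to, and funnel failures into a second
    // queue that the main thread serves by querying the RM.
    class CopyWorkerSketch implements Runnable {
        private final FilePeer peer;                    // the replica this worker talks to
        private final BlockingQueue<Long> pending;      // file ids still to fetch
        private final BlockingQueue<Long> fallbackToRm; // ids to request from the RM

        CopyWorkerSketch(FilePeer peer, BlockingQueue<Long> pending,
                         BlockingQueue<Long> fallbackToRm) {
            this.peer = peer;
            this.pending = pending;
            this.fallbackToRm = fallbackToRm;
        }

        public void run() {
            Long fileId;
            while ((fileId = pending.poll()) != null) {
                if (!peer.isOnline()) {
                    pending.offer(fileId); // requeue for another worker and stop
                    return;
                }
                byte[] data = peer.requestFile(fileId);
                if (data == null) {
                    fallbackToRm.offer(fileId); // peer lacks the file: ask the RM
                } else {
                    storeLocally(fileId, data);
                }
            }
        }

        private void storeLocally(long fileId, byte[] data) {
            // write the file to local storage and mark it as fully received
        }
    }

    interface FilePeer {
        boolean isOnline();
        byte[] requestFile(long fileId); // null when the peer does not hold it
    }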

The monitoring module periodically receives parameters from the replicas. These parameters are collected by means of ApMon [1], a set of flexible APIs that can be used by any application to send monitoring information to MonALISA services; it is very lightweight and non-intrusive, and offers flexibility, dynamic configuration and high communication performance. Based on the received monitoring parameters, the module assigns a score to every replica (as a weighted mean) and, whenever it is queried, it returns the most suitable replicas from which the querying replica can copy its missing files. The system administrator can control the replication procedure by means of the replication manager: messages can be sent to the RM telling it to withdraw a replica from the replication architecture or to add a new one. The only details that must be sent are the Internet address and the listening port of the replica.

VI. SYSTEM EVALUATION

In order to evaluate our system’s performance and reliability, we relied on real production data collected from several LHC experiments that are instrumented with MonALISA (ALICE, CMS), monitoring data from high-speed data networks (USLHCNet), and Grids (OSG). The ALICE experiment, part of CERN’s Large Hadron Collider, has developed the ALICE production environment (AliEn), which implements many of the Grid computing components needed to analyze ALICE data. During the last year, more than 23,000 jobs have been run under AliEn control worldwide, totaling 3,000 CPU years and producing 20 TB of Monte Carlo data, while the real data collected and analyzed during this period has exceeded 2 PB. USLHCNet provides reliable, dedicated, high-bandwidth connectivity between the U.S. Tier1 centers and CERN (e.g., the LHC Optical Private Network (OPN) physical links). Finally, the Open Science Grid (OSG) is a distributed computing infrastructure for large-scale scientific research. The repositories collecting data from these Virtual Organizations are read/write-intensive, as evidenced by the numbers of served pages and received values: 30 pages per minute and about 8,000 values per minute for the OSG repository, and 120 pages and about 10,000 parameters per minute for the ALICE repository.

In order to test our system, we used a distributed experimental setup consisting of 3 geographically dispersed nodes running our RM, Monitoring Module and Replicas.

Figure 3. Monitoring chart from a MonALISA Repository without replication in the presence of failures

Figure 4. Monitoring chart from a MonALISA Repository with replication

One of them was located at the University Politehnica of Bucharest (UPB), Romania, one at Vrije University, Amsterdam, the Netherlands, and one at CERN, Switzerland. We started two replicas, one at CERN and one at Vrije, together with the Replication Manager and the Monitoring Module at UPB. First, we started the replica at CERN; after one hour we started the replica at Vrije and immediately stopped the replica at CERN. In this scenario, we observed that after a few seconds, even though the most updated replica (the one located at CERN) was down, the repository worked normally, because the replica at Amsterdam was providing accurate, up-to-date data.

Figures 3 and 4 show the behavior of the MonALISA repository without replication (Fig. 3) and with replication (Fig. 4) in the test mentioned above. As one can observe, there are some “gaps” in the first figure due to the lack of data, while in the second one the online replica provided the client with data transparently. We repeated these tests, increasing the number of replicas from 2 to 20 while incrementally increasing the time intervals of the failures. We noticed a corresponding increase in the synchronization time of the replicas, while the system still performed well and answered clients’ requests for data ranges.


Figure 5. The convergence time of a replica after being down for one day


Moreover, we monitored the overhead introduced on the Replication Manager by our replication architecture and concluded that it is almost undetectable. It increases slightly with the number of replicas in the system, and grows noticeably only in the following corner case: a large number of replicas (6 in our test) are all down for a period of time (e.g. one day) and are then all started at the same time. The overhead arises because the Replication Manager is the only entity that can provide the replicas with data, so it is stressed until all the replicas have updated their databases.

We also tested the time needed by a replica to become synchronized with the other ones. For this test, we stopped a replica for a day, restarted it, and monitored the time after which the replicas converged. As we can see from Fig. 5, this time is influenced by the number of replicas in the replication architecture and shows an asymptotic behavior. The reason is that the execution time of the queries on the restarted replica does not depend on the number of replicas and, furthermore, the copying algorithm reaches an upper limit and does not yield better results for larger numbers of replicas. Taking into account the results of our experimental setup, we argue that our replication architecture behaves correctly and is fault-tolerant, and because it implements a model of lazy replication (also known as optimistic replication) it does not introduce any propagation delays in transmitting data from the RM to the replicas.

VII. CONCLUSIONS

This paper addresses the challenges raised by enhancing a large-scale distributed monitoring system with fault tolerance support. Our goal was to endow the system with dependability features in the presence of failures, transparently for the end clients. We propose a replication architecture for the MonALISA repository. We implemented an optimistic replication model, which does not introduce any delay in forwarding the messages from the replication manager to the replicas, while keeping the induced overhead within non-intrusive limits.

The next step will consist in adapting the file transfer scheduling algorithm for autonomic behavior and for fast data transfers. We also intend to improve the security of our approach; currently our architecture relies on the security infrastructure of the underlying monitoring system: MonALISA inherits many features from the Globus Security Infrastructure (GSI), which we want to extend with new components to meet the specific requirements of our approach.

REFERENCES

[1] I. C. Legrand, H. Newman, R. Voicu, C. Cirstoiu, C. Grigoras, M. Toarta, C. Dobre. MonALISA: An Agent Based, Dynamic Service System to Monitor, Control and Optimize Grid Based Applications. Proc. of the Intl. Conf. on Computing in High Energy and Nuclear Physics, 2004, pp. 907-910.

[2] X. Defago, A. Schiper. Specification of Replication Techniques, Semi-Passive Replication, and Lazy Consensus. Report, School of Knowledge Science, Japan Advanced Institute of Science and Technology, 2002.

[3] R. Kotla, M. Dahlin. High Throughput Byzantine Fault Tolerance. Proc. of the Intl. Conf. on Dependable Systems and Networks, IEEE Computer Society, 2004, p. 575.

[4] C. A. Olston. Approximate Replication. Doctoral Thesis, UMI Order Number AAI3090652, Stanford University, 2003.

[5] X. Tang, S. T. Chanson. Optimal Replica Placement under TTL-Based Consistency. IEEE Trans. on Parallel and Distributed Systems 18 (3), 2007, pp. 351-363.

[6] P. Liu, J.-J. Wu. Optimal Replica Placement Strategy for Hierarchical Data Grid Systems. Proc. of the 6th IEEE Intl. Symp. on Cluster Computing and the Grid (CCGrid), 2006, pp. 417-420.

[7] L. Marowski-Bree. A New Cluster Resource Manager for Heartbeat. Proc. of the UKUUG LISA/Winter Conference, High-Availability and Reliability, 2004.

[8] W. Zhang. Linux Virtual Server for Scalable Network Services. Proc. of the Linux Symposium, Ottawa, 2000.

[9] Y. Amir, B. A. Coan, J. Kirsch, J. Lane. Customizable Fault Tolerance for Wide-Area Replication. Proc. of the IEEE Symp. on Reliable Distributed Systems (SRDS), 2007, pp. 65-82.

[10] E. Cecchet, J. Marguerite, W. Zwaenepoel. C-JDBC: Flexible Database Clustering Middleware.

[11] Postgres-R web page: http://www.postgres-r.org/documentation/

[12] Slony-I Replicator web page: http://www.onlamp.com/pub/a/onlamp/2004/11/18/slony.html

[13] E. Tirsa, M. Andreica, A. Costan. Data Replication Techniques with Applications to the MonALISA Distributed Monitoring System. IEEE Region 8 Conference EUROCON 2009, Sankt-Petersburg, Russia.

[14] T. Cormen, C. Leiserson, R. Rivest. Introduction to Algorithms.

[15] C. A. Olston. Approximate Replication. Doctoral Thesis, UMI Order Number AAI3090652, Stanford University, 2003.
