Storage Allocation

Data Allocation in Distributed Data Base in the Network Using Modularization Algorithm

ABSTRACT: Data allocation is one of the central problems in distributed database systems. In these systems, users at different locations need access to a shared database whose data are distributed across the network. Many models have been designed to address this problem, and several algorithms have been devised to reduce the data transfer cost and the response time. The dominant cost is data transfer: the cost of moving the data a query requires from the sites where they are stored to the site where the query was issued. In this article, we study modularization algorithms that reduce the data transfer cost during query execution by optimally modularizing the sites at which the data are located.

Keywords: Data allocation; Modularization algorithm; Clustering; Communication cost; Fragment allocation.

I. INTRODUCTION

Allocating data in a distributed database system is difficult and is regarded as one of the primary problems in distributed systems. Allocation may involve fragmentation, replication, and centralization. In fragmentation, relations are divided into pieces, which may be horizontal, vertical, or mixed. When different fragments are related, we have to decide which fragments to place at which sites [1]; this decision is one of the problems of fragmentation. In replication, we may need identical copies of a fragment. The difficulty here lies in deciding which fragment to replicate and where best to place each copy. In centralization, all fragments of a relation are stored at one site, and we must find the best site for them.

The main purpose of data allocation is to reduce the cost of executing a set of queries by determining a proper assignment of data fragments to sites. The major cost is that of transferring data from the sites where they are stored to the site where the query was issued, so we need to determine the amount of data transferred while a query is processed. The site that already holds most of the required data incurs the lowest transfer cost [3]. Therefore, the aim is to increase the amount of data held at the site from which the user issues a query that needs data scattered across the system. While a transfer is in progress, a contributing intermediate site might be used as well. Some issues must then be taken into account, including reserving space before the transfer and releasing it afterwards.

In a distributed database system (DDS), users at different places need access to a shared database. The data have to be distributed across the network and be available to the user at the lowest cost. A data transfer may fail because of some fault. If the transferred data are large, repeating the whole transfer after a failure is not feasible; in that case, the best approach is to resume the transfer from the remaining part. When transferring and allocating data to the sites, we also have to reduce memory overhead and the amount of requested data that is unrelated to the query.

The data allocation problem is NP-complete, so specific algorithms are needed to keep its time and space complexity manageable. In this article, we review and compare several methods and algorithms, and then present the modularization algorithm. By optimally modularizing the sites involved in the queries, this algorithm produces optimal results in practice in terms of the data transfer cost incurred while queries are executed.

II. A REVIEW OF DATA ALLOCATION ALGORITHMS IN DISTRIBUTED DATA BASE

Rezvan Mahmoudie, Qazvin Islamic Azad University, Young Researchers Club, Qazvin, Iran, [email protected]
Saeed Parsa, Iran University of Science and Technology, Tehran, Iran, [email protected]

2010 12th International Conference on Computer Modelling and Simulation

978-0-7695-4016-0/10 $26.00 © 2010 IEEE

DOI 10.1109/UKSIM.2010.24

A. The Genetic Algorithm

Search methods based on genetic algorithms (GAs) [5,6,7,8] are inspired by the mechanisms of natural genetics that lead to the survival of the fittest individuals. A genetic algorithm manipulates a population of potential solutions to an optimization problem. Specifically, it operates on encoded representations of the solutions, equivalent to the genetic material of individuals in nature, and not directly on the solutions themselves. In the simplest form, solutions in the population are encoded as binary strings. As in nature, the selection mechanism provides the driving force for better solutions to survive. Each solution is associated with a fitness value that reflects how good it is compared with the other solutions in the population [4]. The higher the fitness value of an individual, the higher its chance of surviving into the next generation. Recombination of genetic material is simulated through a crossover mechanism that exchanges portions between strings. Another operation, called mutation, causes sporadic and random alteration of the bits of strings. Mutation also has a direct analogy with nature and plays the role of regenerating lost genetic material.

In the proposed genetic algorithm for the data allocation problem, we encode the assignment of each data fragment in binary. For example, if a data fragment is assigned to site 3, its assignment value is 11. The assignment values of all data fragments are concatenated to form a binary string, and each such string represents a potential solution to the data allocation problem. The fitness of the string is simply the cost of the allocation. Selection is implemented as a simple proportionate scheme: a string with fitness $f$ is allocated $f/\bar{f}$ offspring, where $\bar{f}$ is the average fitness of the population. A string whose fitness is above the average is allocated more than one offspring, while a string whose fitness is below the average is allocated less than one.

Crossover is another crucial operation of a GA. Pairs of strings are picked at random from the population to be subjected to crossover. We use simple single-point crossover. If $L$ is the string length, the algorithm randomly chooses a crossover point in the range 1 to $L-1$, each of the $L-1$ possible values being equally likely, and the portions of the two strings beyond this point are exchanged to form two new strings. Crossover is performed only when a randomly generated number in the range 0 to 1 is less than a prespecified crossover rate $p_c$ (also called the probability of crossover); otherwise, the strings remain unaltered. The value of $p_c$ lies between 0 and 1. In a large population, $p_c$ gives the fraction of strings actually crossed.
After crossover, strings are subjected to mutation. Mutating a bit means flipping it. Just as $p_c$ controls the probability of crossover, another parameter, $p_m$ (the mutation rate), gives the probability that a bit will be flipped. The bits of a string are mutated independently.

Genetic Data Allocation Algorithm:

(1) Initialize population. Each individual of the population is a concatenation of the binary representations of the initial random allocation of each data fragment.
(2) Evaluate population.
(3) no_of_generation = 0
(4) WHILE no_of_generation < MAX_GENERATION DO
(5)     Select individuals for next population.
(6)     Perform crossover and mutation for the selected individuals.
(7)     Evaluate population.
(8)     no_of_generation++;
(9) ENDWHILE
(10) Determine final allocation by selecting the fittest individual. If the final allocation
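The steps of the genetic data allocation algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cost table `COST`, the population size, the rates, and the function names are all hypothetical. Because selection favours high fitness while a good allocation has low cost, the sketch assumes fitness = 1 / (1 + cost), and it also remembers the best string seen so far, a convenience that the listing above does not mention.

```python
import random

# Hypothetical toy instance (not from the paper): 4 sites, 5 fragments.
BITS = 2            # 2 bits per fragment encode site indices 0..3
NUM_FRAGMENTS = 5
# COST[f][s]: assumed cost of serving fragment f when it is stored at site s.
COST = [
    [4, 1, 3, 2],
    [2, 5, 1, 4],
    [3, 3, 0, 1],
    [1, 2, 4, 6],
    [5, 0, 2, 3],
]

def decode(bits):
    """Split the bit string into one site index per fragment."""
    return [int(bits[i * BITS:(i + 1) * BITS], 2) for i in range(NUM_FRAGMENTS)]

def allocation_cost(bits):
    return sum(COST[f][s] for f, s in enumerate(decode(bits)))

def fitness(bits):
    # Assumption: cheaper allocations must receive higher fitness.
    return 1.0 / (1.0 + allocation_cost(bits))

def select(population, rng):
    """Proportionate selection: expected offspring of a string is f / mean(f)."""
    weights = [fitness(ind) for ind in population]
    return rng.choices(population, weights=weights, k=len(population))

def crossover(a, b, pc, rng):
    """Single-point crossover at a point in 1..L-1, applied with probability pc."""
    if rng.random() < pc:
        point = rng.randint(1, len(a) - 1)
        return a[:point] + b[point:], b[:point] + a[point:]
    return a, b

def mutate(bits, pm, rng):
    """Flip each bit independently with probability pm."""
    return "".join(("1" if c == "0" else "0") if rng.random() < pm else c
                   for c in bits)

def genetic_allocation(pop_size=20, generations=50, pc=0.7, pm=0.02, seed=0):
    """Run the GA; pop_size is assumed to be even so strings pair up."""
    rng = random.Random(seed)
    length = BITS * NUM_FRAGMENTS
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(pop_size)]
    best = min(population, key=allocation_cost)
    for _ in range(generations):
        selected = select(population, rng)
        rng.shuffle(selected)               # random pairing for crossover
        population = []
        for i in range(0, pop_size - 1, 2):
            c1, c2 = crossover(selected[i], selected[i + 1], pc, rng)
            population += [mutate(c1, pm, rng), mutate(c2, pm, rng)]
        # Keep the best string seen so far (not part of the original listing).
        best = min(population + [best], key=allocation_cost)
    return decode(best), allocation_cost(best)

if __name__ == "__main__":
    print(genetic_allocation())
```

With two bits per fragment every string decodes to a valid site in this toy instance; a real encoding would also need a rule for bit patterns that do not correspond to an existing site.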

B. The Simulated Evolution Algorithm

This variant of the traditional genetic algorithm is called problem-space simulated evolution [9,10]. Simulated annealing, genetic algorithms, and evolutionary strategies are similar in that they all use a probabilistic search mechanism directed toward decreasing or increasing an objective cost function.

Simulated Evolution Data Allocation Algorithm:

(1) Construct the first chromosome based on the problem data and perturb this chromosome to generate an initial population;
(2) Use the mapping heuristic to generate a solution for each chromosome;
(3) Evaluate the solutions obtained;
(4) no_of_generations = 0;
(5) WHILE no_of_generations < MAX_GENERATION DO
(6)     Select chromosomes for next population;
(7)     Perform crossover and mutation for this set of chromosomes;
(8)     Use the mapping heuristic to generate a solution for each chromosome;
(9)     Evaluate the solutions obtained;
(10)     no_of_generations = no_of_generations + 1;
(11) ENDWHILE
(12) Output the best solution found so far;

C. The Mean Field Annealing Algorithm

The mean field annealing (MFA) technique [11,12,13] combines the collective computation property of the Hopfield Neural Network (HNN) with simulated annealing. MFA was originally proposed for solving the travelling salesperson problem as an alternative to HNN, which does not scale well to large problem sizes. It has been shown that MFA is a general approach [11,14] that can be applied to various combinatorial optimization problems. MFA for the data allocation problem can be formalized as described below. The mean field experienced by spin $s_{ij}$ is obtained using the mean field approximation. In a feasible allocation, each data fragment should be allocated to exactly one site, so the sum of the spins across each row of the spin matrix should equal unity. This constraint can be handled explicitly during updating by normalizing each spin as in Equation (1).

[Equation (1): mean field expression and spin normalization; the symbols are not recoverable from the source text.]


D. Random Neighborhood Search Data Allocation Algorithm

We employ a low-complexity but effective optimization technique known as random neighborhood search [15]. The main idea of a neighborhood search algorithm is to generate an initial solution of moderate quality. Then, according to some predefined neighborhood, the algorithm probabilistically selects a nearby solution in the search space and tests whether it is better. If the new solution is better, the algorithm adopts it and continues searching in the new neighborhood; otherwise, it selects another solution point. The algorithm stops after a specified number of search steps have elapsed or when the solution has not improved for a fixed number of steps. The solution quality of a neighborhood search technique relies heavily on the construction of the solution neighborhood.
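The search loop described above can be sketched as follows. This is an illustrative sketch only: the cost table `COST`, the choice of neighborhood (moving one randomly chosen fragment to a different site), and the parameter names `max_steps` and `patience` are assumptions, since the text does not fix them.

```python
import random

# Hypothetical toy instance (not from the paper).
# COST[f][s]: assumed cost of serving fragment f when it is stored at site s.
COST = [
    [4, 1, 3, 2],
    [2, 5, 1, 4],
    [3, 3, 0, 1],
    [1, 2, 4, 6],
    [5, 0, 2, 3],
]

def total_cost(alloc):
    """Cost of an allocation given as a list: alloc[f] = site of fragment f."""
    return sum(COST[f][s] for f, s in enumerate(alloc))

def random_neighbor(alloc, num_sites, rng):
    """Assumed neighborhood: move one randomly chosen fragment to another site."""
    neighbor = list(alloc)
    f = rng.randrange(len(alloc))
    neighbor[f] = rng.choice([s for s in range(num_sites) if s != alloc[f]])
    return neighbor

def neighborhood_search(initial, num_sites, max_steps=1000, patience=100, seed=0):
    """Adopt a neighbor only if it is better; stop after max_steps steps
    or after `patience` consecutive steps without improvement."""
    rng = random.Random(seed)
    current, current_cost = list(initial), total_cost(initial)
    stale = 0
    for _ in range(max_steps):
        candidate = random_neighbor(current, num_sites, rng)
        candidate_cost = total_cost(candidate)
        if candidate_cost < current_cost:
            current, current_cost, stale = candidate, candidate_cost, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return current, current_cost

if __name__ == "__main__":
    print(neighborhood_search([0, 0, 0, 0, 0], num_sites=4))
```

Because only improving moves are accepted, the returned cost is never worse than that of the initial solution; how close it gets to the optimum depends on the neighborhood, as the text notes.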

E. Near Neighborhood Allocation

The NNA algorithm is basically a variation of the optimal algorithm [15]. In the optimal algorithm, all fragments are initially distributed over the nodes according to any static method. Afterwards, each node j runs the following procedure for every fragment i that it stores.

(1) For each (locally) stored fragment, initialize the access counters to zero ($S_{ij} = 0$, where $i$ ranges over the fragment indexes and $j$ over the nodes).
(2) Process an access request for the stored fragment.
(3) Increase the access counter of the accessing node for the stored fragment (if node $x$ accesses fragment $i$, set $S_{ix} = S_{ix} + 1$).
(4) If the accessing node is the current owner, the access is local: go to step 2. Otherwise it is a remote access.
(5) If the counter of a remote node is greater than the counter of the current owner node, transfer ownership of the fragment, together with its access counter array, to the remote node; that is, the fragment migrates (if node $x$ accesses fragment $i$ and $S_{ix} > S_{ij}$, send fragment $i$ to node $x$).
(6) Repeat from step 2.
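Steps (1) to (6) amount to a per-fragment array of access counters that travels with the fragment. A minimal sketch, in which the class name `FragmentOwner` and the integer node identifiers are hypothetical choices made for illustration:

```python
class FragmentOwner:
    """Access counters for one fragment, kept at its current owner node."""

    def __init__(self, owner, num_nodes):
        self.owner = owner
        self.counters = [0] * num_nodes          # step 1: S_ij = 0 for all j

    def access(self, node):
        """Steps 2-5: record one access; return True if the fragment migrates."""
        self.counters[node] += 1                 # step 3: S_ix = S_ix + 1
        if node == self.owner:                   # step 4: local access
            return False
        if self.counters[node] > self.counters[self.owner]:  # step 5
            self.owner = node                    # counters move with the fragment
            return True
        return False


if __name__ == "__main__":
    fragment = FragmentOwner(owner=0, num_nodes=3)
    for node in [0, 0, 1, 1, 1]:                 # step 6: repeat per request
        migrated = fragment.access(node)
        print(node, fragment.counters, fragment.owner, migrated)
```

In this run the fragment stays at node 0 until node 1 has accessed it three times, one more than the owner's two accesses, and only then migrates.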

The problem with this approach is that if the access pattern of a fragment changes frequently, a lot of time is spent transferring the fragment between nodes, which increases the response time and delay.

III. MODULARIZATION ALGORITHM

Many solutions and algorithms have been presented for the data allocation problem, each with its own characteristics. The main purpose of all of them, however, is to decrease the data communication cost, so we have to choose the allocation that best reduces this cost. In the SE and GA algorithms, which are among the best algorithms, the major problem is that extra (redundant) data are ignored. There are ways to increase the computation speed of data allocation, decrease data redundancy, and reduce processing time. The method suggested in this article, intended to complement the previous methods, clusters the distributed sites and makes the allocation of data to the clusters, and to the sites within the clusters, more efficient. Clustering the sites decreases the number of communications during fragment allocation in the DDB and improves the system's performance, because it decreases the transfers among sites while a query is executed. Allocating the data fragments first to the clusters and then to the sites improves the execution of application programs by minimizing the cost of access to the sites, and ultimately decreases the data transfer cost. It also increases the speed of the database system by maximizing the number of parallel executions.

The algorithm presented in this article, referred to as the modularization algorithm, first requires the extraction of the communication cost graph. In this graph, the nodes are the sites that hold the data fragments, and the edges are weighted by the communication cost between these sites. The sites are modularized so as to decrease the data transfer cost and increase the speed of query processing when a query is executed. We make the following assumptions about the communication cost between sites:

1. The communication cost from site i to site j is equal to the communication cost from site j to site i.
2. The communication cost between a site and itself is zero.
3. The communication cost is proportional to the distance.

Based on the above, an example is illustrated in Table 1.

A. Drawing the Communication Cost Graph

To draw the communication cost graph, we need the communication costs between sites. Table 1 presents an example of the communication costs between the six sites of the proposed DDB, measured in communication units [2]. The corresponding graph is shown in Fig.1.

Table 1 Communication costs between DDBs sites


B. Computing the Laplacian Matrix

The Laplacian matrix is calculated from Equation (2):

$$L = D - A \tag{2}$$

where $A = [a_{i,j}]$ is the weighted adjacency matrix of the communication cost graph and $D$ is the degree matrix of the graph, computed from Equation (3):

$$d_{i,j} = \begin{cases} \sum_{k=1}^{n} a_{i,k}, & i = j \\ 0, & i \neq j \end{cases} \tag{3}$$

After computing $L$, its eigenvalues are computed, and the eigenvector corresponding to the second smallest eigenvalue of the Laplacian matrix is then determined.
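The computation in this subsection can be sketched with the standard library alone. The weight matrix `EXAMPLE` below is hypothetical (Table 1 is not reproduced here), and the eigenvector is obtained by power iteration on $cI - L$ restricted to vectors orthogonal to the constant vector; that numerical method is a choice made for this sketch, not one stated in the paper. The function names are likewise illustrative.

```python
import math

def laplacian(a):
    """L = D - A, where D is the diagonal matrix of the row sums of A.
    A is assumed symmetric with a zero diagonal."""
    n = len(a)
    return [[(sum(a[i]) if i == j else 0) - a[i][j] for j in range(n)]
            for i in range(n)]

def fiedler_vector(a, iterations=500):
    """Approximate the eigenvector of the second smallest eigenvalue of L."""
    n = len(a)
    lap = laplacian(a)
    # c bounds the largest eigenvalue of L from above (Gershgorin),
    # so c*I - L has no negative eigenvalues.
    c = 2 * max(lap[i][i] for i in range(n))
    v = [float(i) for i in range(n)]             # deterministic start vector
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]                # drop the constant component
        w = [c * v[i] - sum(lap[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical graph: two triangles (nodes 0-2 and 3-5) joined by a weak edge.
EXAMPLE = [
    [0, 1, 1, 0,   0, 0],
    [1, 0, 1, 0,   0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1,   0, 1],
    [0, 0, 0, 1,   1, 0],
]

if __name__ == "__main__":
    print([round(x, 4) for x in fiedler_vector(EXAMPLE)])
```

For this example the entries for nodes 0 to 2 share one sign and those for nodes 3 to 5 the other. In practice a linear algebra routine such as `numpy.linalg.eigh` would normally replace the hand-written iteration.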

C. Modularization Based on Eigenvector

The eigenvector corresponding to the second smallest eigenvalue of the Laplacian matrix has both positive and negative entries. The sign of each entry determines the cluster of the corresponding vertex: vertices with positive entries are placed in one cluster and vertices with negative entries in the other (Fig.2).

$$v_2 = [-0.1455,\ 0.2434,\ 0.4286,\ -0.4880,\ -0.5214,\ 0.4754]$$
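Applied to the eigenvector given above, the sign rule can be written out directly. The sketch assumes that the six entries are listed in site order 1 to 6 and that an entry of exactly zero, which does not occur here, would be placed with the negative group; `split_by_sign` is a hypothetical helper name.

```python
def split_by_sign(vector):
    """Return (sites with positive entries, remaining sites), numbered from 1."""
    positive = [i + 1 for i, x in enumerate(vector) if x > 0]
    negative = [i + 1 for i, x in enumerate(vector) if x <= 0]
    return positive, negative

eigenvector = [-0.1455, 0.2434, 0.4286, -0.4880, -0.5214, 0.4754]

if __name__ == "__main__":
    print(split_by_sign(eigenvector))   # ([2, 3, 6], [1, 4, 5])
```

The positive group (2, 3, 6) matches the first cluster reported in the evaluation section; sites 1, 4, and 5 form the other part at this stage.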

D. Computing Modularization Quality

The criterion for evaluating a modularization into $k$ clusters is the modularization quality MQ, calculated from Equation (4):

$$MQ = \begin{cases} \dfrac{1}{k}\sum_{i=1}^{k} A_i - \dfrac{1}{\frac{k(k-1)}{2}}\sum_{i,j=1}^{k} E_{i,j}, & k > 1 \\ A_1, & k = 1 \end{cases} \tag{4}$$

Here $A_i$ measures the internal connectivity of cluster $i$, which contains $N_i$ nodes and has total internal edge weight $m_i$, as shown in Equation (5):

$$A_i = \frac{m_i}{N_i^2} \tag{5}$$

The connectivity between clusters $i$ and $j$, containing $N_i$ and $N_j$ nodes respectively, is denoted by $E_{i,j}$. A high degree of connectivity between clusters indicates a poor choice of clusters. This criterion is calculated from Equation (6):

$$E_{i,j} = \begin{cases} 0, & i = j \\ \dfrac{e_{i,j}}{2 N_i N_j}, & i \neq j \end{cases} \tag{6}$$

In this equation, $e_{i,j}$ is the weight of the edges connecting clusters $i$ and $j$, and $N_i$ and $N_j$ are the numbers of nodes in clusters $i$ and $j$, respectively. The maximum coupling between clusters $i$ and $j$ is $2 N_i N_j$; thus, $E_{i,j}$ ranges between zero and one.
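A small sketch of how MQ from Equations (4) to (6) might be evaluated for a given clustering. It rests on assumptions the text leaves open: the graph is undirected and given as a symmetric weight matrix, each undirected edge is counted once in the internal weight and in the inter-cluster weight, and the double sum in Equation (4) runs over all ordered pairs of distinct clusters. The function name and the example graph are hypothetical.

```python
def modularization_quality(weights, clusters):
    """MQ of a clustering; clusters is a list of lists of node indices."""
    k = len(clusters)

    def intra(cluster):
        # A_i = m_i / N_i^2, with m_i the total weight of edges inside the cluster.
        m = sum(weights[u][v]
                for pos, u in enumerate(cluster) for v in cluster[pos + 1:])
        return m / len(cluster) ** 2

    def inter(ci, cj):
        # E_ij = e_ij / (2 N_i N_j), with e_ij the weight between the clusters.
        e = sum(weights[u][v] for u in ci for v in cj)
        return e / (2 * len(ci) * len(cj))

    if k == 1:
        return intra(clusters[0])
    a_sum = sum(intra(c) for c in clusters)
    e_sum = sum(inter(clusters[i], clusters[j])
                for i in range(k) for j in range(k) if i != j)
    return a_sum / k - e_sum / (k * (k - 1) / 2)

# Hypothetical graph with unit weights: two triangles joined by one edge.
GRAPH = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

if __name__ == "__main__":
    print(modularization_quality(GRAPH, [[0, 1, 2], [3, 4, 5]]))
    print(modularization_quality(GRAPH, [[0, 1, 2, 3, 4, 5]]))
```

For this graph, splitting into the two triangles gives MQ = 1/3 - 1/9 = 2/9, which is higher than the 7/36 obtained when all six nodes are kept in one cluster, so the split would be preferred under these assumptions.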

E. Algorithm Iteration

The number of iterations of the algorithm equals the number of requested clusters. If the number of clusters is not determined beforehand, the algorithm is executed on the graph and the resulting MQ value is compared with that of the previous iteration. If the criterion does not improve, the modularization stops.

F. Allocating Fragments to the Clusters and Sites

Initially, we allocate each fragment to all clusters that use it. The allocation decision value ADV(Tk, Fi, Cj) for allocating a fragment Fi requested by transaction Tk to a cluster Cj is computed from the difference between CN(Tk, Fi, Cj), the cost of not allocating fragment Fi requested by Tk to cluster Cj, and CA(Tk, Fi, Cj), the cost of allocating it.

Cost of allocating a fragment to a cluster: CA(Tk, Fi, Cj). The cost of allocating fragment Fi to cluster Cj is computed as the sum of the following costs: local retrievals, local updates, space, remote updates, and remote communication.


Fig.1 Communication Cost Graph

Fig.2 Modularization Based on Eigenvector


Cost of not allocating a fragment to a cluster: CN(Tk, Fi, Cj). The cost of not allocating fragment Fi to cluster Cj is computed as the sum of the cost of local retrievals and the cost of remote retrievals.

Allocation decision value for a cluster: ADV(Tk, Fi, Cj). We define the allocation decision value ADV(Tk, Fi, Cj) for allocating fragment Fi requested by transaction Tk to cluster Cj as a logical value, computed by Equation (7):

$$ADV(T_k, F_i, C_j) = \begin{cases} 1, & CA(T_k, F_i, C_j) < CN(T_k, F_i, C_j) \\ 0, & CA(T_k, F_i, C_j) \geq CN(T_k, F_i, C_j) \end{cases} \tag{7}$$

If the allocation decision value is true (1), the fragment is permanently allocated to cluster Cj. If it is false (0), the fragment is removed from cluster Cj.
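The decision rule can be illustrated with a short sketch. The cost components are the ones named in the text, but the function names, parameter names, and numeric values are hypothetical; the paper does not say how each component is measured.

```python
def cost_of_allocating(local_retrievals, local_updates, space,
                       remote_updates, remote_communication):
    """CA: sum of the five cost components listed for allocation."""
    return (local_retrievals + local_updates + space
            + remote_updates + remote_communication)

def cost_of_not_allocating(local_retrievals, remote_retrievals):
    """CN: local retrieval cost plus remote retrieval cost."""
    return local_retrievals + remote_retrievals

def allocation_decision(ca, cn):
    """ADV: 1 if allocating is strictly cheaper than not allocating, else 0."""
    return 1 if ca < cn else 0

if __name__ == "__main__":
    # Made-up cost figures for one (transaction, fragment, cluster) triple.
    ca = cost_of_allocating(local_retrievals=4, local_updates=2, space=1,
                            remote_updates=3, remote_communication=2)
    cn = cost_of_not_allocating(local_retrievals=4, remote_retrievals=11)
    print(ca, cn, allocation_decision(ca, cn))
```

With these figures CA = 12 and CN = 15, so the fragment would be kept in the cluster; when the two costs are equal the rule returns 0 and the fragment is removed.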

IV. EVALUATING THE SUGGESTED ALGORITHM

Various criteria can be used to evaluate allocation algorithms in a DDB. To evaluate the suggested algorithm, we consider the data transfer cost and the cost of allocating data to the sites.

In the example used with the modularization algorithm, we obtained three groups: the first includes sites (2,3,6), the second includes sites (5,1), and the third contains site 4. We now compare the communication cost among these sites with and without clustering. Clustering the sites decreases the number of communications and consequently the communication cost. Fig.2 illustrates the cost reduction with and without clustering for this example. We can increase the efficiency of the system by eliminating redundant parts from the database and by increasing certainty about the several copies of a fragment. The number of allocations to the clusters and sites, as shown in Fig.3, also decreases.

A. Performance Evaluation of Clustering: Number of Communications

Grouping sites into clusters reduces the number of communications, which in turn minimizes the communication costs incurred in the later fragment allocation phase. In the example introduced in this article, we simulate a network of six sites in which each site communicates with the other 5 sites, so the initial total number of communications is 6 × 5 = 30. After clustering the sites into three clusters, each cluster communicates with the other 2 clusters, taking into account the communication within the cluster itself because it contains sites with different communication costs; in this case the total number of communications is 6. Our clustering method therefore reduces the number of communications from 30 to 6, an improvement of 80%, as given by Equation (8):

$$\text{Improvement percentage} = \frac{\text{Reduced number of communications}}{\text{Total number of communications}} = \frac{24}{30} = 80\% \tag{8}$$

B. Evaluating the Allocation of Fragments to Clusters and Sites

Initially, allocating fragments to all clusters that have applications requesting them generates a total of 20 allocations, whereas only 8 allocations are generated after applying our fragment allocation method (Fig.3 and Fig.4). In this case the system performance improves by 60%, as given by Equation (9):

$$\text{Improvement percentage} = \frac{\text{Reduced number of allocations}}{\text{Total number of allocations}} = \frac{12}{20} = 60\% \tag{9}$$

C. Evaluating the Cost of Communications

We use the average communication cost between clusters and their sites because computing the average has lower time complexity than other methods, which depend on sorting the sites to find the least communication cost. For example, the initial communication cost between the sites in cluster 1 and the sites in cluster 2, which equals 32 (7 + 2 + 7 + 7 + 2 + 7 from Table 1), is replaced by the average communication cost between cluster 1 and cluster 2 (C1, C2) and between cluster 2 and cluster 1 (C2, C1), which equals 10.6 (5.3 + 5.3) (Fig.5). Thus, the communication cost is reduced from 32 to 10.6, and the performance improvement is computed in Equation (10):

$$\frac{\text{Average communication cost}}{\text{Initial communication cost}} = \frac{10.6}{32} \approx 33\% \tag{10}$$

That is, the cost after clustering is about one third of the initial cost, a reduction of about 67%.
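The arithmetic behind Equations (8) to (10) can be checked with a few lines. The sketch assumes that improvement is measured as (before - after) / before, which reproduces 24/30 and 12/20; for the communication cost it also prints the remaining fraction 10.6/32 that appears in Equation (10). The helper name `improvement` is hypothetical.

```python
def improvement(before, after):
    """Fraction saved: (before - after) / before."""
    return (before - after) / before

CASES = {
    "communications (Eq. 8)": (30, 6),
    "allocations (Eq. 9)": (20, 8),
    "communication cost (Eq. 10)": (32, 10.6),
}

if __name__ == "__main__":
    for name, (before, after) in CASES.items():
        print(f"{name}: {improvement(before, after):.0%} saved, "
              f"{after / before:.0%} remaining")
```

The first two cases give the 80% and 60% stated in the text. For the communication cost, 10.6/32 is the fraction that remains (about 33%), so the saving is 21.4/32, about 67%.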


Fig.3 Fragments Allocation to Clusters


REFERENCES

[1] I. Ahmad and K. Karlapalem, “Evolutionary algorithms for allocating data in distributed database systems,” Distributed and Parallel Databases, vol. 11, pp. 5–32, 2002.
[2] I.O. Hababeh, M. Ramachandran, and N. Bowring, “A high-performance computing method for data allocation in distributed database systems,” J. Supercomput., vol. 39, pp. 3–18, 2007.
[3] T. Kosar and M. Livny, “A framework for reliable and efficient data placement in distributed computing systems,” J. Parallel Distrib. Comput., vol. 65, pp. 1146–1157, 2005.
[4] C. Cheng, W. Lee, and K. Wong, “A genetic algorithm-based clustering approach for database partitioning,” IEEE Trans. Syst. Man Cybern., Part C: Appl. Rev., vol. 32, no. 3, 2002.
[5] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley: Reading, MA, 1989.
[6] S. Hurley, “Taskgraph mapping using a genetic algorithm: A comparison of fitness functions,” Parallel Computing, vol. 19, pp. 1313–1317, 1993.
[7] S.W. Mahfoud and D.E. Goldberg, “Parallel recombinative simulated annealing: A genetic algorithm,” Parallel Computing, vol. 21, pp. 1–28, 1995.
[8] M. Srinivas and L.M. Patnaik, “Genetic algorithms: A survey,” Computer, vol. 27, no. 6, pp. 17–26, 1994.
[9] K. Karlapalem and M.-P. Ng, “Query-driven data allocation for distributed database systems,” in Proceedings of 8th International Conference on Database and Expert Systems Applications, Sept. 1997, pp. 347–356.
[10] Y.-K. Kwok, K. Karlapalem, I. Ahmad, and M.-P. Ng, “Design and evaluation of data allocation algorithms for distributed multimedia database systems,” IEEE Journal on Selected Areas in Communications, vol 14,
[11] D.E. Van den Bout and T.K. Miller, “Graph partitioning using annealed neural networks,” IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 192–203, 1990.
[12] T. Bultan and C. Aykanat, “A new mapping heuristic based on mean field annealing,” Journal of Parallel and Distributed Computing, vol. 16, pp. 292–305, 1992.
[13] C. Peterson and B. Soderberg, “A new method for mapping optimization problems onto neural networks,” International Journal of Neural Systems, vol. 1, no. 3, 1989.
[14] T. Bultan and C. Aykanat, “A new mapping heuristic based on mean field annealing,” Journal of Parallel and Distributed Computing, vol. 16, pp. 292–305, 1992.
[15] C.H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall: Englewood Cliffs, NJ, 1982.

Fig.5 Performance Evaluation of Clustering (Communication Cost)

Fig.4 Fragments Allocation to sites
