The Customized Database Fragmentation Technique in Distributed Database Systems Mohammed Ibrahim Shareef Aus Wail Al-Rawi MASTER THESIS 2011 INFORMATICS

The Customized Database Fragmentation Technique in

Distributed Database Systems

Mohammed Ibrahim Shareef

Aus Wail Al-Rawi

MASTER THESIS 2011

INFORMATICS

Postal address: Box 1026, 551 11 Jönköping

Visiting address: Gjuterigatan 5

Telephone: 036-10 10 00 (switchboard)

The Customized Database Fragmentation Technique in

Distributed Database Systems

Mohammed Ibrahim Shareef

Aus Wail Al-Rawi

This thesis was carried out at Tekniska Högskolan i Jönköping (School of Engineering, Jönköping University) within the subject area of informatics. The work is part of the master's programme specializing in information technology and management. The authors themselves are responsible for the opinions, conclusions and results presented.

Supervisor: Anders Carstensen

Examiner: Vladimir Tarasov

Scope: 30 credits (D-level)

Date:

Archive number:

Abstract

Many companies currently use a centralized database system for daily business transactions in different domains. Critical issues have been observed concerning the complexity, maintenance, performance and communication cost of query processing against a centralized data repository when end-users access it from different locations. Enterprises are therefore striving to implement efficient distributed database systems in their business environments to achieve scalability. The distributed database architecture covers several factors, such as a transparent management system, replication, fragmentation and allocation. This dissertation focuses on database fragmentation and on the techniques that are useful for performing it.

The objective of this research is to investigate efficient algorithms and techniques for database fragmentation in a distributed environment. After a comparative study of the most suitable techniques, we propose a customized ISUD (Insert, Select, Update, Delete) technique, which was selected for implementation. The customized ISUD technique helps to determine the precedence of the attributes of a relation at the various sites or locations, as a basis for fragmenting the relation horizontally.

The practical objective of this dissertation is to design the architecture of the customized ISUD (Insert, Select, Update, Delete) technique, to develop and implement a user interface for it, and to test the selected algorithm or technique by using that interface. We used C#.NET as the development tool. The user interface accepts ISUD frequencies as input and produces ALP (attribute locality precedence) values as output. We followed the design science research (DSR) method in developing the customized ISUD technique. The technique can be considered a foundation for implementing horizontal database fragmentation in a distributed environment, so that the database administrator can make an informed decision about allocating the fragmented data to the various sites at the initial stage of distributed database design.

Summary

Today, various companies use a centralized database system for daily business transactions in different domains. Some critical issues have been observed in connection with the complexity, maintenance, performance and communication costs of data in a centralized data repository when processing queries according to the demands of end-users at different locations. Various companies are therefore striving to implement efficient distributed database systems in their business environments for the sake of scalability. The distributed database architecture covers factors such as a transparent management system, replication, fragmentation and allocation. This thesis focuses on database fragmentation and on techniques that are useful for performing it.

The purpose of this research is to investigate efficient algorithms and techniques for database fragmentation in a distributed environment. We proposed a customized ISUD (Insert, Select, Update, Delete) technique after a comparative study of the most suitable techniques, and selected it for implementation. The functionality of the customized ISUD technique helps to obtain the precedence of the attributes of a relation horizontally in the database from different sites.

The practical purpose of this thesis is to design the architecture and to develop and implement a customized ISUD (Insert, Select, Update, Delete) user interface, and to test the selected algorithm or technique with the help of that interface. We used C#.NET as the development tool. The user interface accepts ISUD frequencies as input and produces ALP (attribute locality precedence) values as output. We have incorporated the design science research (DSR) method in developing the customized ISUD technique. The customized ISUD technique can be regarded as a foundation for implementing horizontal database fragmentation in a distributed environment, so that the database administrator can make a proper decision about allocating the fragmented data to different sites at the initial stage of distributed database design.

Acknowledgements

With immense pleasure we take this opportunity to thank everyone who helped to make this project possible.

First of all, we would like to thank almighty God, the Most Beneficent and the Most Merciful. We would like to thank Jönköping University for giving us the opportunity to work on a thesis as part of our curriculum. We would also like to thank our supervisor, Anders Carstensen, for his advice, support and role as facilitator throughout this final project. We would also like to thank our examiner and professor, Dr. Vladimir Tarasov, for his valuable suggestions and guidance throughout our thesis. We also thank Mr Markus Milerup, representing Jordbruksverket (the Swedish Board of Agriculture), for providing information about the organization's problems within the scope of this thesis project. Finally, we would like to thank our families and friends, who gave us social and moral support in completing this thesis.

Key words

Distributed database, Database Fragmentation, Attribute Locality Precedence, Customized ISUD.

Contents

1 Introduction
1.1 Background
1.1.1 Contribution of the thesis
1.2 Case Study
1.2.1 Swedish Board of Agriculture
1.2.2 Case Study for Testing Purpose or for Evaluation of Proposed Technique
1.3 Purpose/Objectives
1.3.1 Research Question
1.3.2 Theoretical Purpose
1.3.3 Practical Purpose
1.3.4 Assumption
1.4 Limitations
1.5 Thesis Outline

2 Theoretical Background
2.1 General Description of Distributed Database
2.1.1 What is a Distributed Database System?
2.1.2 Application of Distributed database technology
2.2 Distributed Database Architecture
2.2.1 Architectural Models for Distributed database system
2.3 Unsolved Problems in DDBS
2.3.1 Distribution design
2.3.2 Network scaling problems
2.4 Distribution Design Problems
2.4.1 The Complexity of the Problems
2.4.2 Interdependencies with Query Optimization
2.4.3 Improvised Solution for the problems mentioned
2.5 Initial Design Approach for Distributed Database Design
2.5.1 Requirements analysis
2.5.2 Conceptual project
2.5.3 Logical project
2.5.4 Distribution project
2.5.5 Physical project
2.6 Fragmentation in Distributed Database Design
2.6.1 Horizontal Fragmentation
2.7 Previous Works on Fragmentation in DDBS
2.7.1 Database Fragmentation Technique by Shahidul Islam Khan and Dr. A. S. M. Latiful Hoque
2.8 Generic Five Steps for Data Fragmentation and Allocation in Distributed Database Systems
2.8.1 Collection of Global Relations
2.8.2 Frequently Asked Question (FAQs)
2.8.3 Data Allocation Goals

3 Research Method
3.1 Categories of Research Methods
3.2 High Level Research Method for Data Inquiry
3.3 Low Level Method for Research Design
3.3.1 Constructive Research
3.3.2 Phases of Constructive Research
3.4 Low Level Design Research Methodology (DSR) for Implementation
3.4.1 Steps of the Design Science Research Method (DSR)

4 Results
4.1 Theoretical Results
4.2 Practical Results
4.2.1 Proposed 5-Layer Architecture
4.2.2 Testing the Proposed Algorithmic approach

5 Discussion
5.1 Contribution of the Work

6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work

7 References

8 Appendix
8.1 Case Study Application
8.2 Log File Code for Generating Customized ISUD Matrix Table
8.3 Algorithm for ISUD Application Interface

List of Figures

Figure 1: Database management system implementation alternatives [1]
Figure 2: Stages of the top-down approach in distributed databases [3] [5]
Figure 3: Block diagram of the system [24]
Figure 4: Algorithm for fragmentation [24]
Figure 5: ALP table construction pseudo-code [24]
Figure 6: Research design method [7]
Figure 7: Constructive research methodology for research design
Figure 8: The general methodology of design science research [22]
Figure 9: 5-layer architecture for proposed fragmentation technique
Figure 10: Application of a case study
Figure 11: Database of case study application
Figure 12: CISUD matrix table
Figure 13: User interface for CISUD application
Figure 14: Interface for setting and getting the predicate set for individual highest attribute
Figure 15: Predicate set for highest attribute precedence at individual site
Figure 16: Allocation of fragments
Figure 17: ISUD user interface for total cost of attribute from all sites
Figure 18: Results retrieved for total ALP (attribute locality precedence) value from three sites
Figure 19: ISUD user interface for individual cost of attribute from individual sites
Figure 20: Individual ALP results from individual sites
Figure 21: Allocation of data to different sites
Figure 22: ISUD input values (1)
Figure 23: Interpretation of result 1
Figure 24: Graphical interpretation of result 1
Figure 25: ISUD input values (2)
Figure 26: Interpretation of result 2
Figure 27: Graphical interpretation of result 2
Figure 28: Bharat Transport Service application (case study)

List of Tables

Table 1: Project S1
Table 2: Project S2
Table 3: Comparison framework of different techniques with respect to key characteristics

List of Abbreviations

DDBS: Distributed Database Systems

DDBMS: Distributed Database Management Systems

ALP: Attribute Locality Precedence

CISUD: Customized (Insert, Select, Update, Delete)

DSR: Design Science Research

HF: Horizontal Fragmentation

1 Introduction

This introductory chapter presents the chosen research domain, explains the importance of the research, and states the objectives and limitations of the work in this dissertation. It also gives the background of the problem domain and identifies the potential problems found in the literature on this area of research.

1.1 Background

Distributed database systems are becoming increasingly important for sharing and managing information within large corporations and other organizations. The emergence of distributed database management systems (DDBMS) builds on the maturing of database management systems (DBMS) together with significant developments in computer networks and distributed computing technologies [1]. A distributed database (DDB) is defined as a collection of multiple, logically interrelated databases distributed over a computer network [1]. Distributed database activities are controlled by a distributed database management system (DDBMS). “A distributed database management system (DDBMS) is the software system that permits the management of the distributed database and makes the distribution transparent to the users” [1, p.3].

Before going further, it is useful to give a brief overview of the kinds of distributed database systems. They are categorized as homogeneous distributed database systems (Homo-DBS) and heterogeneous distributed database systems (Hetero-DBS) [2]. A homogeneous distributed database uses the same data models, schemas and databases throughout, whereas a heterogeneous distributed database must deal with schema integration, distributed query processing, distributed transaction management, administrative functions and different types of heterogeneity [2]. Heterogeneity can arise in computer hardware, operating systems, communication links, data models, protocols and database management systems [2].

Distributed and parallel processing in database management systems (DBMS) is regarded as an efficient way of improving the performance of applications that manipulate large volumes of data in an organization [8]. Distributed database design aims to eliminate irrelevant data accesses when queries are executed from various locations and to reduce the cost of communicating data shared among the sites. Distribution design also involves decisions about data fragmentation and about the placement of fragments across the sites of the distributed environment [8].

A distributed database allows data to be fragmented, replicated and distributed [9] over an intranet or the internet, within an organization and across organizations. The client/server architecture provides a platform in which a number of client machines can access a single database server; it also helps to distribute and allocate the data across multiple sites, which have to communicate with each other when responding to user queries and executing remote transactions [1].

Distributed database design raises several issues [5], and these issues complicate the design architecture. In a distributed database system it is often necessary to allocate data in fragmented, replicated and decentralized form [9]. Fragmentation describes how a relation is divided into several parts that are stored at several sites. A relation can be fragmented horizontally, vertically, or in a mixed fashion [9]. Replication means that copies of the same data are stored at several sites. A copy may be a fragment of a relation or a whole relation. Many problems with data update operations have been observed when data is replicated [9]. The term decentralized database refers to the distribution of data over a LAN/WAN environment, where the relation is distributed or stored at different sites [9].
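To make the three forms of fragmentation concrete, the following minimal Python sketch splits a small in-memory relation. The relation, its attribute names (bill_no, employee, city, amount), the sample rows and the helper functions are hypothetical and chosen only for illustration; they are not taken from the case study database or from [9].

```python
# Hypothetical relation: each row is a dict and "bill_no" acts as the key.
BILLS = [
    {"bill_no": 1, "employee": "Asha", "city": "Hyderabad", "amount": 1200},
    {"bill_no": 2, "employee": "Ravi", "city": "Mumbai", "amount": 800},
    {"bill_no": 3, "employee": "Asha", "city": "Hyderabad", "amount": 450},
]


def horizontal_fragment(relation, predicate):
    """Keep whole rows that satisfy the predicate (a subset of tuples)."""
    return [row for row in relation if predicate(row)]


def vertical_fragment(relation, key, attributes):
    """Keep a subset of attributes; the key is retained so rows can be rejoined."""
    return [{name: row[name] for name in (key, *attributes)} for row in relation]


def mixed_fragment(relation, predicate, key, attributes):
    """Horizontal selection followed by vertical projection."""
    return vertical_fragment(horizontal_fragment(relation, predicate), key, attributes)


# Horizontal: rows are split by a predicate on "city".
hyderabad = horizontal_fragment(BILLS, lambda r: r["city"] == "Hyderabad")
elsewhere = horizontal_fragment(BILLS, lambda r: r["city"] != "Hyderabad")

# Vertical: only the key and the "amount" column are kept.
amounts = vertical_fragment(BILLS, "bill_no", ["amount"])

# Mixed: Hyderabad rows, restricted to key and amount.
mixed = mixed_fragment(BILLS, lambda r: r["city"] == "Hyderabad", "bill_no", ["amount"])

# The union of the horizontal fragments reconstructs the original relation.
rebuilt = sorted(hyderabad + elsewhere, key=lambda r: r["bill_no"])
```

Because the two predicates are complementary, every row lands in exactly one horizontal fragment, which is why the union of the fragments gives back the original relation.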

Various approaches [10] [11] have been proposed for database partitioning and fragment allocation in distributed databases. Distributed database design is used to enhance the performance of applications by minimizing the irrelevant data accessed by the different applications and by minimizing the cost of transferring data when the applications are processed at different sites [12].

This dissertation focuses on strategies [12] for propagating data over the network between sites within one organization or across several organizations. These strategies are based on fragmentation [1]. Fragmentation is applied to a relational database schema in the form of horizontal fragmentation and vertical fragmentation [1]. The main advantage of introducing fragmentation into the distributed database system architecture is that data can be placed close to where it is used, which helps to reduce the transmission cost and also the size of the relations involved in user queries [1].

1.1.1 Contribution of the thesis

The first contribution of this thesis is an investigation of algorithms for database fragmentation, using a framework for the comparative study of the techniques proposed by different researchers, which is explained in detail in section 5.1. The second contribution is the design of the architecture for, and the implementation of, the customized ISUD technique, which is taken from [24] and explained in detail in section 4. The main contribution is the proposed 5-layer architecture, which adds a new feature: the creation of an individual ALP table for each site. The technique in [24] emphasizes only the summarized total cost of attribute locality precedence (ALP) over all sites. A detailed explanation can be found in sections 5.2.1 and 5.2.2.
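The difference between a summarized total and individual per-site tables can be illustrated with a small Python sketch. It is not the ALP computation of [24]: the operation weights, the site and attribute names and the frequencies below are assumptions made purely for illustration, and the score is a simple weighted sum of ISUD frequencies.

```python
from collections import defaultdict

# Assumed operation weights; [24] and this thesis define their own cost model.
WEIGHTS = {"insert": 2, "select": 1, "update": 3, "delete": 2}

# Hypothetical ISUD frequencies: (site, attribute) -> operation counts.
ISUD = {
    ("site1", "city"): {"insert": 5, "select": 40, "update": 2, "delete": 1},
    ("site1", "employee"): {"insert": 1, "select": 10, "update": 0, "delete": 0},
    ("site2", "city"): {"insert": 0, "select": 5, "update": 0, "delete": 0},
    ("site2", "employee"): {"insert": 8, "select": 30, "update": 6, "delete": 2},
}


def weighted_score(frequencies):
    """Weighted sum of the ISUD frequencies of one attribute at one site."""
    return sum(WEIGHTS[op] * count for op, count in frequencies.items())


def per_site_table(isud):
    """One score table per site: site -> attribute -> score."""
    table = defaultdict(dict)
    for (site, attribute), frequencies in isud.items():
        table[site][attribute] = weighted_score(frequencies)
    return dict(table)


def total_table(site_table):
    """Summarized table: attribute -> score added up over all sites."""
    totals = defaultdict(int)
    for scores in site_table.values():
        for attribute, score in scores.items():
            totals[attribute] += score
    return dict(totals)


def top_attribute(scores):
    """Attribute with the highest score in one table."""
    return max(scores, key=scores.get)


per_site = per_site_table(ISUD)
total = total_table(per_site)
top_per_site = {site: top_attribute(scores) for site, scores in per_site.items()}
top_overall = top_attribute(total)
```

In this made-up data the summarized table favours employee, while the per-site view shows that site1 is dominated by accesses to city, which is the kind of information a summarized table alone would hide.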

1.2 Case Study

1.2.1 Swedish Board of Agriculture

This research work relates to the Swedish Board of Agriculture, which has a centralized database system that provides the infrastructure for end-users to access data remotely from all over Sweden. End-users in the organization can access information in the centralized database system over a restricted internet connection (extranet) by using an internet authentication service (IAS). The centralized database is developed in Oracle for a homogeneous environment across the different sites. Many resources in the organization are devoted to maintaining the centralized data, disseminating it within the organization and providing access to different services in response to end-user queries, so the cost is high as far as quality assurance is concerned. The centralized database system contains data that come from different relational databases, such as the customer database and the administrative/employee database.

1.2.2 Case Study for Testing Purpose or for Evaluation of Proposed Technique

To test the technique developed in this thesis, a separate case study was set up. It uses an information system previously developed for Bharat Transport Service, an Indian logistics company situated in Hyderabad, India. The software comprises several applications, such as vehicle billing information, daily loading reports, vehicle payment details and the generation of different reports. Only the billing information application was used to test our technique. This application has many functions: it retrieves data according to selected bill numbers and employee names, saves information in the database, and updates and deletes information. The application thus uses DML (Data Manipulation Language) operations such as create, update, delete and select. Because these DML operations are available, we selected this application to test our technique.

1.3 Purpose/Objectives

1.3.1 Research Question

After an analytical assessment of the literature [5], we found that there are issues in distributed database development that relate to database fragmentation. Within distributed database design architecture, this thesis addresses the following questions.

Q1. Which algorithms exist for uniformly fragmenting the relations in a distributed database?

Q2. How can the architecture of the algorithm selected in Q1 be designed, and how can the proposed algorithmic approach be implemented and tested?

1.3.2 Theoretical Purpose

This dissertation contributes to the field of distributed databases and provides one solution for transforming a traditional centralized database system into a distributed database system. The theoretical purpose of this research is therefore to address data fragmentation problems and to investigate efficient algorithms and techniques for horizontal database fragmentation in a distributed environment.

1.3.3 Practical Purpose

The practical purpose of the research work was to design the architecture of a carefully selected algorithm (described in [24]) in a real-world scenario (the case study of Bharat Transport Service), and to implement and test the proposed algorithmic approach. The result helps the database administrator or end-users to make proper fragmentation decisions at the initial stage of distributed database system design by using the ISUD (Insert, Select, Update, Delete) matrix table, which is shown in detail in section 4.

1.3.4 Assumption

The assumptions concern things that were already in place before this research work began:

1. The database of the case study used in this research work had already been created before the distributed database system architecture was developed for testing.

2. The techniques discussed in this research work for fragmenting the database, which support the creation of distributed data in distributed environments, are taken from the literature review.

1.4 Limitations

The limitations define the scope of the study. They identify the boundaries and functionality covered by this research work.

1. Our research work focuses on implementing the algorithm from [24] for a distributed database using the horizontal fragmentation technique.

2. We are not concerned with vertical fragmentation or mixed fragmentation.

3. We are also not concerned with the allocation of data to different sites in the distributed environment.

1.5 Thesis outline

The first chapter introduces the research work, its influence and purpose, and identifies the problems, assumptions and limitations of the research work. The second chapter presents previous approaches, techniques and strategies for fragmentation in distributed databases. The third chapter describes the methodology, i.e. how the research work was conducted and how fragmentation was implemented in a distributed database architecture. The fourth chapter describes the design and implementation of the algorithm using the horizontal fragmentation technique. The fifth chapter presents the results and their analysis. The sixth chapter presents the conclusions and future work.

2 Theoretical Background

A distributed database is based on different architecture layers, which describe a logical collection of data from interlinked databases [2]. Before going into a detailed discussion, we need to know the fundamentals of distributed database theory. Below are some basic definitions related to database management systems.

A database “is a collection of data, typically containing the information about one or more related organizations” [33, p. 11].

A database management system (DBMS) “is a software package designed to store and manage databases” [33, p. 11].

A data model “is a collection of concepts for describing data. Data model in database vs. type system in programming language” [33, p. 12].

A schema “is a description of a particular collection of data, using the given data model. Schemas in database vs. types in programming language” [33, p. 12].

There are different types of data model [33]. The models are listed below with example systems:

- Relational data model (the most commonly used): relational database systems, e.g. SQL Server, Oracle, Sybase.

- Object-oriented data model: ObjectStore, O2.

- Object-relational data model: UniSQL, Informix Universal Server.

- Semi-structured data model: XML [33].

2.1 General Description of Distributed Database

In real-world scenarios, many people need to access a company's databases: employees, customers, potential customers, vendors and suppliers of any kind. Until now, companies have been able to keep their databases concentrated at a single server site that is accessed worldwide by means of telecommunication networks and the internet [5]. With a centralized database system, companies have been able to disseminate data within the organization in a very structured manner. However, because of new business needs and demands and the adoption of new database architectures for scalability, they need new ways to propagate data over distributed locations. There are many benefits of using a distributed database system, as explained in the following sections. There are also associated complexities, some of which are described in sections 2.3 and 2.4.

2.1.1 What is a Distributed Database System?

Different authors have given several different definitions of a distributed database system (DDS). A basic and generic definition is: a distributed database is a “collection of multiple, logically interrelated databases distributed over a computer network” [1, p. 3]. A distributed database management system (DDBMS) is defined as the “software system that permits the management of the DDS and makes the distribution transparent to the users” [1, p. 3].

2.1.2 Application of Distributed database technology

Many advantages, from different perspectives, have been listed for distributed database systems (DDBSs). The following sections describe some fundamental promises of DDBSs as given by Tamer Özsu [1].

2.1.2.1 Transparent Management of Distributed Systems

Distributed database technology is intended to extend the concept of data independence to environments in which data is distributed and replicated over a number of machines connected by a network [13]. This is provided by several forms of transparency: network (and, therefore, distribution) transparency, replication transparency, and fragmentation transparency. Transparent access to data separates a system’s higher-level semantics from lower-level implementation issues [13].

A transparent system hides the implementation details from its users. The main benefit of a transparent DBMS is the support it gives to the development of complex applications. This can be explained with an example adapted from Tamer Özsu [1].

Consider Jönköping University, which has different schools such as the Engineering School (JTH), Jönköping International Business School (JIBS) and the Health Science School. The university runs projects at each office site and maintains a database of its employees, program information and related data. Assuming that the database is relational, this information can be stored in two relations, EMP(ENO, ENAME, TITLE) and PROG(PNO, PNAME, PROGDETAIL). We add a third relation, SAL(TITLE, AMT), to store the salary information of employees, and a fourth relation, ASG(ENO, PNO, RESP, DUR), which records which employees are assigned to which program, for what duration and with what responsibility. If this data is stored in a centralized DBMS and we want to find the names and salaries of the employees who worked on a project for more than 6 months, we would use the following SQL query [1].

Example

    SELECT ENAME, AMT
    FROM EMP, ASG, SAL
    WHERE ASG.DUR > 6
      AND EMP.ENO = ASG.ENO
      AND SAL.TITLE = EMP.TITLE

The query above retrieves its result from a centralized database system using the tables (relations) named in the FROM clause, and the way the data is stored is transparent to the user. The centralized university database can be made distributed by localizing data, so that the data of JTH employees is stored at the JTH office, the data of JIBS employees is stored at the JIBS office, and so on. The same can be applied to the other relations holding program and salary information. What we intend to do here is to partition the relations and store each partition at a different site, which is known as fragmentation. Fully transparent access means that the user can use the same query as in the above example without any concern about fragmentation or the location of data, relying on the system to resolve these issues [1].
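The idea of running the same global query over fragmented data can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with a single in-memory database, so the two "sites" are simulated by two tables, and the fragment names EMP_JTH and EMP_JIBS as well as all sample rows are hypothetical. A view named EMP plays the role that a distributed DBMS would play when it hides the fragments from the user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# EMP is split into two horizontal fragments (hypothetical names); the view
# EMP reconstructs the global relation as the union of its fragments.
conn.executescript("""
CREATE TABLE EMP_JTH  (ENO TEXT, ENAME TEXT, TITLE TEXT);
CREATE TABLE EMP_JIBS (ENO TEXT, ENAME TEXT, TITLE TEXT);
CREATE TABLE ASG (ENO TEXT, PNO TEXT, RESP TEXT, DUR INTEGER);
CREATE TABLE SAL (TITLE TEXT, AMT INTEGER);
CREATE VIEW EMP AS
    SELECT * FROM EMP_JTH
    UNION ALL
    SELECT * FROM EMP_JIBS;
""")

# Hypothetical sample data, for illustration only.
conn.executemany("INSERT INTO EMP_JTH VALUES (?, ?, ?)",
                 [("E1", "Anna", "Lecturer"), ("E3", "Lars", "Lecturer")])
conn.executemany("INSERT INTO EMP_JIBS VALUES (?, ?, ?)",
                 [("E2", "Erik", "Professor")])
conn.executemany("INSERT INTO ASG VALUES (?, ?, ?, ?)",
                 [("E1", "P1", "Teacher", 12),
                  ("E2", "P2", "Manager", 8),
                  ("E3", "P1", "Assistant", 3)])
conn.executemany("INSERT INTO SAL VALUES (?, ?)",
                 [("Lecturer", 40000), ("Professor", 60000)])

# The user writes exactly the query from the centralized example.
GLOBAL_QUERY = """
SELECT ENAME, AMT
FROM EMP, ASG, SAL
WHERE ASG.DUR > 6
  AND EMP.ENO = ASG.ENO
  AND SAL.TITLE = EMP.TITLE
"""
rows = sorted(conn.execute(GLOBAL_QUERY).fetchall())
print(rows)
```

The query text does not mention EMP_JTH or EMP_JIBS, which is the point of fragmentation transparency; in a real DDBMS the fragments would reside at different sites rather than in one database file.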

Tamer Özsu [1] explains different types of transparency in a distributed environment, such as fragmentation transparency, network transparency and replication transparency. As our research work is concerned with fragmentation, we explain fragmentation transparency below.

2.1.2.2 Fragmentation Transparency

The form of transparency most relevant to this work is fragmentation transparency in a distributed database system. With our proposed technique in chapter 4 we justify that it is possible to fragment a relation horizontally into smaller fragments and to treat each fragment as a separate database or relation. The motive for fragmentation is to increase performance, availability and reliability [1]. Generally, fragmentation is of two types: horizontal fragmentation (HF) and vertical fragmentation (VF). In HF, a relation is divided into sub-relations, each of which holds a subset of the rows (tuples) of the original relation, whereas in VF a relation is divided into sub-relations, each of which is defined on a subset of the columns (attributes) of the original relation.
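The difference between the two types can be sketched in a few lines of Python. This is a minimal illustration, not the technique proposed in chapter 4: relations are modelled as lists of dictionaries, the sample rows are hypothetical, and the helper names horizontal_fragments and vertical_fragment are introduced here only for the example.

```python
# Hypothetical instance of EMP(ENO, ENAME, TITLE).
EMP = [
    {"ENO": "E1", "ENAME": "Anna", "TITLE": "Lecturer"},
    {"ENO": "E2", "ENAME": "Erik", "TITLE": "Professor"},
    {"ENO": "E3", "ENAME": "Lars", "TITLE": "Lecturer"},
]

def horizontal_fragments(relation, predicates):
    """HF: each fragment holds the subset of rows satisfying one predicate."""
    return [[row for row in relation if pred(row)] for pred in predicates]

def vertical_fragment(relation, key, attributes):
    """VF: each fragment holds a subset of columns plus the key attribute."""
    return [{a: row[a] for a in (key, *attributes)} for row in relation]

# Horizontal: lecturers in one fragment, everyone else in the other.
hf = horizontal_fragments(
    EMP,
    [lambda r: r["TITLE"] == "Lecturer", lambda r: r["TITLE"] != "Lecturer"],
)

# Vertical: names in one fragment, titles in the other, key ENO in both.
vf_names = vertical_fragment(EMP, "ENO", ["ENAME"])
vf_titles = vertical_fragment(EMP, "ENO", ["TITLE"])

# Reconstruction: union for HF, join on the key for VF.
hf_union = sorted(hf[0] + hf[1], key=lambda r: r["ENO"])
vf_join = [{**a, **b} for a in vf_names for b in vf_titles if a["ENO"] == b["ENO"]]

print(hf_union == EMP, vf_join == EMP)
```

Both reconstructions return the original relation, which is the "no loss of information" requirement that fragmentation has to satisfy.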

When the relations of a database are fragmented, user queries must be handled in terms of the sub-relations. This is done by finding a query processing strategy based on fragments rather than on relations [1]. In other words, a global query is translated into several fragment queries. Fragmentation transparency is therefore essentially an issue of query processing [1].

2.1.2.3 Availability and Reliability

Availability can be defined as the probability that the system is continuously up during a given time period [12], whereas reliability is defined as the probability that the system is up at a specified point in time [12]. Both improve with a DDBS. In a centralized DBS, if the single site goes down, the entire system goes down, whereas in a DDBS a failure affects only the site that is down; the other sites and the rest of the system are not affected. If data is replicated at different sites, the effect of a site failure is reduced further [12].

2.1.2.4 Improved Performance

If a very large database is distributed over a number of sites, the local subset of the database at each site is much smaller, which reduces the amount of data each transaction has to handle and its processing time. Response time also improves for transactions that access more than one site, because their processing can be performed in parallel [12].

2.2 Distributed Database Architecture

A distributed database system allows applications to access data from local and remote databases. In a homogeneous distributed system, every site runs the same type of database. In a heterogeneous distributed system, at least one of the databases is of a different type.

2.2.1 Architectural Models for Distributed database system

A DDBMS can be constructed in several ways. The alternatives are classified along three dimensions: (1) autonomy, (2) distribution and (3) heterogeneity, as shown in figure 1 [1].

Figure 1: Database Management System Implementation Alternatives[1]

(1) Autonomy. It refers to the distribution of control, not of data, and indicates the degree to which individual DBMSs can operate independently [1]. Autonomy is a function of factors such as whether the component systems exchange information, whether they can independently execute transactions, and whether they are allowed to be modified. The requirements of an autonomous system have been specified as follows [1].

According to Gligor and Popescu-Zeletin [1]:

(i) Local operations are not affected by participation in the global multidatabase system.

(ii) Local query processing and optimization are not affected by the execution of global queries.

(iii) System consistency is not compromised when databases are added to or removed from the global database.

According to Du and Elmagarmid [1][13]:

(i) Design autonomy: each database is free to use the data models and transaction management techniques it prefers.

(ii) Communication autonomy: each database decides for itself what information to provide to the other databases.

(iii) Execution autonomy: each DBMS can execute the transactions submitted to it in any way it wants.

Autonomy can be classified as follows.

- Tight integration: a single image of the entire database is available to all users who want to share the information.

- Semiautonomous systems: they consist of DBMSs that determine which part of their database is shared and that have been modified so that they can exchange information with each other.

(2) Distribution. It refers to the physical distribution of data and of different software components over multiple sites, while the user sees the data transparently as one logical pool [1]. Distribution can be divided into two classes: client/server distribution and peer-to-peer distribution [1].

Client/server distribution: data management services are provided at the server side, where the data is primarily stored, while the clients generate requests and obtain the data whenever needed [1].

Peer-to-peer distribution: the data is fully distributed and there is no distinction between clients and servers. Every machine has full DBMS functionality and can communicate with other machines to execute queries and transactions. Each DBS at a site maintains a portion of the database [1].

(3) Heterogeneity. It occurs in various forms in distributed systems, such as differences in hardware, communication protocols and operating systems. With respect to databases, it concerns differences in data models, data formats, query languages and transaction management algorithms. Accessing other, remote DBSs therefore requires conversions [1].

2.3 Unsolved problems in DDBS

2.3.1 Distribution design

Distributed database design methodology varies depending on the system architecture. For tightly integrated distributed databases, the design proceeds top-down, from requirements analysis and logical design of the global database to the physical design of each local database [13]. For distributed multi-database systems, the design process is bottom-up and involves the integration of existing databases [13].

The step of interest in the top-down process is distribution design, described in [13], which involves designing the local conceptual schemas by distributing the global entities over the sites of the distributed system. The global entities are specified within the global conceptual schema. In the relational model, both global and local entities are relations, so distribution design maps global relations to local ones [13]. One of the most important research issues that require attention is the development of a practical distribution design methodology and its integration into the general data-modeling process [13].

The two main aspects of distribution design are fragmentation and allocation. In fragmentation, each global relation is partitioned into a set of fragment relations [13], whereas allocation concerns the (possibly replicated) distribution of these local relations across the sites of the distributed system [13]. Research on fragmentation has focused on horizontal (i.e. selecting) and vertical (i.e. projecting) fragmentation of global relations [13]. Many algorithms based on mathematical optimization formulations have been proposed for allocation [13]. There is no underlying design methodology that combines the fragmentation and allocation techniques; they are typically treated independently.

2.3.2 Network scaling problems

The database community does not yet have a full understanding of the performance implications of all the design alternatives for distributed DBMSs [13]. In particular, questions have been raised about the scalability of some protocols and algorithms as systems become geographically distributed or as the number of system components increases [13]. Of specific concern is the suitability of distributed transaction-processing mechanisms (two-phase locking, 2PL, and particularly the two-phase commit protocol, 2PC) for distributed database systems based on wide area networks [13]. These protocols carry an overhead, and implementing them over a slow wide area network may pose difficulties [13].
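To give a feel for the overhead mentioned above, the following sketch counts the messages of a basic, failure-free two-phase commit round. It is a simplified illustration and not taken from [13]: it assumes one coordinator, a decision message sent to every participant, and messages to different participants travelling in parallel. The function names and the 50 ms delay are hypothetical.

```python
def two_phase_commit(votes):
    """Simulate one basic 2PC round.

    votes maps a participant name to True (ready to commit) or False (abort).
    Returns the global decision and the total number of messages exchanged.
    """
    n = len(votes)
    messages = 0

    # Phase 1: coordinator sends PREPARE, every participant replies with a vote.
    messages += n  # PREPARE messages
    messages += n  # vote messages

    decision = "COMMIT" if all(votes.values()) else "ABORT"

    # Phase 2: coordinator sends the decision, every participant acknowledges.
    messages += n  # decision messages
    messages += n  # acknowledgement messages

    return decision, messages


def commit_latency_ms(one_way_delay_ms):
    """Two sequential round trips, assuming participants are contacted in parallel."""
    return 4 * one_way_delay_ms


print(two_phase_commit({"site1": True, "site2": True, "site3": True}))
print(commit_latency_ms(50))
```

Under these assumptions a transaction spanning N sites costs 4N messages and two network round trips before it completes, which is negligible on a local network but becomes noticeable when each one-way hop over a wide area network takes tens of milliseconds.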

2.4 Distribution Design Problems

For distributed databases, fragmentation and allocation are the major problems of database distribution design. Current research often applies design methods such as mathematical programming in order to minimize the cost of storing the database, of processing transactions against it, and of communication [28]. In practice it is very difficult to study database distribution design together with other problems, because each problem is difficult to study in its own right.

2.4.1 The Complexity of the Problems

The combined problem of fragmentation and allocation has been proven to be hard [28]. Fragmentation and allocation are distribution design techniques used to improve system performance, and each of them has a massive search space for the best solution.

Because of the complexity of the fragmentation and allocation problems, allocation is treated independently of fragmentation [28]. In the literature, most allocation methods assume that fragmentation has already been done, so that the output of fragmentation becomes the input to allocation. Separating fragmentation from allocation simplifies the formulation of the problem by reducing the decision space, although the separation itself contributes to the complexity of the allocation models [28]. Both steps take information about user applications as input and aim to improve system performance; they differ only in that fragmentation works on the global database schema while allocation works on fragments. As a result, the application information and the relationships between fragments need to be specified again when doing allocation [28]. It would be worthwhile to develop a methodology that takes the interdependence of fragmentation and allocation into account [28].

2.4.2 Interdependencies with Query Optimization

Designing distributed database systems is a complex task because many other issues are involved, such as query processing and optimization, data replication, concurrency control, directory management, reliability and recovery [28]. Of these problems, query processing and optimization is closely interrelated with fragmentation and allocation. Query optimization in distributed systems depends on how data is fragmented and allocated, since query processing determines the sequence of operations of a query and assigns those operations to sites according to the allocation of the fragments [28].

2.4.3 Improvised Solution for the problems mentioned

In the literature, researchers have used the following simplifications to reduce the complexity of the problem and to make it more tractable.

- Fragmentation and allocation are mostly treated separately, as two different steps. Fragmentation is performed first, without considering how the resulting fragments will be allocated, while allocation is performed on the assumption that fragmentation has already been decided [28]. Likewise, allocation is studied on the assumption that a fixed query optimization method is used to generate the processing schedule [28], while query optimization is studied on the assumption of a fixed data allocation [28].

- Both a simple query environment and a query site strategy are assumed when studying allocation. Under the first assumption, network information is not considered [28]. Under the second assumption, queries that need to be processed in a distributed way are not considered; therefore query trees are not used and the allocation of intermediate nodes is not considered [28].

- Query optimization is disregarded when studying allocation. A realistic fragment allocation can only be achieved when distributed query optimization is performed after fragmentation [28].

The literature also proposes other ad hoc solutions that make overall system design manageable by ignoring the interdependencies between the individual problems. This, however, makes these approaches ineffective at obtaining an optimal database distribution design [28].

2.5 Initial Design Approach for Distributed Database Design

One of the prime tasks of this research work is to investigate and develop a fragmentation technique for a distributed database environment that is used to manage data from various locations. We chose the top-down design approach for database fragmentation at the initial stage of the design. A framework for this process is shown in figure 2 [3] [5].

The top-down approach is used frequently in different areas of computer science. The top-down design process consists of the stages required for designing a distributed database. These stages pass information from one level to the next in an incremental style in order to construct a homogeneous distributed database system from scratch [5].

[Figure: flow of the stages Requirements analysis, Conceptual project, Logical project, Distribution project and Physical project, with user input, integration and correction steps]

Figure 2: Stages of the top-down approach in distributed databases [3] [5]

The stages of the top-down approach in distributed databases are described below.

2.5.1 Requirements analysis

In this stage, information about the data, restrictions and relationships within the organization is collected. Requirements analysis is carried out through meetings with the users, in which it is observed how the organization operates. At the end of the analysis, a requirements specification document is created.

2.5.2 Conceptual project

At this level, the data and its relationships are modeled independently of the structure that will represent the distributed database system (conceptual modeling). The conceptual project is carried out by analyzing the requirements specification.

The result of the conceptual project is a conceptual schema with its data integrity restrictions.

2.5.3 Logical project

At this level, the conceptual schema is converted into the schema that represents the distributed database system, i.e. the logical schema. This is done by applying conversion rules that translate the conceptual schema into the relational model of the distributed database. The result of the logical project is a logical schema with tables, stored procedures, views, access authorizations, etc. [5].

2.5.4 Distribution project

At this level, it is decided how the data and programs are to be fragmented and allocated across the nodes of the computer network. In a few cases the network itself is designed and built to satisfy the needs of the distributed database project. This level is said to be the most critical and important one in a distributed database project. To support this phase of the top-down approach, we relate it to five generic steps for data distribution with respect to fragmentation and allocation in a distributed environment, which are explained in detail in section 2.8.

2.5.5 Physical project

At this level, the logical schema is implemented in a DDS that suits the data model. The physical project is carried out by means of SQL statements. The result is a physical schema that follows the decisions made in the distribution project. After the physical project has been completed for each node of the computer network, the distributed database is ready for use. A monitoring process is set up to detect errors. Such errors form the system feedback and are sent to the people responsible for the construction of the distributed database [5].

2.6 Fragmentation in Distributed Database Design

Fragmentation: “Fragmentation is a design technique to divide a single relation or class of a database into two or more partitions such that the combination of the partitions provides the original database without any loss of information” [28, p. 3].

“A fragment i.e. horizontal or vertical of a database object in an object-oriented database system contains subsets of its instance objects (or class extents) reflecting the way applications access the database objects” [34, p. 1].

Distributed processing in a DBMS is an effective way of improving the performance of applications that operate on huge amounts of data [2]. The major goals of distributed database design are to avoid accessing irrelevant data while executing queries and to reduce data exchange among sites. To achieve this, the relations (in a relational DBMS) or the classes (in an object-oriented database) are fragmented, and the fragments are allocated and replicated at different sites of the distributed system, with local optimization at each site.

Fragmentation is a promising design technique that divides a single relation or class in a database schema into two or more partitions such that the combination of the partitions provides the original database without loss of information [28][4]. Horizontal fragmentation (HF) partitions a relation or class into disjoint sets of tuples or instances [2]. Vertical fragmentation (VF) partitions a relation or class into disjoint sets of columns or attributes, except for the primary key, which is repeated in every fragment [2].

Previous HF, VF and mixed fragmentation (MF) techniques have the following problems in common:

- Most of them use the frequency of queries, the affinity of minterm predicates or the attribute affinity matrix (AAM) as the basis of fragmentation. These require sufficient empirical data, which in most cases is not available at the initial stage [24][28].

- Most of them concentrate only on the fragmentation problem and overlook the allocation problem in order to reduce complexity [24].

- Minimizing distributed joins is a fundamental fragmentation issue [3].

- A second fundamental issue relates to semantic data control, specifically to integrity checking [3].

2.6.1 Horizontal Fragmentation

Horizontal fragmentation is of two types: primary and derived. Primary horizontal fragmentation of a relation or class is performed using the predicates of the queries that access that relation or class, while derived horizontal fragmentation of a relation or class is based on the horizontal fragmentation of another relation or class [28].

2.6.1.1 Primary Horizontal Fragmentation for Relational Databases

Primary horizontal fragmentation is defined in the context of the relational data model. Among the existing approaches, horizontal fragmentation using minterm predicates was first proposed by Ceri et al. in 1982 [29].

Minterm-predicate-based approaches: “which perform primary horizontal fragmentation using a set of minterm predicates”, e.g. [28, p. 11] [29].

Later, [24] [30] proposed a technique based on the attribute usage matrix (AUM) for vertical fragmentation.

Affinity-based approaches: “which first group predicates according to predicate

affinities and then perform primary horizontal fragmentation using conjunctions of

the grouped predicates, e.g., [28] [30]. The way of grouping predicates is either

graph-based or using an objective function [28] [30]”.

From the literature [28] we take a few definitions related to minterm predicates, as follows.

Definition 1: “For a given relation $R = \{A_1 : D_1, \dots, A_n : D_n\}$, a simple predicate is in the form of $p_k : A_i \;\theta\; \mathit{Value}$, with $A_i$ as an attribute defined over $D_i$, $\theta \in \{=, <, \neq, \leq, >, \geq\}$ and $\mathit{Value} \in D_i$” [28].

Definition 2. “Minterm predicates $M = \{m_1, m_2, \dots, m_z\}$ over a set $Pr$ of simple predicates are the conjunctions of simple predicates and their negations: $M = \{\, m_j \mid m_j = \bigwedge_{p_k \in Pr} p_k^{*} \,\}$, $k = 1, \dots, m$, $j = 1, \dots, z$, where $p_k^{*} = p_k$ or $p_k^{*} = \neg p_k$. Note that all simple predicates in $Pr$ appear (positively or negatively) in each minterm predicate” [28, p. 12].

Definition 3. “A set of simple predicates $Pr$ is said to be complete if and only if there is an equal probability of access by every application to any tuple belonging to any fragment that is defined according to $Pr$” [28, p. 12].
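Definitions 1 and 2 can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than the algorithm of [28] or [29]: the relation is a cut-down version of the project relation used later in this chapter, the two simple predicates p1 and p2 are chosen only for the example, and the function name minterm_fragments is hypothetical. Every combination of a simple predicate or its negation forms one minterm predicate, and minterms that select no tuples are dropped.

```python
from itertools import product

# Cut-down project relation: only the attributes needed for the example.
PROJ = [
    {"Project_No": "P1", "Location": "Jönköping"},
    {"Project_No": "P2", "Location": "Stockholm"},
    {"Project_No": "P3", "Location": "Göteborg"},
    {"Project_No": "P4", "Location": "Växjö"},
]

# Simple predicates of the form "attribute theta value" (assumed for the example).
SIMPLE_PREDICATES = {
    "p1": lambda t: t["Location"] == "Jönköping",
    "p2": lambda t: t["Location"] == "Stockholm",
}

def minterm_fragments(relation, simple_predicates):
    """Build one fragment per satisfiable minterm predicate."""
    names = list(simple_predicates)
    fragments = {}
    # Each simple predicate appears positively (True) or negated (False).
    for signs in product([True, False], repeat=len(names)):
        label = " AND ".join(n if s else f"NOT {n}" for n, s in zip(names, signs))
        rows = [
            t for t in relation
            if all(simple_predicates[n](t) == s for n, s in zip(names, signs))
        ]
        if rows:  # minterms that select no tuples produce no fragment
            fragments[label] = rows
    return fragments

fragments = minterm_fragments(PROJ, SIMPLE_PREDICATES)
for label, rows in fragments.items():
    print(label, [t["Project_No"] for t in rows])
```

With two simple predicates there are four minterm predicates; "p1 AND p2" is contradictory here and yields no fragment, so three fragments remain. Each tuple satisfies exactly one minterm predicate, so the fragments are disjoint and together cover the relation.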

The use of minterm predicates for horizontal fragmentation was first proposed by Ceri and Pelagatti in 1982 [28], with files fragmented horizontally to optimize the frequency of accesses made by application programs at different sites. According to that work, minterm fragments contain records that are accessed homogeneously by all transactions, and they are therefore used as the proper units of allocation.

Several researchers have adapted affinity-based vertical fragmentation algorithms to horizontal fragmentation. Because checking the completeness of the set of simple predicates used for horizontal fragmentation is complex, Zhang [28] adapted an affinity-based vertical fragmentation approach to horizontal fragmentation. This approach takes the predicate usage matrix and the predicate affinity matrix as input and employs the bond energy algorithm to cluster predicates. However, the fragments in the resulting fragmentation schema may overlap and therefore cannot satisfy the correctness criteria of fragmentation.
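As a rough illustration of the two input matrices, the sketch below derives a predicate affinity matrix from a predicate usage matrix and query frequencies. This is an assumption made by analogy with the way attribute affinity is commonly derived from attribute usage in vertical fragmentation; the exact definition used in [28] may differ. The matrix values, the frequencies and the function name predicate_affinity are hypothetical, and the clustering step with the bond energy algorithm is not shown.

```python
def predicate_affinity(usage, frequency):
    """Affinity of predicates i and j = total frequency of queries using both.

    usage[q][i] is 1 if query q uses predicate i, else 0.
    frequency[q] is the access frequency of query q.
    """
    n = len(usage[0])
    affinity = [[0] * n for _ in range(n)]
    for row, freq in zip(usage, frequency):
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    affinity[i][j] += freq
    return affinity


# Hypothetical input: three queries over three simple predicates p0, p1, p2.
USAGE = [
    [1, 1, 0],  # q0 uses p0 and p1
    [0, 1, 1],  # q1 uses p1 and p2
    [1, 0, 0],  # q2 uses p0 only
]
FREQUENCY = [10, 5, 20]

affinity = predicate_affinity(USAGE, FREQUENCY)
for row in affinity:
    print(row)
```

Predicates with high mutual affinity, such as p0 and p1 in this example, are the ones a clustering step would tend to place in the same group.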

2.6.1.2 Derived Horizontal Fragmentation

In the relational data model, derived fragmentation is a form of horizontal fragmentation. Derived horizontal fragmentation splits a relation according to the fragmentation of another relation by applying semi-join operations [28].

The dependence between relations is expressed as a binary relationship between them. The link is based on an equi-join operation and represents a one-to-many relationship [28]. When more than one fragmentation is possible for a relation, [28] suggests two criteria: choose the fragmentation with better join characteristics, or choose the fragmentation used in more applications [28]. Derived horizontal fragmentation is explained below with an example. Consider the relations Employee, Assignment, Projects and Salary. Each relation has its own primary key, which is used to select records according to predicate values.

Relations:

Employee: Employee ID, Employee Name, Title

Assignment: Employee Number, Project No, Duration

Projects: Project No, Project Name, Budget, Location

Salary: Title, Salary

The underlined attributes above are the primary key attributes of the relations. The foreign key relationships are as follows:

Employee.Employee_ID → Assignment.Employee_No

Projects.Project_No → Assignment.Project_No

Salary.Title → Employee.Title

Derived horizontal fragmentation fragments a relation S based on the fragmentation of another relation R, where R is already fragmented into R1, R2, R3, ..., Rn. Using the semi-join operator, the fragments of S are defined as

$$S_i = S \ltimes R_i = S \ltimes \sigma_{p_i}(R) = \pi_{S.*}\left(S \bowtie \sigma_{p_i}(R)\right)$$

where the fragmentation expression $p_i$ only refers to R. The following example shows the mechanism of derived horizontal fragmentation: a relation is divided into further relations that depend on the primary horizontal fragments of another relation.

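Project S1

| Project_No | Project_Name          | Budget  | Location  |
|------------|-----------------------|---------|-----------|
| P1         | Database Development  | 150.000 | Jönköping |
| P2         | Ontology based Portal | 200.000 | Stockholm |

Table 1: Project S1

Project S2

| Project_No | Project_Name    | Budget  | Location |
|------------|-----------------|---------|----------|
| P3         | Web Development | 250.000 | Göteborg |
| P4         | Maintenance     | 100.000 | Växjö    |

Table 2: Project S2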

Similarly, the Assignment relation is fragmented into Assignment S1 and Assignment S2 by semi-joining it with the two Project fragments:

Assignment S1 = Assignment ⋉ Project S1

Assignment S2 = Assignment ⋉ Project S2

Assignment

| Employee_No | Project_No | Duration |
|-------------|------------|----------|
| E1          | P1         | 5        |
| E2          | P4         | 4        |
| E2          | P1         | 3        |
| E3          | P4         | 5        |
| E4          | P1         | 4        |
| E4          | P3         | 5        |
| E5          | P2         | 7        |

Assignment S1

| Employee_No | Project_No | Duration |
|-------------|------------|----------|
| E1          | P1         | 5        |
| E2          | P1         | 3        |
| E4          | P1         | 4        |
| E5          | P2         | 7        |

Assignment S2

| Employee_No | Project_No | Duration |
|-------------|------------|----------|
| E2          | P4         | 4        |
| E3          | P4         | 5        |
| E4          | P3         | 5        |

With the above mechanism of derived horizontal fragmentation, we achieve the desired fragmentation with good join characteristics. The benefit of derived fragmentation is that join operations in a distributed database can efficiently retrieve the desired tuples or records according to the predicate or minterm. In section 2.8 we use a real-world scenario to illustrate the mechanism of fragmentation and allocation in a distributed database system.
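The derived fragmentation of the Assignment relation can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this thesis: the helper name `semi_join` and the dictionary representation of tuples are assumptions, and only the key attribute of the Project fragments is needed for the semi-join.

```python
def semi_join(member, owner_fragment, key):
    """Return the member tuples whose key value occurs in the owner
    fragment, i.e. member semi-joined with owner_fragment on key."""
    owner_keys = {row[key] for row in owner_fragment}
    return [row for row in member if row[key] in owner_keys]


# Primary horizontal fragments of Project (key attribute only).
project_s1 = [{"Project_No": "P1"}, {"Project_No": "P2"}]
project_s2 = [{"Project_No": "P3"}, {"Project_No": "P4"}]

assignment = [
    {"Employee_No": "E1", "Project_No": "P1", "Duration": 5},
    {"Employee_No": "E2", "Project_No": "P4", "Duration": 4},
    {"Employee_No": "E2", "Project_No": "P1", "Duration": 3},
    {"Employee_No": "E3", "Project_No": "P4", "Duration": 5},
    {"Employee_No": "E4", "Project_No": "P1", "Duration": 4},
    {"Employee_No": "E4", "Project_No": "P3", "Duration": 5},
    {"Employee_No": "E5", "Project_No": "P2", "Duration": 7},
]

# Derived horizontal fragments of Assignment.
assignment_s1 = semi_join(assignment, project_s1, "Project_No")
assignment_s2 = semi_join(assignment, project_s2, "Project_No")

print([(r["Employee_No"], r["Project_No"]) for r in assignment_s1])
print([(r["Employee_No"], r["Project_No"]) for r in assignment_s2])
```

Because every Assignment tuple references exactly one project and the Project fragments are disjoint, the two derived fragments are disjoint as well and together contain all seven tuples.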

2.7 Previous works on Fragmentation in DDBS

The two main techniques of distributed database design are fragmentation and allocation. The database distribution problem has been studied since the 1970s: first as the problem of file distribution, and then as the problem of distributing relations or relation fragments. After the emergence of the object-oriented data model, some existing fragmentation and allocation approaches were adapted to it. To give an overall picture of database distribution design, we present an overview of previous work with respect to horizontal fragmentation and allocation.

In 1999, Ozsu and Valduriez proposed an iterative algorithm called COM_MIN, which generates a complete and minimal set of predicates from a given set of simple predicates [1]. After the minterm predicates are obtained, the access frequencies are defined, and the data is fragmented using the access frequency table as explained by Ozsu.

In 2002, Cheng et al. [28] [31] proposed a genetic algorithm-based clustering approach that takes a predicate matrix as input and treats horizontal fragmentation as a traveling salesman problem (TSP). Horizontal fragmentation is achieved by performing selection operations with the sets of grouped predicates, which are grouped according to their distances. The distance of each pair of attributes measures the access frequencies of the transactions that do not access the two attributes together. Additional analysis is needed to simplify the clusters of predicates. None of the affinity-based horizontal fragmentation approaches takes data locality into consideration while clustering predicates.

In 2004, Baião et al. proposed a technique that takes a predicate affinity matrix as input and builds a predicate affinity graph, which then defines the horizontal class fragments [24].

In 2006, H. Ma and K. D. Schewe proposed a technique that takes an attribute usage frequency matrix (AUFM) as input; vertical fragmentation is performed based on this matrix and a cost model [24]. In 2007, M. Alfares et al. extended the technique of H. Ma by using an attribute affinity matrix (AAM) as input to generate groups based on affinity values [24].

In 2008, Marwa et al. extended the technique of M. Alfares et al. by using an instance request matrix to fragment the data horizontally for object-oriented databases [24] [32]. Their paper introduces a new algorithm for horizontal fragmentation in an Object Oriented Distributed Database System (OODDBS) [32].

In 2009, H. Mahboubi and J. Darmont proposed a technique that uses predicate affinity for horizontal fragmentation (HF) in data warehouses [24] [33]. Their paper addresses XML warehouse fragmentation; it focuses on the initial horizontal fragmentation of the dimensions' XML documents and exploits two alternative algorithms [33].

In the context of our study, a solution is discussed in the research paper by Shahidul Islam Khan and Dr. A. S. M. Latiful Hoque [24], published in 2010, which provides a fragmentation technique that can be applied at the initial stage of the design of a distributed database system. They propose a single algorithm that performs fragmentation and allocation simultaneously, and they state that the technique can be used for the initial fragmentation problem of relational databases in any distributed database system. From the literature review, we found this technique to be the most suitable to implement with respect to the characteristics we were looking for, as shown in table-3 in chapter 5.

2.7.1 Database Fragmentation Technique by Shahidul Islam Khan and Dr. A. S. M. Latiful Hoque

This technique fragments a relation horizontally with the help of the locality precedence of its attributes. “Attribute locality precedence (ALP) can be defined as the value of importance of an attribute with respect to sites of distributed database” [24, p.2]. The following block diagram of their system depicts the development of the fragmentation technique.

[Figure 3: block diagram with the blocks Relation, MCRUD Frequency Matrix, ALP Table, Predicate Set, Fragmented Sub-Relation and Allocation]

Figure 3. Block diagram of the system [24]

The block diagram shows the working pattern of their technique in sequential form. First, a relation that needs to be fragmented is taken from the database; then a modified CRUD (Create, Read, Update, Delete) frequency matrix is created according to the predicates (queries) of the selected relation. “A data-to-location MCRUD matrix is a table of which rows indicate attributes of the entities of a relation and column indicate different locations of the applications” [24, p.2]. It is used by database designers and system analysts during requirement analysis to decide how data is mapped to different locations [24]. We customized the existing Modified Create, Read, Update, and Delete (MCRUD) matrix according to our requirements and named it the Customized Insert, Select, Update, Delete (CISUD) matrix. The reason for customizing the MCRUD matrix into the CISUD matrix is to implement the technique practically in a real-world scenario. The MCRUD technique provides an algorithm and pseudo code to calculate the total ALP value over all three sites; we customized and improved it by calculating the ALP value from the individual sites and by providing an architecture to implement the technique in practice.

2.7.1.1 Fragmentation Allocation algorithm

The algorithm is used to generate the ALP (Attribute Locality Precedence) table, i.e. to calculate the importance of each attribute at a particular location. An overview of the fragmentation and allocation algorithm is given in figure-4. The inputs of the algorithm are the set of sites, the relation of the database that needs to be fragmented and the CISUD matrix of that relation; the output is the set of fragments F1, F2, F3, etc. In step 1, the ALP table is constructed from the CISUD matrix based on the cost functions. In step 2, a predicate set is generated for the attribute with the highest value in the ALP table; this set is then rearranged and used to fragment the relation, and the fragments are allocated to the different sites.

Figure 4: Algorithm for Fragmentation [24]

Figure-4 reproduces the fragmentation algorithm of [24]; the pseudo code for constructing the ALP (Attribute Locality Precedence) table that it relies on is also explained in [24]. Our contribution in this research work is a customization of this MCRUD (Modified Create, Read, Update, and Delete) technique.

“Input: Total number of sites: S = {S1, S2, …, Sn}
       Relation to be fragmented: R
       ISUD matrix: ISUD[R]
Output: Fragments F = {F1, F2, F3, …, Fn}
Step 1: Construct ALP[R] from ISUD[R] based on cost functions
Step 2: For the highest valued attribute of ALP table
   a. Generate predicate set P = {P1, P2, …, Pm}
   b. Rearrange P so that #P = #S
   c. Fragment R using P as selection predicate: $\sigma_p(R),\ p \in P$
   d. Allocate F to S” [24].
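Step 2 of the quoted algorithm can be sketched as follows. This is only an illustration under simplifying assumptions: the ALP values are invented, the function name is hypothetical, and exactly one predicate per site is supplied, so the rearrangement of step 2b (#P = #S) is assumed to have been done already.

```python
def fragment_and_allocate(rows, alp, site_predicates):
    """Step 2 of the quoted algorithm, under simplifying assumptions.

    alp: attribute -> ALP value (assumed to be computed in Step 1).
    site_predicates: attribute -> {site: predicate}, one predicate per site.
    """
    best_attribute = max(alp, key=alp.get)  # highest valued attribute
    allocation = {
        site: [row for row in rows if predicate(row)]  # selection on R
        for site, predicate in site_predicates[best_attribute].items()
    }
    return best_attribute, allocation


projects = [
    {"Project_No": "P1", "Location": "Jönköping"},
    {"Project_No": "P2", "Location": "Stockholm"},
    {"Project_No": "P3", "Location": "Göteborg"},
    {"Project_No": "P4", "Location": "Växjö"},
]
alp = {"Location": 21, "Budget": -4}  # hypothetical ALP values
site_predicates = {
    "Location": {
        "S1": lambda row: row["Location"] in ("Jönköping", "Växjö"),
        "S2": lambda row: row["Location"] == "Stockholm",
        "S3": lambda row: row["Location"] == "Göteborg",
    }
}

best, allocation = fragment_and_allocate(projects, alp, site_predicates)
print(best, {s: [r["Project_No"] for r in f] for s, f in allocation.items()})
```

The attribute with the highest ALP value drives the fragmentation, and each fragment is stored at the site named by its predicate.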

The pseudo code takes the CISUD (Customized Insert, Select, Update, Delete) matrix of the relation to be fragmented as input and produces the ALP table of that relation as output. It consists of nested for loops over attributes, predicates, sites and applications, which calculate the cost of each attribute, i.e. the ALP of the relation [24].

Figure 5: ALP table construction pseudo-code [24]

Input: ISUD matrix of the relation that is to be fragmented
Output: ALP table for that relation

/* fi, fs, fu, fd are the operation frequencies; I, S, U, D are the operation weights */
for (i = 1; i <= TotalAttributes; i++)
{
    for (j = 1; j <= TotalPredicates[i]; j++)
    {
        MAX[i][j] = 0;
        POS[i][j] = 1;
        for (k = 1; k <= TotalSites; k++)
        {
            S[i][j][k] = 0;
            /* sum of all applications' cost of predicate j of attribute i at site k */
            for (r = 1; r <= TotalApplications[k]; r++)
            {
                C[i][j][k][r] = fi*I + fs*S + fu*U + fd*D;
                S[i][j][k] += C[i][j][k][r];
            } // end of fourth loop
            /* find out at which site the cost of predicate j is maximum */
            if (S[i][j][k] > MAX[i][j])
            {
                MAX[i][j] = S[i][j][k];
                POS[i][j] = k;
            }
        } // end of third loop
        /* sum of the costs at all sites other than the maximum-cost site */
        SumOther = 0;
        for (k = 1; k <= TotalSites; k++)
        {
            if (k != POS[i][j])
                SumOther += S[i][j][k];
        }
        /* actual cost for predicate j of attribute i */
        ALPsingle[i][j] = S[i][j][POS[i][j]] - SumOther;
    } // end of second loop
    /* total cost for attribute i (locality precedence) */
    ALP[i] = 0;
    for (j = 1; j <= TotalPredicates[i]; j++)
    {
        ALP[i] += ALPsingle[i][j];
    }
} // end of first loop

The figure above shows the pseudo code of the fragmentation and allocation algorithm as presented in [24]. We chose to test this algorithm against our requirements and to fragment the database accordingly.
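The ALP table construction can also be sketched as a short runnable Python function. This is a minimal sketch rather than the code of [24] or of our prototype: the nested dictionary layout of the ISUD matrix, the function names and all frequency values are assumptions made for illustration, while the weights are the constant weights used in this thesis (I = 2, S = 4, U = 3, D = 1).

```python
# Constant operation weights used in this thesis.
WEIGHTS = {"I": 2, "S": 4, "U": 3, "D": 1}


def application_cost(freq, weights=WEIGHTS):
    """Cost of one application: weighted sum of its ISUD frequencies."""
    return sum(freq.get(op, 0) * weight for op, weight in weights.items())


def alp_table(isud):
    """Compute ALP per attribute.

    isud[attribute][predicate][site] is a list with one frequency
    dictionary per application (a hypothetical layout).
    """
    table = {}
    for attribute, predicates in isud.items():
        total = 0
        for sites in predicates.values():
            # total cost of all applications at each site
            site_cost = {
                site: sum(application_cost(f) for f in apps)
                for site, apps in sites.items()
            }
            # site with the maximum cost, and the cost at all other sites
            best = max(site_cost, key=site_cost.get)
            others = sum(c for s, c in site_cost.items() if s != best)
            total += site_cost[best] - others
        table[attribute] = total
    return table


# Invented frequencies for two attributes and three sites.
isud = {
    "Location": {
        "Location='Stockholm'": {
            "Site1": [{"I": 1, "S": 3, "U": 2, "D": 0}],  # cost 20
            "Site2": [{"S": 2}],                          # cost 8
            "Site3": [{"I": 1, "U": 1}],                  # cost 5
        },
        "Location='Göteborg'": {
            "Site1": [{"S": 1}],                          # cost 4
            "Site2": [{"S": 1}, {"D": 1}],                # cost 5
            "Site3": [{"I": 2, "S": 4, "U": 1}],          # cost 23
        },
    },
    "Budget": {
        "Budget<=150000": {
            "Site1": [{"S": 2}],                          # cost 8
            "Site2": [{"S": 2}],                          # cost 8
            "Site3": [{"S": 1}],                          # cost 4
        },
    },
}

alp = alp_table(isud)
print(alp)  # {'Location': 21, 'Budget': -4}
```

In this invented example the Location attribute obtains the higher ALP value because its accesses are concentrated at one site per predicate, whereas the accesses to Budget are spread evenly over the sites.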

2.7.1.2 Mathematical Measurement of the algorithm

To execute the algorithm, [24] uses a set of mathematical formulas that calculate the ALP cost and that can also be used to test the algorithm under different operational changes. Using these linear combinations, we test the algorithm with different frequencies retrieved from the customized ISUD matrix. Cost is treated as the effort of access and modification of a certain attribute of a relation by an application from a particular site [24]. To calculate the precedence of an attribute of a relation, we take the CISUD matrix of the relation as input together with the following cost functions. Equation (1) calculates the cost of one application as the weighted sum of its operation frequencies, equation (2) calculates the total cost of all applications at a particular site, equation (3) gives the maximum cost among the sites for predicate j of attribute i, equation (4) gives the actual cost for predicate j of attribute i, and equation (5) calculates the total cost of attribute i (i.e. its locality precedence) [24]. All of these equations are implemented in the code of the user interface application. The customized ISUD frequencies are retrieved automatically from the CISUD matrix table with the help of the user interface.

$$C_{i,j,k,r} = f_i I + f_s S + f_u U + f_d D \tag{1}$$

$$S_{i,j,k} = \sum_{r=1}^{A_{i,j,k}} C_{i,j,k,r} \tag{2}$$

$$S_{i,j,m} = \max_{k}\left(S_{i,j,k}\right) \tag{3}$$

$$ALP_{ij} = S_{i,j,m} - \sum_{k \neq m} S_{i,j,k} \tag{4}$$

$$ALP_{i} = \sum_{j=1}^{l} ALP_{ij} \tag{5}$$

Here
$f_i$ = frequency of insert operation
$f_s$ = frequency of select operation
$f_u$ = frequency of update operation
$f_d$ = frequency of delete operation
$I$ = weight of insert operation
$S$ = weight of select operation
$U$ = weight of update operation
$D$ = weight of delete operation
$C_{i,j,k,r}$ = cost of predicate j of attribute i accessed by application r at site k
$S_{i,j,k}$ = sum of all applications' cost of predicate j of attribute i at site k
$S_{i,j,m}$ = maximum cost among the sites for predicate j of attribute i
$ALP_{ij}$ = actual cost for predicate j of attribute i
$ALP_{i}$ = total cost of attribute i (locality precedence)

Using the above functions, the designer can calculate the actual cost, i.e. the ALP, of a particular attribute. Generally, an update operation incurs more cost than operations such as insert, while a delete operation incurs the least cost at the different application sites. The ISUD operations are given constant weights: I = 2 for insert, S = 4 for select, U = 3 for update and D = 1 for delete. The justification for constant weights is that, at design time of the DDB, the designer does not know how frequently insert, select, update and delete operations on a particular attribute will occur at the different sites.
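To make the cost functions concrete, the following worked example evaluates them for a single predicate j of an attribute i with the weights given above. The frequencies and the costs at the second and third site are invented for illustration and do not come from the case study. Suppose one application at site 1 performs 2 inserts, 5 selects, 1 update and 1 delete:

$$C_{i,j,1,1} = f_i I + f_s S + f_u U + f_d D = 2 \cdot 2 + 5 \cdot 4 + 1 \cdot 3 + 1 \cdot 1 = 28$$

With this single application, $S_{i,j,1} = 28$. Assuming further that $S_{i,j,2} = 10$ and $S_{i,j,3} = 6$, the maximum cost is reached at site $m = 1$, and equation (4) gives

$$ALP_{ij} = S_{i,j,1} - \left(S_{i,j,2} + S_{i,j,3}\right) = 28 - (10 + 6) = 12$$

A positive value indicates that the accesses to this predicate are concentrated at one site; the value becomes negative when the other sites together access it more than the dominant site does.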

In the following section, the airline reservation system database of [18] is used to describe the mechanism of data fragmentation and allocation over a distributed environment in a real-world scenario, to aid understanding from an implementation point of view.

2.8 Generic Five Steps for Data Fragmentation and Allocation in Distributed Database Systems

The five-step method is a systematic approach to constructing a data allocation with respect to fragments in a distributed database environment [18]. One objective of presenting these steps is to give a concrete example from the literature that shows the reader how data fragmentation is possible in the real world. The steps are taken from [18], which explains the fragmentation of distributed data and the allocation of data to various sites. We relate them to our own research work, which is presented in a later part of this report. The steps are as follows [18].

Step 1: Collect Existing Global Relations

Step 2: Analyse Frequently Asked Queries (FAQs)

Step 3: Set Data Allocation Objectives

Step 4: Transform Global Relations into Fragment Relations

Step 5: Allocate Fragment Relations to Sites

For our research work, we chose some of the steps of the generic five-step approach for data fragmentation and allocation in a distributed environment, to show the reader how data fragmentation can be performed in the real world.

2.8.1 Collection of Global Relations

The first step of the five-step method for data allocation in distributed database systems is the collection of global relations. Relations correspond to tables in a database system. The design of the global relations follows the procedures of entity-relationship modelling and normalization [18]. We express these steps in terms of our case study (Bharat Transport Services) so that the process of data allocation, i.e. fragmenting, replicating and distributing data, becomes concrete [18].

2.8.2 Frequently Asked Queries (FAQs)

The second step of this approach analyses the frequently asked queries (FAQs) of the end users. In the airline reservation example, the FAQs of the reservation system are classified into various categories; see [18]. This classification helps to understand what type of data is retrieved at the different sites of the distributed database. We correlate our case study (Bharat Transport Services) with this classification by considering the queries that end users frequently issue at the different sites according to their demands and needs. A user query retrieves data from different relations using SQL, and its answer can be retrieved at any destination site.

2.8.3 Data Allocation Goals

The third step of the five-step approach sets the data allocation goals in a distributed environment. These goals aim to increase the availability and reliability of the data for end-user queries at the different sites and to reduce the communication cost of data transfer over the distributed environment. This step also highlights the importance of storage cost and emphasizes how the generic steps contribute to reducing it.

This step also exploits parallelism by utilizing the resources of other sites during query processing whenever possible. For this purpose, data replication is the technique of choice for achieving the data allocation goals [18]. We express the data allocation step in terms of our case study (Bharat Transport Services) in a later section of our research work.

Of the steps above, the first describes the relations in the database of Bharat Transport Service, which is explained in detail in section 1.1.2. The second step concerns the results retrieved from that database in answer to end-user queries. The third step sets the data allocation objectives: increasing availability and reliability, minimizing communication cost and minimizing storage cost in a distributed environment.

The fourth step describes how relations can be transformed into fragment relations at a single site with the help of a data fragmentation technique, e.g. horizontal fragmentation. The fifth step explains how the fragment relations are distributed over and allocated to the different sites. The last three steps are excluded from our research work, so we only give the reader an overview of the allocation part using the example in [18], as it is connected to the first two steps.

The generic five steps for data fragmentation and allocation are described in detail and in a structured manner in [18], which explains the allocation of fragment relations over different sites. In this thesis, we chose the data fragmentation technique of [24] for implementation; it is explained in the following parts of this report.

Research Methods

3 Research Method

In this chapter, we describe the research methods adopted in this thesis. The research methods are categorized into three levels: a high level research method, a low level method for research design and a low level method for implementation. The high level research method is used for data inquiry during the assessment of system requirements with the domain experts and knowledge mentors. The two low level methods are used, respectively, for conducting the overall research design of this thesis and (the Design Science Research method) for the development and implementation of the 5-layer architecture for database fragmentation in a distributed environment in a constructive and systematic way.

A research methodology supports a diligent, rigorous and systematic process of investigating a specific problem in order to describe effective solutions and to develop and test explanatory concepts, theories and applications [16]. Figure-6 gives an abstract view of the research design method.

[Figure 6: diagram of the Research Design Process with the elements Domain's Contextual Problem, High Level Research Method for Data Inquiry, Low Level Method for Research Design, Low Level Method for Implementation (DSR) and Optimal Solution to Domain's Contextual Problems]

Figure 6: Research Design Method [7]

3.1 Categories of Research Methods

Three research methods are used in this thesis to investigate the problems at different levels and to reach an effective and optimal solution in a systematic way.

1. High Level Research Method for Data Inquiry

2. Low Level Method for Research Design

3. Low Level Method for Implementation (Design Science Research Methodology)

These three methods are used in our research work, respectively, to collect data from the domain experts, to conduct the overall design of the research and to implement the proposed technique identified in the literature review. They are explained in detail in the following sections.

3.2 High Level Research Method for Data Inquiry

To collect information, we chose to acquire it directly from the domain experts. We used the following high level method for information collection.

1. Meeting Session

The meeting session is a high level method that is very useful for acquiring information from domain experts. We held a meeting session at the start of our research work with our supervisor and the knowledge mentors at the Swedish Board of Agriculture to grasp the problem context and to focus on what we want to achieve by the end of this thesis. The primary agenda of the meeting was to discuss the problems of data fragmentation in a distributed database environment identified in the literature review and to specify the scope of the work. The session with our supervisor and knowledge mentors was highly motivating and helpful for understanding the data fragmentation and allocation problems in a distributed environment and how we can focus on achieving optimal results.

3.3 Low Level Method for Research Design

We chose the constructive research method to design the research work in a systematic way.

3.3.1 Constructive Research

Constructive research is considered one of the most popular methodologies for research design, because it supports problem solving by selecting and combining previously learned theories, procedures, declarative knowledge and cognitive strategies to solve unknown problems in a specific subject area [6][7]. We derived the following steps from [6], which are useful in our research work.

3.3.1.1 Constructive Research Steps

The following steps are necessary to conduct constructive research [15][16]:

Step 1: Prepare the case study according to the discussion in the meeting session

with the help of domain experts.

Step 2: To define the domain problems

Step 3: To define the scope of the domain’s problem

Step 4: Develop and describe the design for solutions

Step 5: Deploy proposed solution for implementation and testing

Step 6: To evaluate scope of the solution with knowledge mentors

Step 7: Refine the design structure of the solution after getting feedback from

domain users and domain experts

3.3.2 Phases of Constructive Research

The steps are described below and depicted in figure-7 to give a good understanding of the different phases of constructive research.

3.3.2.1 Preparing Case Study

The purpose of the case study is to give a thorough understanding of the domain's problems as a basis for quality work. To prepare the case study, it is very important to understand the contextual knowledge of the domain in order to analyze the nature of the problems and to propose effective solutions in the various contexts.

3.3.2.2 To Define the domain problems

For conducting good research, it is important to define domain problems so that the

researchers can take some initiatives for addressing these problems.

3.3.2.3 To define the scope of the domain’s problem

It is necessary to mark the boundaries of the area of concentration within the domain in order to find an optimal, efficient solution. This helps to direct the resources that are spent on the domain's problem [16].

3.3.2.4 Develop and describe the design for solutions

This phase describes the designs of the solutions for addressing the problems. Here is

the stage at which we have developed the design models to illustrate the problem

domain [16].

3.3.2.5 Deploy proposed solution for implementation and testing

This phase emphasizes different development strategies in terms of the design models to address the problems in the given context of the organization. It also covers testing to ensure that the prototype fulfils the requirements of the domain users [16].

3.3.2.6 To evaluate scope of the solution with knowledge mentors

This phase concerns the evaluation of the scope defined in the research design process. At this stage, the scope of the solution is evaluated by domain experts, who evaluate the design model by running different queries against the prototype according to their requirements.

3.3.2.7 Refine the design structure of the solution after getting feedback from domain users and domain experts

This phase describes the refinement of the design structure of the solution after feedback has been obtained from the domain users and the domain experts. It gives the researcher the opportunity to improve the design of the model in future work.

[Figure 7: flow diagram in which the domain experts and the researcher construct the case study, with the phases Prepare Case Study, Define the domain problems, Define the scope of the problem's domain, Develop and describe the design for solutions, Deploy proposed solution for implementation, Evaluate the scope of the solution and Refine the Design Structure, and with feedback after evaluation]

Figure 7: Constructive Research Methodology for Research Design

3.4 Low Level Design Research Methodology (DSR) for Implementation

Design Science Research (DSR) methodology is considered one of the promising methods for conducting systematic design research in various science disciplines or in a developing industry. DSR has contributed to natural science research and generally proposes four outputs of design science research: 1) constructs, 2) models, 3) methods and 4) instantiations [22]. The methodology consists of several steps that guide practitioners and researchers in designing rationally. We use it for the implementation of the defined algorithm. The following steps are explained in the context of our research questions, i.e. from the point of view of problem awareness, after which the implementation of the suggested solution is presented. The testing of the algorithm is covered in the development and evaluation steps.

3.4.1 Steps of the Design Science Research Method (DSR)

The steps of the design science research (DSR) method are as follows:

1. Awareness of the problem

2. Suggestion

3. Development

4. Evaluation

5. Conclusion

The above steps are depicted in figure-8.

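[Figure 8: diagram with the columns Knowledge Flows, Process Steps and Outputs; the process steps are Awareness of the Problem, Suggestion, Development, Evaluation and Conclusion, the outputs are Proposal, Tentative Design, Artifact, Performance Measures and Results, and the knowledge flows are Circumscription and Operation and Goal Knowledge]

Figure 8: The General Methodology of Design Science Research [22]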

3.4.1.1 Awareness of the Problem

Awareness of the problem can come from various information channels, such as new developments in industry or in a reference discipline [22]. The output of this phase is a proposal, formal or informal, for a new research effort that encourages researchers and practitioners to understand the context of the problem in different domains [22]. In this research work, we obtained awareness of the data fragmentation and allocation problem from the knowledge mentors of the Swedish Board of Agriculture (details of the case study can be found in section 1.2.1) and from the literature review [24].

3.4.1.2 Suggestion

The Suggestion phase follows immediately behind the proposal and builds on the awareness of the problem [22]. In any formal proposal for design science research (DSR), a tentative design is an integral part of the proposal. “Tentative design is an essentially creative step wherein new functionality is envisioned based on a novel configuration of either existing or new and existing elements. There are different approaches to address the problems of software system complexity. Some of the alternatives that were discarded included development of a new software development methodology specifically focused on operation support systems, automation of the maintenance function, and development of a high-level programming environment” [22].

Suggestion is an essentially creative step wherein new functionality is envisioned
based on a novel configuration of either existing or new elements [22]. In the
suggestion phase, we devised some creative steps after analysing the extensive
literature [24] on the problem of fragmentation and allocation in


distributed database architectures; these steps were validated by applying and
testing the ISUD matrix technique on the Bharat Transport Service case.

3.4.1.3 Development

In the development phase, the tentative design is further developed and implemented.
Elaborating the tentative design into a complete design requires creative effort.
The mechanism for developing and implementing the technique varies depending on
the artifact to be constructed. For a formal proof, an algorithm may be required
for the construction of the technique [22].

In the development phase, we have proposed a 5-layer architecture to describe the
pattern of horizontal fragmentation, which can be seen in section 4.2. In this
section, we also describe the development steps involved in the various phases.

Step 1: We chose the small-scale relational database of Bharat Transport Service,
mentioned in section 1.2.2, which represents the real-time scenario in the
initial phase of implementation.

Step 2: The database designer designs the ISUD (insert, select, update, and delete)
matrix with the help of cost functions in the distributed environment. This ISUD
matrix, with its cost functions, is used to test the algorithm. In a real-time
scenario, the variation in the algorithm can be checked with various approaches,
as described in section 4.2.

Step 3: The attribute locality precedence (ALP) table, which gives the importance
of each attribute with respect to the sites of the distributed database, is
generated for each relation by running the algorithm on the ISUD matrix.

Step 4: Define the predicate set (P) for each relation. It is generated for the
attribute with the highest precedence value in the ALP table and also defines
the behavior of the information retrieved from the relation.

Step 5: According to the highest-valued attribute of the ALP table, fragment the
relation (R) using the predicates (P) as selection predicates.

Step 6: Allocate the fragmented data, according to the predicate or query, over the
various sites (S) in the distributed environment.
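Steps 5 and 6 can be illustrated with a minimal Python sketch. It assumes a relation held as a list of dictionaries and one selection predicate per site; the sample rows, the `Amount` attribute and the helper name `fragment_relation` are hypothetical and only mirror the `BillNos` predicate set that appears in chapter 4. It is not the thesis's C#.Net implementation.

```python
# Hypothetical rows of a billing relation R; "Amount" is an assumed attribute.
R = [
    {"BillNos": 1, "Amount": 120},
    {"BillNos": 2, "Amount": 75},
    {"BillNos": 3, "Amount": 300},
    {"BillNos": 1, "Amount": 40},
]

# Predicate set P: one selection predicate per site, so that #P = #S.
P = {
    "site1": lambda row: row["BillNos"] == 1,
    "site2": lambda row: row["BillNos"] == 2,
    "site3": lambda row: row["BillNos"] == 3,
}


def fragment_relation(relation, predicates):
    """Return {site: fragment}; each fragment holds the rows of `relation`
    that satisfy that site's selection predicate (horizontal fragmentation)."""
    return {
        site: [row for row in relation if pred(row)]
        for site, pred in predicates.items()
    }


fragments = fragment_relation(R, P)
for site, rows in fragments.items():
    print(site, rows)
```

Because the three predicates here are mutually exclusive and together cover every row, the fragments are disjoint and their union reconstructs R; a real predicate set would need to be checked for those two properties.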

3.4.1.4 Evaluation

In the evaluation phase, the artifact is evaluated against criteria that are
always implicit and frequently made explicit in the proposal (awareness of the
problem phase) [22]. The results and additional information gained in
constructing and running the artifact are brought together and fed back into
another round of suggestion [22]. The evaluation phase emphasizes measuring the
performance of the algorithm or design technique so that the results can be
judged from different perspectives, which are defined in the proposal or
awareness of the problem phase. We have provided a prototype which demonstrates
horizontal fragmentation.


The evaluation can be confirmed by executing the algorithm, which demonstrates
data fragmentation and allocation over different sites in the defined distributed
environment of a real-time scenario.

1. The evaluation can be verified through changes in the frequencies of the ISUD
(Insert, Select, Update, Delete) table, which are defined in the implementation
part. It can also be assessed by domain users running different transactional
queries with different frequencies against the algorithm.

2. The algorithm is tested on the basis of changes in the frequencies. The
detailed explanation is given in the implementation part of the proposed model
in section 4.2.
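The idea of testing the algorithm by changing the ISUD frequencies can be sketched as follows. This is only an illustration under stated assumptions: the plain sum of the four frequencies stands in for the thesis's cost functions, and the attribute names and counts are made up.

```python
def rank_attributes(frequencies):
    """Order attributes by total ISUD frequency, highest first.
    The plain sum is a stand-in for the real cost functions (an assumption)."""
    totals = {attr: sum(freqs) for attr, freqs in frequencies.items()}
    return sorted(totals, key=totals.get, reverse=True)


# (insert, select, update, delete) counts per attribute; made-up values.
before = {"BillNos": (4, 10, 2, 0), "Customer": (1, 2, 0, 0)}
after = {"BillNos": (4, 10, 2, 0), "Customer": (6, 15, 3, 1)}

print(rank_attributes(before))  # BillNos ranks first
print(rank_attributes(after))   # Customer ranks first after the change
```

A change in the recorded frequencies is enough to change which attribute would be proposed for fragmentation, which is the behaviour the evaluation step sets out to observe.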

3.4.1.5 Conclusion

The conclusion phase is the final stage of a specific research effort. The results
are focused on addressing the data fragmentation problems. The main contribution
of the conclusion is to achieve the results defined in the purpose or objective of
the proposal. After the evaluation phase with the domain experts and knowledge
mentors, we conclude that the results are authentic and that they correspond to
the purpose of this thesis.


4 Results

In this section, we explain the results in two categories, theoretical and
practical. Both address the research questions in our discussion and focus on
achieving the objective of this research work. The results presented in the
following sections are based on the purpose of our research work explained in
section 1.3. We used constructive research to justify the theoretical findings in
the form of a comparison framework drawn from the literature review. We also
chose the design science research (DSR) method to achieve the practical results in
a systematic way. To develop the practical results, we used Microsoft Visual
Studio 2008 with C#.Net to execute our proposed CISUD matrix technique. These
results, and how we achieved them, are explained in the following parts.

One of the significant contributions of our practical results is testing the
linearity of the algorithm. We used the proposed customized ISUD matrix technique,
which helps to test the algorithm on the basis of the frequency of particular
attributes in a distributed environment.

4.1 Theoretical Results

In this section, we justify the answer to the first research question, “What
algorithms do exist in order to uniformly fragment the relations in a distributed
database?”, by using a comparative study framework of the different techniques
that researchers have proposed to support data fragmentation. The characteristics
listed in table-3 highlight the importance of these techniques as found in an
extensive literature review, and we assess the methods used by the different
researchers against them. The comparative study covers the techniques, algorithms,
approaches and methods used to fragment and allocate data over a distributed
database environment.

| Characteristics | Cheng et al. (2002) | Baioo et al. (2004) | H. Ma, K. D et al. (2006) | H. Ma, K. D et al. (2007) | Marwa et al. (2008) | Mahboubi H. and Darmont J (2009) | Dr. A. S. M. Latiful Hoque (2010) | Customized ISUD Tech. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Distributed Database designing at initial stage for partitioning the relations | No | No | No | No | No | Yes | Yes | Yes |
| Horizontal Fragmentation algorithm | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Affinity matrix to build a predicate | No | Yes | Yes | No | Yes | Yes | Yes | Yes |
| Relational Database | Yes | No | Yes | Yes | No | No | Yes | Yes |
| Complexity of the Technique | Yes | Yes | Yes | Yes | Yes | Yes | No | No |
| Allocation decision support | No | No | Yes | Yes | Yes | No | No | Yes |
| Performance | - | - | - | - | - | - | Yes | Yes |
| Efficiency | - | - | - | - | - | - | Yes | Yes |

Table 3: Comparison Framework of different techniques with respect to key characteristics

The above characteristics are explained in detail in the respective research
articles and are presented in two-dimensional form in table-3. This comparison
framework describes the different techniques, algorithms, methods and approaches
explained by various researchers. On the basis of this comparative framework, we
analyzed the precedence and credibility of the characteristics in the various
authors' work on data fragmentation in distributed environments, and finally
judged which specific approach, method or algorithmic technique is suitable to
address the questions defined in section 1.3.

Table-3 gives a holistic view of the characteristics: distributed database
designing at the initial stage for partitioning the relations, horizontal
fragmentation algorithm, affinity matrix to build a predicate, relational
database, complexity of the technique, allocation decision support, performance,
and efficiency, which are explained in detail by the various authors in their
respective publications. If a characteristic is present in the respective research
paper we have marked it “Yes”, and if it is not present we have marked it “No”.
Based on this comparison we selected the technique of Latiful et al. for
implementation, as we found it more efficient and easier to implement. These
characteristics support our proposed customized ISUD matrix technique, which is a
modified version of the technique of Latiful et al., in addressing data
fragmentation in a distributed environment efficiently. After analysis of this
framework, we are confident that we can answer the question of what algorithms
exist to uniformly fragment the relations in a distributed database.

4.2 Practical Results

This section explains the practical results and highlights how the customized ISUD
matrix technique was developed and how we achieved these results after the
practical implementation, which is explained in section 4.2.1. These results
address the second research question:

“How to design the architecture of designated algorithm from Q1?

How to implement and test the proposed algorithmic approach?”

4.2.1 Proposed 5-Layer Architecture

Before implementing the distributed database, and after detailed conversations
with domain experts and knowledge mentors, we assessed and analyzed the nature of
the domain problems in order to understand them well. This section therefore addresses


the part of our second research question, i.e. “How to design the architecture of
designated algorithm from Q1?”. We chose the top-down approach for designing the
distributed database architecture in our research work, which is explained in
detail in section 2.5. The top-down approach lets database designers build a
homogeneous distributed database system from scratch and also makes it easier to
share information at different levels incrementally. We have also tried to
correlate the top-down approach with the five generic steps for data fragmentation
and allocation, which are explained in detail in section 2.8. This five-step
scheme is the starting point for the 5-layer architecture developed in our
research work for data fragmentation and allocation in a distributed database
system. Here, we had to decide how to solve the problem of fragmentation. We chose
the technique in [24], explained in section 2.7.1, as the foundation for database
fragmentation and customized it in our research work for data fragmentation in a
distributed environment. The attribute locality precedence (ALP) table can be
designed and developed by the database designer for each relation of a specific
database system for a distributed environment. This can be done while designing
the database, with the help of the modified ISUD (Insert, Select, Update, and
Delete) matrix and the cost functions. These cost functions are explained in
section 2.7.1.2. The architecture is shown as a block diagram in Figure-9; it
consists of five layers: the Application Layer, Database Layer, Mediator Layer,
Fragmentation Layer, and Allocation Layer.

This architecture gives a holistic view of the functionalities of the different
layers, which are useful for fragmenting the database in the distributed
environment.

The Application layer provides a generic overview of the different sites and of
the specific application which runs on these sites. It also provides the
communication between the user interface and the backend database repository.

The Database layer provides an object view of the database and organizes the data
by applying different database operations such as DDL (data definition language),
DML (data manipulation language), etc.

The Mediator layer (user interface) serves as middleware connecting the database
layer and the fragmentation layer. This layer provides the overall functionality
of the algorithmic approach in our research work: it lets end-users retrieve the
ALP table and the individual ALP table and presents these data in graphical form
for better analysis and understanding.

The Fragmentation layer is responsible for deciding how to fragment the relation
on the basis of the highest attribute values retrieved from the mediator layer
(user interface). This layer also communicates with the database administrator,
who decides on fragmenting the relation to different sites at the start of
database design in the distributed environment.

The Allocation layer helps to allocate the fragmented data over the distributed
sites. This layer is excluded, and its functionality is not applicable, in our
research work.


[Block diagram: (1) Application Layer: sites S1, S2 and S3, each running App1; (2) Database Layer: database repository (ISUD matrix); (3) Mediator Layer: ISUD user interface with the algorithmic approach (Get ALP Table, Get Individual ALP Table, Get Graph); (4) Fragmentation Layer: decision process based on attribute value at different sites, involving the database administrator; (5) Allocation Layer: allocation at different sites based on the decision process (excluded).]

Figure 9: 5-Layer Architecture for Proposed Fragmentation Technique

The functionalities of the above-mentioned layers are discussed in detail in the
following sections, with specific reference to our research work.

4.2.1.1 Application Layer

To test the proposed technique, we chose the case study of Bharat Transport
Service, which is explained in section 1.2.2. We took one of the applications from
the Bharat Transport Service software system, shown in figure-10. The application
layer contains three different sites: S1, S2, and S3. For simplicity, we chose the
same application from the case study, named billing information, at each site.
This application lets end-users store data in the database system of Bharat
Transport Service through insertion, update, and deletion. The end users can use
the application and its functionality as needed from different sites in the
distributed environment. Each application has its own relation (table) in a local
database of Bharat Transport Service, containing various types of attributes that
match the end-users' requirements.


Figure 10: Application of a Case Study

We constructed our own set-up to justify our proposed technique. For testing
purposes, we simulated three different sites on a local machine in order to avoid
network connectivity problems over the intranet or internet. On the local machine,
we used the three local drives C, D and E as three different sites. The
application from the case study, shown in figure-10, is installed on all three
drives. The C: drive is used as site1, the D: drive as site2 and the E: drive as
site3. All three sites share a common database which is saved on drive C. The
application provides insert, update, delete and select operations depending upon a
set of predicates or conditions given by the database administrator. Whenever the
end user runs any query of the application from any site, the access record of the
query is saved in the ISUD matrix table in the database at site1 of the local
machine.

4.2.1.2 Database Layer

The database layer consists of three sub-layers, each with its own database. The
relations shown in figure-11 and figure-12 define the physical storage of data in
the database system.

4.2.1.2.1 Database of case study application

In this task, we took the existing database of the case study application, which
was developed in MS Access. Each relation in the database consists of different
types of attributes, which describe the properties of the relation according to
the requirements of the case study.


Figure 11: Database of Case Study Application

Figure-11 above shows the MS Access database of the case study application, which
has various relations. For testing the customized ISUD technique we use the
“Description” table, which can be seen in figure-11.

4.2.1.2.2 Constructing of Log Files for Customized ISUD Matrix

In this task, we create the customized ISUD (CISUD) matrix with the help of
log-file code embedded in the case study application's code. The log-file code is
the core element for creating the CISUD matrix and the user interface. It is
responsible for saving the data accessed from the different sites, together with
the attribute name, attribute value, predicate name and time of access, in the
database. The CISUD matrix table created with the help of the log file is shown in
figure-12.

4.2.1.2.3 Database for Customized ISUD User-Interface

A database for the customized ISUD user interface is created to hold the CISUD
information. The CISUD (Customized Insert, Select, Update, Delete) matrix table
can be built in any database management system; we used MS Access. A
data-to-location CISUD matrix is a table in which rows indicate attributes of the
entities of a relation and columns indicate the different locations of the
applications [24]. The log file at each site is responsible for creating the
CISUD matrix table entries in the database for each end-user query, recording the
site name, attribute name, attribute value, predicate name and time of access.
The log-file code is shown in detail in section 8.2.
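The logging step can be sketched in Python as follows. The record fields follow the list above (site name, attribute name, attribute value, predicate name, time of access); the `operation` field, the class name `AccessRecord` and the function `build_cisud_matrix` are assumptions introduced for illustration, since the actual log-file code is the C#.Net listing in section 8.2.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AccessRecord:
    site: str             # site name, e.g. "site1"
    attribute: str        # attribute name
    value: str            # attribute value
    predicate: str        # predicate name, e.g. "BillNos=1"
    operation: str        # assumed field: "insert", "select", "update" or "delete"
    accessed_at: datetime  # time of access


def build_cisud_matrix(records):
    """Count accesses per (predicate, site, operation)."""
    matrix = Counter()
    for rec in records:
        matrix[(rec.predicate, rec.site, rec.operation)] += 1
    return matrix


now = datetime(2011, 1, 1, 9, 0)  # arbitrary timestamp for the sample log
log = [
    AccessRecord("site1", "BillNos", "1", "BillNos=1", "select", now),
    AccessRecord("site1", "BillNos", "1", "BillNos=1", "select", now),
    AccessRecord("site2", "BillNos", "2", "BillNos=2", "insert", now),
    AccessRecord("site1", "BillNos", "1", "BillNos=1", "update", now),
]
matrix = build_cisud_matrix(log)
for key, count in sorted(matrix.items()):
    print(key, count)
```

Keying the counter by predicate, site and operation keeps the per-site frequencies separate, so both total and per-site figures can be derived from the same table.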


Figure 12: CISUD matrix table

Figure-12 above shows the CISUD matrix as constructed in the database in our
research. Starting from the existing technique [24], we customized the ISUD
(Insert, Select, Update, and Delete) matrix technique according to the
requirements of the case study. The customized ISUD matrix is a table constructed
by placing the predicates of the attributes of a relation in the rows and the
applications at the sites of a DDBMS in the columns.

4.2.1.3 Mediator Layer (Algorithmic Approach)

The mediator layer provides the core functionality of the proposed technique and
contains the novel contribution of our research work. The user interface of the
CISUD application is designed and implemented in this layer. The mediator layer
takes the CISUD matrix table from the database layer as input. The general
algorithmic approach of our technique is as follows.

Algorithmic Approach

1. Input:
   1a. Total number of sites: S = {S1, S2, …, Sn}
       Relation to be fragmented: R
   1b. Select the attribute and its value
   1c. Select ISUD (Insert, Select, Update, Delete) frequencies
       from the CISUD matrix table: ISUD[R]

2. Output:
   2a. Total ALP value
   2b. Individual ALP value
   2c. Fragments F = {F1, F2, F3, …, Fn}
   2d. Graphical representation of ALP values

3. Construct ALP[R] from ISUD[R] based on the cost functions

4. For the highest-valued attribute of the ALP table, select the individual ALP value, then
   4a. Generate predicate set P = {P1, P2, …, Pm}
   4b. Rearrange P so that #P = #S
   4c. Fragment R using P as selection predicate: $\sigma_{p}(R)$
   4d. Allocate F to S.

The pseudo code of the above algorithm can be seen in section 8.3.
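Step 3 and the first part of step 4 can be sketched in Python as below. This is a sketch under loud assumptions: the weights in `WEIGHTS` are placeholders and are not the cost functions of [24] (those are described in section 2.7.1.2), and the frequencies and the `Customer` attribute are made up. Only the overall shape (cost per predicate, summed per attribute, then pick the maximum) follows the algorithm above.

```python
# ISUD[R]: (insert, select, update, delete) frequencies keyed by
# (attribute, value, site). All numbers are made up for illustration.
ISUD = {
    ("BillNos", 1, "site1"): (4, 10, 2, 0),
    ("BillNos", 2, "site2"): (3, 8, 1, 1),
    ("BillNos", 3, "site3"): (2, 9, 0, 0),
    ("Customer", "A", "site1"): (1, 2, 0, 0),
    ("Customer", "B", "site2"): (0, 3, 1, 0),
}

# Placeholder weights per operation; NOT the cost functions of [24].
WEIGHTS = (2, 1, 3, 2)


def cost(freqs, weights=WEIGHTS):
    """Weighted access cost of one predicate at one site (assumed form)."""
    return sum(f * w for f, w in zip(freqs, weights))


def total_alp(isud):
    """Sum the predicate costs per attribute over all values and sites."""
    totals = {}
    for (attribute, _value, _site), freqs in isud.items():
        totals[attribute] = totals.get(attribute, 0) + cost(freqs)
    return totals


alp = total_alp(ISUD)
best_attribute = max(alp, key=alp.get)
print(alp)             # {'BillNos': 56, 'Customer': 10}
print(best_attribute)  # BillNos
```

With the real cost functions substituted for `cost`, the same two-step structure yields the total ALP table and the attribute chosen for fragmentation.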

4.2.1.3.1 Designing the User Interface for Customized ISUD Application

The user interface of the ISUD application is developed as a C#.Net Windows
application. It calculates the precedence of attributes for the given predicates,
called ALP (attribute locality precedence), with the help of the existing
algorithm defined in [24]. The customized ISUD table is the input to the user
interface. The user interface also calculates the ISUD (Insert, Select, Update,
and Delete) frequencies and generates an ALP table, which shows the attribute with
the highest precedence value; this attribute is then treated as the most important
attribute for fragmentation. The user interface lets end-users retrieve the ALP
table for a set of predicates and the individual ALP table for a given site. It
also provides a graphical representation of the ALP table results, which is used
to explain the testing of the algorithm. The user interface for the customized
ISUD application is shown in figure-13.

Figure 13: User Interface for CISUD Application.


4.2.1.4 Fragmentation Layer

In the fragmentation layer, the decision on horizontal fragmentation is taken by
the database administrator, based on the attribute locality precedence (ALP)
generated by the user interface. A predicate set is used to take the fragmentation
decision after the ALP table values are retrieved. The predicate set is
constructed from the highest individual ALP values per site for each relation of
the case study application. The selected predicate set is the starting point for
horizontal fragmentation of each relation of the case study application. The
remainder of this section explains how the predicate set is constructed with the
help of the ALP table.

4.2.1.4.1 Constructing Predicates Set

The predicate set is constructed based on two things: the highest individual ALP
values and the total ALP values. The highest individual ALP value describes the
importance of the predicate, or attribute value, at an individual site, as shown
in figure-14. The total ALP value gives the total over all sites for each
attribute. After obtaining the ALP values of the relation with the help of the
user interface, the end user has the flexibility to construct the predicate set
for an individual site. The end user or the database administrator can then
decide to fragment the data horizontally at a particular site on the basis of the
predicate set. The following figures show how to set and get the predicate set
values for each site.

Figure 14: Interface for setting and getting the Predicate set for individual highest attribute.

Figure-14 above shows the interface for getting the values of the predicate set
for individually selected sites. This interface was built as a C#.Net Windows
application.


Figure 15: Predicate Set for Highest Attribute Precedence at individual site

P = {site1: BillNos = 1, site2: BillNos = 2, site3: BillNos = 3}

Using figure-15 above, we constructed the predicate set P, which describes the
importance of the attribute values at the different sites and supports the
fragmentation decision at a particular site according to the highest ALP value.
The blue colored lines in figure-15 show the highest predicate value of the
attribute at each site.
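The construction of P from the individual ALP values can be sketched as follows. The individual ALP numbers are invented for illustration (the real values are those shown in figure-15), and `predicate_set` is a hypothetical helper name; the sketch simply picks, for each site, the attribute value with the highest individual ALP.

```python
# Individual ALP per site and BillNos value; made-up numbers.
individual_alp = {
    "site1": {1: 24, 2: 5, 3: 2},
    "site2": {1: 3, 2: 19, 3: 4},
    "site3": {1: 1, 2: 6, 3: 13},
}


def predicate_set(attribute, alp_by_site):
    """For each site, choose the attribute value with the highest individual ALP."""
    return {
        site: (attribute, max(values, key=values.get))
        for site, values in alp_by_site.items()
    }


P = predicate_set("BillNos", individual_alp)
print(P)
# {'site1': ('BillNos', 1), 'site2': ('BillNos', 2), 'site3': ('BillNos', 3)}
```

If two sites happened to favour the same value, this simple rule would assign it to both; resolving such ties is left to the database administrator, as in the thesis.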

4.2.1.5 Allocation Layer

The allocation layer is beyond the scope of our thesis and is excluded from our
proposed technique, but it could be taken up as further work to extend the
technique to the allocation of fragmented data. This layer plays a vital role in
allocating the data fragments to different sites, as illustrated by the example in
figure-16. We do not show the allocation process here, but with the help of the
predicate set the end user or the database administrator can allocate the data to
particular sites. A detailed theoretical description of allocation, in the form of
five generic steps, can be found in section 2.8 and could be helpful in further
developing the distributed database environment.

Figure 16: Allocation of Fragments (the fragments BillNos = 1, BillNos = 2 and BillNos = 3 allocated across Site1, Site2 and Site3)


The above figure-16 shows the allocation of data at different sites based on the

attribute locality precedence.

4.2.2 Testing the Proposed Algorithmic approach

To test our proposed algorithmic approach in practice, we developed a user
interface, explained in detail in section 4.2.1.3, which helps to test the
algorithm as the ISUD frequencies used as inputs change. In this section, we
address the other part of our second research question, i.e. “How to implement and
test the proposed algorithmic approach?”, and explain the development of the user
interface tool and its results. This tool helps the database administrator or end
user take proper horizontal fragmentation decisions at the initial stage of
database design. The tool also calculates the frequencies of the ISUD matrix and
generates the ALP table. After the ALP table is generated, a predicate set is
constructed for each relation based on the highest precedence value in the ALP
table. Using the predicate sets, a proper fragmentation decision is taken at the
different sites according to the highest precedence value. To develop the tool, we
used C#.Net for the front end and MS Access databases for the back end, according
to the needs of the case study (Bharat Transport Service). The results are
categorized into the total value of ALP and the individual value of ALP, which are
presented in graphical form in the following sections.

4.2.2.1 The Retrieved Result of Total ALP Value from all the sites

Figure-17 shows the user interface of the CISUD application, which retrieves the
CISUD (Insert, Select, Update, Delete) frequencies from the CISUD matrix table in
the database. These frequencies are retrieved automatically when the user selects
an entry in the combo box for the total predicate name from the relation in the
database. The frequencies that appear in the textboxes are retrieved from the
CISUD matrix table in the database.

Figure 17: ISUD User Interface for Total cost of Attribute From all Sites


Figure-17 also shows the total frequencies over all the sites. After the
frequencies appear in the respective textboxes, the user can click the button
“GetALPTable” and retrieve the results in the interface shown in figure-18.

Figure 18: Results Retrieved for Total ALP (Attribute Locality Precedence) value from three sites.

Figure-18 above shows the ALP values over all three sites, i.e. the total cost
value of each attribute occurrence in the case study application (Bharat Transport
Service). This work is also supported by the ISUD technique in [24]. In the
retrieved ALP values, the blue colored line marks the attribute with the highest
value compared with the other attributes' ALP values, i.e. “Billnos”. Using this,
the database administrator can easily assess which attribute has the highest ALP
value across all sites. The attribute with the highest total ALP value, “Billnos”,
is then chosen for calculating the ALP value at each site. The total ALP values of
the attributes of a relation thus help the database administrator decide, on the
basis of the highest-valued attribute, for which attribute to obtain the
individual ALP values from the individual sites.

4.2.2.2 The Retrieved Result of ALP Value from individual sites

This section explains the retrieval of ALP values from individual sites and also
highlights a drawback of the technique in [24], which does not give the ALP value
and the cost factor of a predicate at an individual site. Our research work
therefore contributes the retrieval of ALP values for individual sites. On the
basis of these values, the database administrator can decide to fragment the
relation horizontally according to the predicates used.


Figure 19: ISUD User Interface for individual cost of Attribute from individual sites

We provide one way of obtaining the ALP value from individual sites through the user interface shown in figure-19. The user selects the ISUD frequencies from the “individual predicate name” combo box and chooses a site from the “select site” combo box. Clicking the “IndividualALPTable” button then retrieves the results, which are displayed in the interface shown in figure-20 as a sample.

Figure 20: Individual ALP Results from individual sites

Figure-20 above shows the predicate cost together with the attribute name and its value at specific sites. The lines highlighted in blue form predicate sets and give the number of occurrences, or ALP value, of the same attribute name with different values at individual sites. After analysing the predicate sets, the database administrator can decide which site to select for data fragmentation and allocation.

Figure 21: Allocation of data to different sites

From the figure above we can see that, given the ALP results retrieved from the different sites, data can be allocated according to the highest ALP value at a particular site.
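The allocation rule described above can be sketched in a few lines of C#. The sketch assumes that the per-site cost of a predicate is the weighted sum used in appendix 8.3. The class and method names, the weight values and the frequencies are all hypothetical and chosen only for illustration.

```csharp
using System;

// Hypothetical, self-contained sketch; names and numbers are illustrative only.
public static class AlpSketch
{
    // Assumed weights for the I, S, U and D frequencies (placeholders, not the thesis values).
    const int Fi = 2, Fs = 1, Fu = 3, Fd = 2;

    // Weighted cost of one application's accesses to one predicate.
    public static int Cost(int i, int s, int u, int d)
    {
        return Fi * i + Fs * s + Fu * u + Fd * d;
    }

    // Index of the site with the highest cost (the first one wins on ties).
    public static int MaxSite(int[] siteCosts)
    {
        int pos = 0;
        for (int k = 1; k < siteCosts.Length; k++)
        {
            if (siteCosts[k] > siteCosts[pos]) pos = k;
        }
        return pos;
    }

    // Cost at the maximum-cost site minus the summed cost at all other sites.
    public static int AlpSingle(int[] siteCosts)
    {
        int pos = MaxSite(siteCosts);
        int sumOther = 0;
        for (int k = 0; k < siteCosts.Length; k++)
        {
            if (k != pos) sumOther += siteCosts[k];
        }
        return siteCosts[pos] - sumOther;
    }

    public static void Main()
    {
        // One predicate, three sites, one application per site.
        int[] siteCosts =
        {
            Cost(1, 4, 0, 0), // site index 0: 2 + 4 = 6
            Cost(3, 5, 2, 1), // site index 1: 6 + 5 + 6 + 2 = 19
            Cost(0, 2, 1, 0)  // site index 2: 2 + 3 = 5
        };
        Console.WriteLine("Allocate to site index " + MaxSite(siteCosts));
        Console.WriteLine("ALP of the predicate: " + AlpSingle(siteCosts));
    }
}
```

With these illustrative numbers the second site (index 1) carries the highest cost, 19, so the fragment defined by the predicate would be allocated there, and the precedence of the predicate is 19 - (6 + 5) = 8.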

4.2.2.3 Graphical Representation of the ISUD application Results

The graphical representation of the ISUD application interprets the various results obtained in the development phase in section 4. Its importance lies in helping to test the linearity of the application or algorithm under various ISUD frequency inputs. We present two such interpretations of the ISUD application results in the following sections.

4.2.2.3.1 Interpretation of Result-1

The ALP results vary in direct proportion to the number of predicates and the ISUD frequencies. Mathematical expression 1 below is an array that holds the number of predicates of each individual attribute.

totPredicates = new int[15] { 4, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };    (Mathematical Expression 1)

Figure-22 below shows the different ISUD input values entered in the ISUD entry interface. Changing the ISUD inputs produces different results.

Figure 22: ISUD input values (1)

[Figure labels: Site1, Site2, Site3; BillNos = 2, BillNos = 3, BillNos = 1]


Figure-23 below presents result 1 in tabular (two-dimensional) form: the total ALP value for each individual attribute name, computed from the ISUD input values.

Figure 23: Interpretation of Result 1

Figure-24 shows the interpretation of result 1 in graphical form. The graph gives a holistic view of the ISUD application results and shows the relation between the various attributes and the frequencies from all sites in the distributed environment. It has two variables: the attribute names on the x-axis and the total ALP values on the other axis. The graph in figure-24 also supports the testing and measurement of the algorithm, or the proposed ISUD technique, by practitioners (database administrators or end-users), and helps them take decisions efficiently at the initial stage of database fragmentation. When the inputs change, the graph changes accordingly, so the performance of the algorithm under different operational changes can be observed. In the graphical interpretation of result 1, we can see that the ALP value of the attribute “Billnos” is much greater than that of the attribute “NameofReceipent” over all three sites.

Figure 24: Graphical Interpretation of Result 1


4.2.2.3.2 Interpretation of Result-2

The interpretation of result 2 is a second example, with a different total number of predicates and different ISUD frequencies, as shown in mathematical expression 2 and in figure-25.

totPredicates = new int[15] { 3, 2, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };    (Mathematical Expression 2)

Figure 25: ISUD input values (2)

The procedure for interpreting the results was discussed in detail in section 4.2.2.3.1 and is not repeated here. As before, figure-26 below presents result 2 in tabular (two-dimensional) form: the total ALP value for each individual attribute name, computed from the ISUD input values.

Figure 26: Interpretation of Result 2


Figure 27: Graphical Interpretation of Result 2

Figure-27 shows the interpretation of result 2 in graphical form. As explained for result 1, the graph gives a holistic view of the ISUD application results and shows the relation between the various attributes and the frequencies from all sites in the distributed environment. In the graphical interpretation of result 2, we can see that the ALP value of the attribute “NameofReceipent” is much greater than that of the attribute “Billnos” over all three sites.

The two graphical interpretations above are thus one way of addressing part of our second research question, “How to implement and test the proposed algorithmic approach?” Overall, the practical results address both research questions: “How to design the architecture of designated algorithm from Q1?” and “How to implement and test the proposed algorithmic approach?”


5 Discussion

In this dissertation, our objective has been to investigate an efficient algorithm or technique for database fragmentation, and to develop and test the investigated algorithm in a distributed environment. We constructed a comparative framework in section 5.1 on the basis of characteristics drawn from the literature review: distributed database design at the initial stage for partitioning relations, horizontal fragmentation algorithm, affinity matrix to build a predicate, relational database, complexity of the technique, allocation algorithm, performance, and efficiency. This comparative framework highlights the most efficient algorithm or technique for developing database fragmentation in a distributed environment.

We have presented our proposed five-layer architecture technique and described its layers: application layer, database layer, mediator layer, fragmentation layer and allocation layer. These layers perform various functions, which are explained in detail in section 4.2.1. The implemented technique can be used to develop a user interface for calculating ALP values, which in turn supports horizontal database fragmentation and testing of the algorithm. The following section describes the main contribution of this dissertation.

5.1 Contribution of the Work

Our customized approach motivates developers and end-users to address the aforementioned characteristics. The proposed ISUD approach is one solution that helps to fragment the database at the initial stage of distributed database design. It also gives the developer confidence when deciding on horizontal fragmentation, because the decision rests on a specific algorithmic approach.

The customized ISUD technique creates the ALP table in the user interface with the help of the ISUD matrix table, which is generated from log files. The technique supports horizontal database fragmentation in a distributed database environment. Compared with the previous techniques listed in table-1, it is less complex to execute for data fragmentation in distributed database systems, and it supports efficiency and performance with respect to the cost factor and allocation.

The structure of the customized ISUD technique is taken from [24], which provides a theoretical solution for horizontal database fragmentation; our research work develops it as far as implementation is concerned. Our customized ISUD technique extends its features by creating an individual ALP table for each site, because [24] emphasizes only the summarized total cost of attribute locality precedence (ALP) over all the sites. It thus gives the database administrator a way to obtain the individual cost of an attribute at each site, so that proper fragmentation decisions can be taken at the site where the attribute locality precedence (ALP) is at its maximum.


We have also presented the results of the ISUD algorithm in graphical form, so that the database administrator can analyze the performance of the algorithm by supplying ISUD frequencies as input to the ISUD user interface application.


6 Conclusion and Future Work

6.1 Conclusion

This research work emphasizes horizontal fragmentation in distributed database environments. It has a theoretical purpose and a practical purpose. The theoretical purpose is to explore the existing algorithms that help to fragment relations uniformly over a distributed environment. The practical purpose is to implement a user interface application that supports horizontal database fragmentation, and to test the performance of the algorithm selected from the literature review in a distributed database environment.

The main contribution of our dissertation is therefore the development and implementation of a 5-layer architecture, described in section 4.2.1, built on the existing fragmentation technique of [24], which helps the database administrator or end-users to take efficient decisions when fragmenting relations in a distributed environment. To address the complexity factors of database fragmentation, we have proposed the customized ISUD technique (5-layer architecture) as one solution for database fragmentation in a distributed environment.

We conducted session meetings as a high-level method of acquiring data from domain experts and knowledge mentors, in order to understand the domain’s problems better. We chose design science research (DSR), explained in section 3.4, as the development methodology for the customized ISUD technique, which fragments relations horizontally at the initial stage of a distributed database environment. The customized ISUD application (user interface) calculates the total cost of an attribute across the different sites, and also the individual cost of an attribute with respect to a defined predicate at a nominated site.

One of the main objectives of the proposed customized ISUD technique is to show the highest precedence value of an attribute (ALP value) in graphical form, which helps the database administrator or end-users to decide how to fragment relations at the initial stage of a distributed database environment. By observing the graphical statistics of the ALP (attribute locality precedence) table, we can readily evaluate the performance of the algorithm under different operational changes to the ISUD frequency inputs.

We have discussed the existing techniques and algorithms in detail in section 2.5. Each has its own pros and cons. After analysing the comparative framework of existing techniques, we have suggested one solution for fragmenting relations efficiently in a distributed database environment. This technique can be evaluated with respect to performance and cost by making different operational changes to the ISUD frequency matrix.


This dissertation can be considered a foundation for the horizontal fragmentation of relations at the initial stage of a distributed database, and it also supports database environments that are already distributed.

6.2 Future Work

This research work has laid the foundation for further work on vertical fragmentation and on heterogeneous distributed database environments. The customized technique adopted in this research is also useful for allocating data to different sites, based on the results obtained from the ALP (attribute locality precedence) table through the ISUD application user interface. It can be extended to support fragmentation in distributed object-oriented databases.


7 References

[1] M. Tamer Özsu, (1998) (2011), “Principles of Distributed Database Systems”, University of Waterloo, Ontario, Canada N2L 3G1.

[2] Gomer Thomas, Glenn R. Thompson, Chin-Wan Chung, Edward Barkmeyer, Fred Carter, Marjorie Templeton, Stephen Fox, Berl Hartman, (1990), “Heterogeneous Distributed Database Systems for Production Use”, ACM Computing Surveys, Vol. 22, No. 3.

[3] Amit P. Sheth, James A. Larson, (1990), “Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases”, ACM Computing Surveys, Vol. 22, No. 3.

[4] Jacob Slonim, Dave Schmidt, Paul Fisher, (1979), “Considerations for Determining the Degree of Centralization or Decentralization in the Computing Environment”, North-Holland Publishing Company, Information & Management 2, pp. 15-29, USA.

[5] Haroun Rababaah, “Distributed Databases Fundamentals and Research”, Advanced Database – B561, Spring 2005, Dr. H. Hakimzadeh, Department of Computer and Information Sciences, Indiana University South Bend.

[6] Jason Durbin and Lance Ashdown, “Oracle8i Distributed Database Systems, Release 2 (8.1.6)”, Oracle Corporation, 1999.

[7] Stephen Smith, “Accpac and It’s Databases”, article in Stephen Smith's Blog. http://smist08.wordpress.com/2010/07/10/accpac-and-it%E2%80%99s-databases/

[8] Marton Trencseni, Attila Gazso, (2009), “Keyspace: A Consistently Replicated, Highly-Available Key-Value Store”. http://scalien.com/whitepapers. Retrieved 2010-04-18.

[9] Mike Burrows, (2006), “The Chubby Lock Service for Loosely-Coupled Distributed Systems”. http://labs.google.com/papers/chubby.html. Retrieved 2010-04-18.

[10] Dr. George Schussel, DCI's founder, Chairman of Database & Client/Server World and a world-renowned authority on information systems and client/server technology. http://www.dciexpo.com/geos/replica.htm

[11] Ed Boyajian, President and Chief Executive Officer. http://www.enterprisedb.com/docs/en/8.4/repserver/Postgres_Plus_Advanced_Server_Replication_Server_Users_Guide-08.htm#TopOfPage

[12] Advanced Database – B561, Spring 2005, Dr. H. Hakimzadeh, Department of Computer and Information Sciences, Indiana University South Bend.

[13] M. Tamer Özsu (GTE Laboratories), Patrick Valduriez (INRIA), “Distributed Database Systems: Where Are We Now?”

[14] Huang Y.F., Chen J., (2001), “Fragment Allocation in Distributed Database Design”, Journal of Information Science and Engineering, Vol. 17, pp. 491-506.

[15] Lindholm A., (2008), “A Constructive Study on Creating Core Business Relevant CREM Strategy and Performance Measures”, Emerald Group, Facilities, Vol. 26, No. 7-8, pp. 343-358.


[16] Fareedi A.A., (2010), “Ontology-based Model for the “Ward-round” Process in Healthcare (OMWRP)”, Master’s thesis, School of Engineering, Jönköping University.

[17] Baiao F., Mattoso M., Zaverucha G., (2002), “A Framework for the Design of Distributed Databases”, Computer Science Department, COPPE/UFRJ, Federal University of Rio de Janeiro, Brazil.

[18] Daudpota N.H., (1998), “Five Steps to Construct a Model of Data Allocation for Distributed Database Systems”, Journal of Intelligent Information Systems, Vol. 11, pp. 153-168, Netherlands.

[19] Son J.H., Kim M.H., (2003), “An Adaptable Vertical Partitioning Method in Distributed Systems”, Journal of Systems and Software, Elsevier.

[20] Yee W.W.G., Donahoo M.J., Navathe S.B., (2000), “A Framework for Server Data Fragment Grouping to Improve Server Scalability in Intermittently Synchronized Databases”, CIKM.

[21] Hababeh I.O., Bowring N., (2003), “A Method for Fragment Allocation Design in the Distributed Database Systems”, UGRU-4, The Sixth Annual U.A.E. University Research Conference.

[22] Vaishnavi V.K., Kuechler Jr. W., (2008), “Design Science Research Methods and Patterns: Innovating Information and Communication Technology”, Auerbach Publications, Taylor and Francis Group, ISBN 978-1-4200-5932-8, New York, USA.

[23] Azzam Sleit, Wesam AlMobaideen, Samih Al-Areqi, and Abdulaziz Yahya, “Dynamic Object Fragmentation and Replication Algorithm in Distributed Database Systems”, King Abdulla II School for Information Technology, University of Jordan, Amman, Jordan.

[24] Shahidul Islam Khan and Dr. A. S. M. Latiful Hoque, “A New Technique for Database Fragmentation in Distributed Systems”, Department of Computer Science & Engineering, Bangladesh University of Engineering & Technology.

[25] M. T. Ozsu and P. Valduriez, Principles of Distributed Database Systems, 3rd ed., New Jersey: Prentice-Hall, 2011.

[26] S. Navathe, K. Karlapalem, and M. Ra, “A mixed fragmentation methodology for initial distributed database design,” Journal of Computer and Software Engineering, Vol. 3, No. 4, pp. 395–426, 1995.

[27] Leniel Braz de Oliveira Macaferi, “Top-Down Approach in Distributed Databases”, Barra Mansa, November 2007.

[28] Hui Ma, “Distribution Design for Complex Value Databases”, dissertation presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Systems at Massey University, 2007.

[29] Ceri, S., and Pelagatti, G., Distributed Databases: Principles and Systems, McGraw-Hill, New York, 1984.

[30] S. B. Navathe, S. Ceri, G. Wiederhold, and J. Dou, “Vertical partitioning algorithms for database design,” ACM Transactions on Database Systems (TODS), Vol. 9, No. 4, pp. 680–710, 1984.

[31] C. H. Cheng, W. K. Lee, and K. F. Wong, “A genetic algorithm-based clustering approach for database partitioning,” IEEE Transactions on Systems, Man, and Cybernetics, Vol. 32, No. 3, pp. 215–230, 2002.


[32] F. F. Marwa, I. E. Ali, A. A. Hesham, “A heuristic approach for horizontal fragmentation and allocation in DOODB,” in Proc. INFOS2008, 2008, pp. 9-16.

[33] Hadj Mahboubi and Jérôme Darmont, University of Lyon (ERIC), “Enhancing XML Data Warehouse Query Performance by Fragmentation,” in Proc. ACM SAC09, 2009, pp. 1555-1562.

[34] Wenfei Fan, Introduction to XML and Relational Databases, http://homepages.inf.ed.ac.uk/wenfei/cs2/lecture/ln1.pdf, lecture note 1, pp. 11-17, Spring 2005.

[35] Ezeife, C.I. and Zheng, J., 1998. Measuring the Performance of Database Object Horizontal Fragmentation Schemes, supported by NSERC of Canada.


8 Appendix:

8.1 Case study Application

Figure 28: Case Study Application

8.2 Log File Code for generating Customized ISUD matrix table

Log files are used to create the ISUD matrix table in the database. The logging function below is called for each query whenever it is executed through the application interface.

// Requires: using System; using System.Data; using System.Data.OleDb;
public void writeToLogFile(string user_name, string query_name, string attribute_name,
    string attribute_value, DateTime time)
{
    Console.WriteLine("I am in writeToLogFile function.");

    // Connection string for the case-study Access database.
    string connectionString = "Provider=Microsoft.Jet.OLEDB.4.0; Data Source="
        + "C:\\Documents and Settings\\All Users\\Documents\\mydatabasethesis\\bharattransportservice.mdb";
    Console.WriteLine("Database:[" + Environment.CurrentDirectory + "\\bharattransportservice.mdb]");

    // One row per executed query: who ran it, which query, which attribute and value, and when.
    // Positional parameters (?) replace string concatenation, so quotes in a value cannot break the statement.
    string query = "insert into Access_Record values(?, ?, ?, ?, ?)";
    Console.WriteLine(query);

    // The using blocks close the connection and command even if the insert throws.
    using (OleDbConnection con = new OleDbConnection(connectionString))
    using (OleDbCommand cmd = new OleDbCommand(query, con))
    {
        cmd.CommandType = CommandType.Text;
        // OleDb binds parameters by position, so the order must match the table columns.
        cmd.Parameters.AddWithValue("@user_name", user_name);
        cmd.Parameters.AddWithValue("@query_name", query_name);
        cmd.Parameters.AddWithValue("@attribute_name", attribute_name);
        cmd.Parameters.AddWithValue("@attribute_value", attribute_value);
        cmd.Parameters.AddWithValue("@time", time.ToString()); // the timestamp is passed as text
        con.Open();
        cmd.ExecuteNonQuery();
    }
}
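The listing above does not show how the rows of Access_Record are turned into the ISUD matrix table. One possible aggregation step is sketched below, under stated assumptions: each logged access is taken to carry a site number and a one-letter operation code (I, S, U or D), neither of which appears in the writeToLogFile signature, and the class, field and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: counts logged accesses per site, attribute and predicate value.
public static class IsudMatrixSketch
{
    // Assumed shape of one logged access; Site and Operation are not part of the thesis log.
    public sealed class LogEntry
    {
        public int Site;
        public string AttributeName;
        public string AttributeValue;
        public char Operation; // 'I', 'S', 'U' or 'D'
    }

    // Returns, for each "site|attribute|value" key, an int[4] holding the
    // I, S, U and D frequencies in that order.
    public static Dictionary<string, int[]> BuildMatrix(IEnumerable<LogEntry> log)
    {
        var matrix = new Dictionary<string, int[]>();
        foreach (LogEntry e in log)
        {
            int column = "ISUD".IndexOf(e.Operation);
            if (column < 0) continue; // skip anything that is not I, S, U or D

            string key = e.Site + "|" + e.AttributeName + "|" + e.AttributeValue;
            int[] row;
            if (!matrix.TryGetValue(key, out row))
            {
                row = new int[4];
                matrix[key] = row;
            }
            row[column]++;
        }
        return matrix;
    }

    public static void Main()
    {
        // Illustrative log entries only; the values are not taken from the case study.
        var log = new List<LogEntry>
        {
            new LogEntry { Site = 1, AttributeName = "Billnos", AttributeValue = "2", Operation = 'S' },
            new LogEntry { Site = 1, AttributeName = "Billnos", AttributeValue = "2", Operation = 'S' },
            new LogEntry { Site = 1, AttributeName = "Billnos", AttributeValue = "2", Operation = 'U' },
            new LogEntry { Site = 2, AttributeName = "Billnos", AttributeValue = "3", Operation = 'I' }
        };
        foreach (KeyValuePair<string, int[]> pair in BuildMatrix(log))
        {
            Console.WriteLine(pair.Key + " -> " + string.Join(",", pair.Value));
        }
    }
}
```

Each resulting row supplies the four frequencies that the cost calculation in appendix 8.3 weights by fi, fs, fu and fd. Keying the counts by a composite string keeps the sketch short; a real implementation would more likely aggregate with a GROUP BY query on the log table.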

8.3 Algorithm for ISUD application Interface

Below is sample code implementing our proposed technique, based on the pseudocode algorithm of [24], in a C#.NET Windows application.

/* Code to generate ALP, following the pseudocode of [24] */
for (int i = 0; i < totAttribues; i++)
{
    for (int j = 0; j < totPredicates[i]; j++)
    {
        MAX[i, j] = 0;
        for (int k = 0; k < totSites; k++)
        {
            System.Console.WriteLine("1..K:[" + k + "]");
            SS[i, j, k] = 0; // start each site total from zero
            for (int r = 0; r < totApplications[k]; r++)
            {
                // Cost of predicate j of attribute i for application r at site k:
                // the I, S, U and D frequencies weighted by fi, fs, fu and fd.
                C[i, j, k, r] = (fi * I) + (fs * S) + (fu * U) + (fd * D);
                // Total cost of the predicate at site k over all applications.
                SS[i, j, k] += C[i, j, k, r];
            }//end of 4th loop

            // Remember the site at which the cost of the predicate is highest.
            if (SS[i, j, k] > MAX[i, j])
            {
                MAX[i, j] = SS[i, j, k];
                POS[i, j] = k;
            }
            System.Console.WriteLine("POS[i, j][" + POS[i, j] + "]");
            tempSite = k;
            Dgsiteview.Rows.Add(k, AttNames[i], totPredicates[i]);
        }//end of 3rd loop

        // Sum the cost at every site other than the maximum-cost site.
        SumOther = 0;
        for (int k = 0; k < totSites; k++)
        {
            if (k != POS[i, j])
            {
                SumOther += SS[i, j, k];
            }
        }
        // Precedence of predicate j: cost at the maximum site minus the cost elsewhere.
        ALPsingle[i, j] = SS[i, j, POS[i, j]] - SumOther;
    }//end of 2nd loop

    // Total precedence of attribute i: the sum over its predicates.
    ALP[i] = 0;
    for (int j = 0; j < totPredicates[i]; j++)
    {
        ALP[i] += ALPsingle[i, j];
        System.Console.WriteLine("tempSite = " + tempSite);
    }
}//end of 1st loop

// Show the total ALP value of each attribute in the results grid.
for (int i = 0; i < ALP.Length; i++)
{
    DGcostofPredofatt.Rows.Add(AttNames[i], ALP[i]);
}
System.Console.WriteLine("ALP");

}

End of ISUD application//