Page 1:

SQMD: Architecture for Scalable, Distributed Database System

built on Virtual Private Servers

e-Science for cheminformatics and drug discovery
4th IEEE International Conference on e-Science, 2008

Kangseok Kim, Marlon E. Pierce
Community Grids Laboratory, Indiana University

[email protected], [email protected]

Rajarshi Guha
School of Informatics, Indiana University

[email protected]

Page 2:

Introduction

Huge increase in the size of datasets in a variety of fields, e.g.:
scientific observations for e-Science
sensors (video, environmental)
data fetched from the Internet that characterizes users' interests

We need data management, partitioning, and processing strategies that are scalable.

We also need to find effective ways to use our overabundance of computing power: cloud computing and virtualization.

The partitioning of a database over virtual private servers can be a critical factor for scalability and performance.

The purpose of using virtual private servers is to facilitate concurrent access to individual applications (databases) residing on multiple virtual platforms on a single physical machine or on multiple physical machines, with effective resource use and management, as compared to one application (database) on one physical machine.

Page 3:

Distributed database system built on virtual private servers

The database system is composed of three tiers:
a web service (WS) client (front end)
a web service and message service system (middleware)
agents and a collection of databases (back end)

The distributed database system allows WS clients to access data from databases distributed over virtual private servers.

Databases are distributed over multiple virtual private servers by fragmenting the data with two different methods: data clustering, and horizontal (or equal) partitioning.

The distributed database system is a network of two or more PostgreSQL databases that reside on one or more virtual private servers. Our lab uses eight virtual private servers on one physical machine with the OpenVZ virtualization technology.

A WS client can simultaneously access (or query) the data in several databases in a single distributed environment.

The SQMD (Single Query Multiple Database) mechanism transmits a single query that synchronously operates on multiple databases, using the publish/subscribe paradigm (a sketch of this fan-out appears below).
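The following is a minimal, self-contained sketch of the SQMD fan-out under the publish/subscribe paradigm. The in-memory Broker class, the topic names, and the stubbed agent logic are illustrative assumptions; the slides state only that a message service broker routes query/response and heart-beat topics between the web service and the JDBC-based DB agents.

```java
import java.util.*;
import java.util.function.Consumer;

// Minimal in-memory stand-in for the message service broker; the real system
// uses a networked publish/subscribe broker between the web service and the
// DB agents.
class Broker {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    synchronized void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    synchronized void publish(String topic, String message) {
        for (Consumer<String> handler : subscribers.getOrDefault(topic, Collections.emptyList())) {
            handler.accept(message);
        }
    }
}

// A DB agent subscribes to the query topic, runs the query against its own
// database fragment (stubbed here instead of a JDBC call), and publishes the
// partial result on the response topic.
class DbAgent {
    DbAgent(Broker broker, String fragmentName) {
        broker.subscribe("query", sql ->
            broker.publish("response", fragmentName + " result for: " + sql));
    }
}

public class SqmdSketch {
    public static void main(String[] args) {
        Broker broker = new Broker();
        for (int i = 1; i <= 8; i++) {            // eight fragments, as in the slides
            new DbAgent(broker, "fragment-" + i);
        }

        // The web service subscribes to the response topic and aggregates the
        // partial results before returning them to the WS client.
        List<String> aggregated = new ArrayList<>();
        broker.subscribe("response", aggregated::add);

        // SQMD: a single query is published once and operates on all fragments.
        broker.publish("query", "SELECT cid, structure FROM pubchem_3d");
        aggregated.forEach(System.out::println);
    }
}
```

The point of the sketch is the single publish: the client-facing web service sends one query to the query topic, every fragment's agent receives it, and the web service aggregates the per-fragment responses.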

Page 4:

Scalable distributed database system architecture

[Architecture diagram: a WS client (front-end user interface) exchanges query/response messages with a web server hosting the web service and message service; the message service system (broker) routes two topics, (1) query/response and (2) heart-beat, between the web service and the DB agents (JDBC to PostgreSQL), each running on its own DB host server.]

Page 5:

Example query and Total number of hits for varying R (distance cutoff)

SELECT cid, structure FROM pubchem_3d WHERE cube_enlarge(COORDS, R, 12) @> momsim

cid: compound ID
pubchem_3d: table of 3D structures from PubChem, the public repository of chemical information including connection tables, properties, and biological assay results
COORDS: 12-D shape descriptor of the query molecule
R: user-specified distance cutoff; retrieves those points from the database whose distance to the query point is at most R
cube_enlarge: PostgreSQL function that generates the bounding hypercube around the query point
momsim: 12-D CUBE field holding the stored shape descriptor

The example query finds all rows of the database for which the 12-D shape descriptor lies in the hypercubical region defined by cube_enlarge (a JDBC sketch of executing this query appears below).
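For concreteness, here is a minimal sketch of how a DB agent might execute this query over JDBC against PostgreSQL; the slides state only that the agents use JDBC to PostgreSQL. The connection URL, credentials, the 12-D descriptor values, and the choice of R are placeholder assumptions, and the cube literal is passed as text and cast to the PostgreSQL cube type.

```java
import java.sql.*;

public class ShapeQueryExample {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection settings; not taken from the slides.
        String url = "jdbc:postgresql://localhost:5432/pubchem";
        // Hypothetical 12-D shape descriptor (COORDS) of the query molecule.
        double[] coords = {1.2, 0.8, 0.5, 0.3, 0.9, 1.1, 0.4, 0.7, 0.2, 0.6, 1.0, 0.5};
        double r = 0.5; // user-specified distance cutoff R

        // Build a cube literal '(x1, x2, ..., x12)' for the query point.
        StringBuilder cube = new StringBuilder("(");
        for (int i = 0; i < coords.length; i++) {
            if (i > 0) cube.append(", ");
            cube.append(coords[i]);
        }
        cube.append(")");

        // Same query as in the slide: cube_enlarge builds the bounding hypercube
        // of half-width R around the query point, and @> keeps the rows whose
        // momsim descriptor falls inside it.
        String sql = "SELECT cid, structure FROM pubchem_3d "
                   + "WHERE cube_enlarge(?::cube, ?, 12) @> momsim";

        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, cube.toString());
            ps.setDouble(2, r);
            try (ResultSet rs = ps.executeQuery()) {
                int hits = 0;
                while (rs.next()) {
                    rs.getString("cid"); // compound ID of a matching structure
                    hits++;
                }
                System.out.println("hits for R = " + r + ": " + hits);
            }
        }
    }
}
```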

Total number of hits for varying R, using the above query:

R                            0.3       0.4         0.5         0.6          0.7
Total number of hits         495       6,870       37,049      113,123      247,171
Size in bytes                80,837    1,121,181   6,043,337   18,447,438   40,302,297

Page 6:

Total latency = Transit cost (T_client2ws) + Web service cost (T_ws2db)

[Timing diagram: the WS client sends a query (T_query) to and receives a response (T_response) from the web service; T_client2ws is the transit cost between the client and the web service, and T_ws2db is the cost incurred from the web service through the broker to the DB agents, covering T_aggregation (aggregating the agents' responses in the web service) and T_agent2db (the agent-to-database query time).]

Page 7:

Mean query response time in a centralized (not fragmented) database

[Chart: mean query response time (milliseconds, 0 to 30,000) in a centralized database versus distance R = 0.3 to 0.7, broken into network cost, aggregation cost, and query processing cost.]

Network cost: the time to transmit a query (T_query) to, and receive a response (T_response) from, the web service running on the web server.

Aggregation cost: the time spent in the web service serially aggregating the responses from the databases.

Query processing cost: the time between an agent submitting a query to a database server and retrieving the query's responses, including the corresponding execution time of the agent.

As the distance R increases, the time needed to perform a query in the database increases because the result set grows, so the query processing cost clearly becomes the largest portion of the total cost.
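Putting these costs together with the decomposition on Page 6, one plausible formalization is the following; writing the per-fragment query processing cost as a maximum assumes the agents execute in parallel, which is our reading of the architecture rather than an equation stated in the slides.

```latex
T_{\mathrm{total}} = T_{\mathrm{client2ws}} + T_{\mathrm{ws2db}},
\qquad
T_{\mathrm{ws2db}} \approx T_{\mathrm{aggregation}} + \max_{i = 1,\dots,8} T_{\mathrm{agent2db},\,i}
```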

Page 8:

Performance analysis

We show the performance of the query/response interaction mechanism between a client and the distributed databases, including the overheads of virtualized deployments compared to real (physical) host deployments, and of two different data partitioning strategies: horizontal partitioning vs. data clustering.

In our experiments with virtual private servers: when using the data clustering method, we allocated memory to each virtual server in proportion to the size of its cluster; when using the horizontal partitioning method, we allocated the same amount of memory to each server (see the sketch below).
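A minimal sketch of the two allocation rules described above; the total memory budget, the cluster sizes, and the rounding convention are illustrative assumptions, since the slides do not give the actual figures.

```java
import java.util.Arrays;

public class MemoryAllocationSketch {
    // Proportional allocation for data clustering: each VPS gets memory
    // proportional to the number of records in its cluster.
    static long[] proportional(long totalMemoryMb, long[] clusterSizes) {
        long totalRecords = Arrays.stream(clusterSizes).sum();
        long[] alloc = new long[clusterSizes.length];
        for (int i = 0; i < clusterSizes.length; i++) {
            alloc[i] = Math.round((double) totalMemoryMb * clusterSizes[i] / totalRecords);
        }
        return alloc;
    }

    // Equal allocation for horizontal (equal) partitioning: every VPS gets
    // the same share of the memory budget.
    static long[] equal(long totalMemoryMb, int servers) {
        long[] alloc = new long[servers];
        Arrays.fill(alloc, totalMemoryMb / servers);
        return alloc;
    }

    public static void main(String[] args) {
        // Hypothetical cluster sizes (records) for the eight fragments.
        long[] clusterSizes = {120_000, 80_000, 60_000, 40_000, 30_000, 25_000, 25_000, 20_000};
        System.out.println("data clustering:        " + Arrays.toString(proportional(8192, clusterSizes)));
        System.out.println("horizontal partitioning: " + Arrays.toString(equal(8192, 8)));
    }
}
```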

Page 9:

Speedup = (mean query response time in a centralized database system) / (mean query response time in the distributed database system)

[Chart: speedup (0 to 5) versus distance R = 0.3 to 0.7 for horizontal partitioning over physical machines, horizontal partitioning over virtual private servers, and data clustering over physical machines.]

Using horizontal partitioning is faster than using data clustering, since the fragments produced by the data clustering method can differ in the number of records they contain, so the largest fragment dominates the overall query time.

Page 10:

Mean query processing time in each cluster (R = 0.5)

[Chart: mean query processing time in each cluster (T_agent2db, milliseconds, 0 to 2,500) for clusters 1 to 8 at R = 0.5, comparing data clustering over physical machines, data clustering over virtual private servers, and horizontal partitioning over physical machines.]

As the number of responses produced by a query in a large cluster grows, the time needed to perform the query in that cluster grows as well. In other words, the total active (hash) index set for the query increases as the distance R increases. To avoid as much disk access as possible, and thus to improve query processing performance, the total index set needs to fit in main memory.

Page 11:

Mean query response time

[Four charts: mean query response time (milliseconds) versus distance R = 0.3 to 0.7, broken into network cost, aggregation cost, and query processing cost, for (a) data clustering over physical machines, (b) horizontal partitioning over physical machines, (c) data clustering over virtual private servers, and (d) horizontal partitioning over virtual private servers.]

Page 12:

Summary and Future work

The SQMD mechanism, based on the publish/subscribe paradigm, transmits a single query that simultaneously operates on multiple databases, hiding the details of data distribution in the middleware to provide transparency of the distributed databases to heterogeneous web service clients.

The results of experiments with our distributed system indicate that the performance of using eight virtual private servers on one machine (host) is comparable to that of using eight physical machines (hosts).

In future work, we need to decrease the workload for aggregating the results of a query in the web service.

We will investigate the use of the M-tree index. M-tree indexes allow one to perform queries using hyperspherical regions, which would let us avoid the extra hits we currently obtain due to the hypercube representation.

To eliminate unnecessary query processing in some of the databases distributed by the data clustering method, we should consider query optimization that localizes a query to the specific databases that can contain its results.

We will also extend the evaluation to the (optimized) effective use of other resources, besides memory, in our distributed database system.