24
Distributed Databases by Chien-Pin Hsu CS157B Section 1 Nov 11, 2004 Dr. Sin-Min Lee

Distributed databases

Embed Size (px)

Citation preview

Page 1: Distributed databases

Distributed Databases

by Chien-Pin HsuCS157B Section 1

Nov 11, 2004

Dr. Sin-Min Lee

Page 2: Distributed databases

Distributed Database System

A distributed database system consists of loosely coupled sites that share no physical component

Appears to user as a single systemDatabase systems that run on each site are

independent of each otherProcessing maybe done at a site other than the

initiator of request

Page 3: Distributed databases

Homogenous Distributed Database Systems

All sites have identical software They are aware of each other and agree to

cooperate in processing user requests It appears to user as a single system

Page 4: Distributed databases

An Homogenous Distributed Database Systems example

A distributed system connects three databases: hq, mfg, and sales An application can simultaneously access or modify the data in

several databases in a single distributed environment.

Page 5: Distributed databases

What can we do?

A single query from a Manufacturing client on local database mfg can retrieve joined data from the products table on the local database and the dept table on the remote hq database.

For a client application, the location and platform of the databases are transparent.

Page 6: Distributed databases

Makes life easier!!

For example, if you are connected to database mfg but want to access data on database hq, creating a synonym on mfg for the remote dept table enables you to issue this query:

SELECT * FROM dept

In this way, a distributed system gives the appearance of native data access.

Users on mfg do not have to know that the data they access resides on remote databases.

Page 7: Distributed databases

Heterogeneous Distributed Database System

In a heterogeneous distributed database system, at least one of the databases uses different schemas and software.

A database system having different schema may cause a major problem for query processing.

A database system having different software may cause a major problem for transaction processing.

Page 8: Distributed databases

Distributed Data Storage

Replication– System maintains multiple copies of data, stored in

different sites, for faster retrieval and fault tolerance. Fragmentation

– Relation is partitioned into several fragments stored in distinct sites

Replication and fragmentation can be combined• Relation is partitioned into several fragments: system

maintains several identical replicas of each such fragment.

Page 9: Distributed databases

Advantages of Replication

Availability: failure of site containing relation r does not result in unavailability of r is replicas exist.

Parallelism: queries on r may be processed by several nodes in parallel.

Reduced data transfer: relation r is available locally at each site containing a replica of r.

Page 10: Distributed databases

Disadvantages of Replication

Increased cost of updates: each replica of relation r must be updated.

Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented.

• One solution: choose one copy as primary copy and apply concurrency control operations on primary copy.

Page 11: Distributed databases

Fragmentation

Data can be distributed by storing individual tables at different sites

Data can also be distributed by decomposing a table and storing portions at different sites – called Fragmentation

Fragmentation can be horizontal or vertical

Page 12: Distributed databases

Why use Fragmentation?

Usage - in general applications use views so it’s appropriate to work with subsets

Efficiency - data stored close to where it is most frequently used Parallelism - a transaction can divided into several sub-queries to

increase degree of concurrency Security - data more secure - only stored where it is needed

Disadvantages:

Performance - may be slower

Integrity - more difficult

Page 13: Distributed databases

Horizontal Fragmentation

Each fragment, Ti , of table T contains a

subset of the rows Each tuple of T is assigned to one or more

fragments.Horizontal fragmentation is lossless

Page 14: Distributed databases

Horizontal Fragmentation Example

A bank account schema has a relation Account-schema = (branch-name, account-number, balance).

It fragments the relation by location and stores each fragment locally: rows with branch-name = `Hillside` are stored in the Hillside in a fragment

Page 15: Distributed databases

Vertical Fragmentation

Each fragment, Ti, of T contains a subset of the columns, each column is in at least one fragment, and each fragment includes the key:

Ti = attr_listi (T)

T = T1 T2 ….. Tn

All schemas must contain a common candidate key (or superkey) to ensure lossless join property.

A special attribute, the tuple-id attribute may be added to each schema to serve as a candidate key.

Page 16: Distributed databases

Vertical Fragmentation Example

A employee-info schema has a relation employee-info schema = (designation, name, Employee-id, salary).

It fragments the relation to put information in two tables for security concern.

Page 17: Distributed databases

Commit Protocols

Commit protocols are used to ensure atomicity across sites

Atomicity states that database modifications must follow an “all or nothing” rule.

a transaction which executes at multiple sites must either be committed at all the sites, or aborted at all the sites.

Page 18: Distributed databases

The Two-Phase Commit (2 PC) Protocol

What is this?

Two-phase commit is a transaction protocol designed for the complications that arise with distributed resource managers.

Two-phase commit technology is used for hotel and airline reservations, stock market transactions, banking applications, and credit card systems.

With a two-phase commit protocol, the distributed transaction manager employs a coordinator to manage the individual resource managers. The commit process proceeds as follows:

Page 19: Distributed databases

Phase1: Obtaining a Decision

Step 1 Coordinator asks all participants to prepare to commit transaction Ti.

Ci adds the records <prepare T> to the log and forces log to stable storage (a log is a file which maintains a record of all changes to the database)

sends prepare T messages to all sites where

T executed

Page 20: Distributed databases

Phase1: Making a Decision

Step 2 Upon receiving message, transaction manager at site determines if it can commit the transaction if not:

add a record <no T> to the log and send abort T message to Ci

if the transaction can be committed, then:

1). add the record <ready T> to the log

2). force all records for T to stable storage

3). send ready T message to Ci

Page 21: Distributed databases

Phase 2: Recording the Decision

Step 1 T can be committed of Ci received a ready T message from all the participating sites: otherwise T must be aborted.

Step 2 Coordinator adds a decision record, <commit T> or <abort T>, to the log and forces record onto stable storage. Once the record is in stable storage, it cannot be revoked (even if failures occur)

Step 3 Coordinator sends a message to each participant informing it of the decision (commit or abort)

Step 4 Participants take appropriate action locally.

Page 22: Distributed databases

Two-Phase Commit Diagram

Page 23: Distributed databases

Costs and Limitations

There have been two performance issues with two phase commit: – If one database server is unavailable, none of

the servers gets the updates. – This is correctable through network tuning

and correctly building the data distribution through database optimization techniques.

Page 24: Distributed databases