Parallel and Distributed Databases CS263 Lecture 16


LECTURE PLAN

Parallel DBMS - What and Why?

What is a Client/Server DBMS?

Why do we need Distributed DBMSs?

Date’s rules for a Distributed DBMS

Benefits of a Distributed DBMS

Issues associated with a Distributed DBMS

Disadvantages of a Distributed DBMS

PARALLEL DATABASE SYSTEM

PARALLEL DBMSs: WHY DO WE NEED THEM?

• More and More Data!

We have databases that hold very large amounts of data, in the order of 10^12 bytes:

1,000,000,000,000 bytes!

• Faster and Faster Access!

We have data applications that need to process data at very high speeds:

10,000s of transactions per second!

SINGLE-PROCESSOR DBMSs AREN’T UP TO THE JOB!

INTERQUERY PARALLELISM

It is possible to process a number of transactions in parallel with each other.

Improves Throughput.

INTRAQUERY PARALLELISM

It is possible to process ‘sub-tasks’ of a transaction in parallel with each other.

Improves Response Time.
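The two kinds of parallelism can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "table" is a hypothetical in-memory list, the function names are invented for the example, and Python threads stand in for the separate processors a real parallel DBMS would use (in CPython they will not actually speed up a CPU-bound scan).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "table": one numeric value per record.
RECORDS = list(range(1, 10_001))


def scan_partition(partition):
    """One unit of work: total the values in a list of records."""
    return sum(partition)


def intraquery_total(records, workers=4):
    """Intraquery parallelism: split ONE query into sub-tasks, one per partition."""
    size = max(1, -(-len(records) // workers))  # ceiling division, at least 1
    partitions = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(scan_partition, partitions))


def interquery_totals(queries, workers=4):
    """Interquery parallelism: run SEVERAL independent queries side by side."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scan_partition, queries))
```

In the first function a single request finishes sooner because its work is shared out (response time); in the second, more requests complete per unit of time (throughput).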

PARALLEL DBMSs: BENEFITS OF A PARALLEL DBMS

Speed-Up.

As you multiply resources by a certain factor, the time taken to execute a transaction should be reduced by the same factor:

10 seconds to scan a DB of 10,000 records using 1 CPU

1 second to scan a DB of 10,000 records using 10 CPUs
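As a rough worked example, speed-up can be computed as the ratio of the two elapsed times for the same task. The function names below are assumptions made for this sketch; the figures are the ones on the slide.

```python
def speed_up(time_small_system, time_large_system):
    """Ratio of elapsed times for the SAME task on the small and large systems."""
    return time_small_system / time_large_system


def is_linear(factor, resource_ratio, tolerance=1e-9):
    """Linear (ideal) speed-up: the factor equals the growth in resources."""
    return abs(factor - resource_ratio) <= tolerance


# Slide figures: 10 s on 1 CPU, 1 s on 10 CPUs, same 10,000-record scan.
factor = speed_up(10.0, 1.0)
linear = is_linear(factor, resource_ratio=10 / 1)
```

Here `factor` is 10.0 for a tenfold increase in CPUs, so the slide's example is linear speed-up; a factor below the resource ratio would be sub-linear.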

PARALLEL DBMSs: HOW TO MEASURE THE BENEFITS

Scale-up.

As you multiply resources the size of a task that can be executed in a given time should be increased by the same factor.

1 second to scan a DB of 1,000 records using 1 CPU

1 second to scan a DB of 10,000 records using 10 CPUs
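Scale-up can be checked in a similar way. The sketch below uses one common formulation, assumed here rather than taken from the slides: grow the task and the resources by the same factor and compare elapsed times, so that a ratio of 1.0 is linear scale-up. The function name and parameters are illustrative.

```python
def scale_up(time_small_task_small_system, time_large_task_large_system):
    """Elapsed-time ratio when task size and resources grow by the SAME factor."""
    return time_small_task_small_system / time_large_task_large_system


# Slide figures: 1,000 records on 1 CPU take 1 s; 10,000 records on 10 CPUs take 1 s.
records_ratio = 10_000 / 1_000   # the task grew tenfold
cpu_ratio = 10 / 1               # the resources grew tenfold
ratio = scale_up(1.0, 1.0)
```

With both ratios equal to 10 and `ratio` equal to 1.0, the larger system handles a task ten times bigger in the same time; a value below 1.0 would indicate sub-linear scale-up.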

[Figure: PARALLEL DBMSs: SPEED-UP. Transactions per second plotted against number of CPUs, contrasting linear speed-up (ideal) with sub-linear speed-up. Labelled values: 1000/sec, 2000/sec and 1600/sec; 5, 10 and 16 CPUs.]

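[Figure: PARALLEL DBMSs: SCALE-UP. Transactions per second plotted against number of CPUs and database size, contrasting linear scale-up (ideal) with sub-linear scale-up. Labelled values: 5 CPUs with a 1 GB database and 10 CPUs with a 2 GB database; 1000/sec and 900/sec.]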

[Figure labels: MEMORY, CPU]

Shared Memory – Parallel Database Architecture

Shared Disk – Parallel Database Architecture

Shared Nothing – Parallel Database Architecture

MAINFRAME DATABASE SYSTEM

[Figure: N terminals connected to a mainframe computer. Labels: PRESENTATION LOGIC, BUSINESS LOGIC, DATA LOGIC.]

CLIENT/SERVER DATABASE SYSTEM

CLIENT/SERVER DBMS

Manages user interface

Accepts user data

Processes application/business logic

Generates database requests (SQL)

Transmits database requests to server

Receives results from server

Formats results according to application logic

Presents results to the user

CLIENT PROCESS

CLIENT/SERVER DBMS

Accepts database requests

Processes database requests

Performs integrity checks

Handles concurrent access

Optimises queries

Performs security checks

Enacts recovery routines

Transmits result of database request to client

SERVER PROCESS

CLIENT/SERVER DBMS ARCHITECTURE (FAT CLIENT)

[Figure: clients #1, #2 and #3 exchange data requests and data responses with the database server. Labels: PRESENTATION LOGIC, BUSINESS LOGIC, DATA LOGIC.]

CLIENT/SERVER DBMS ARCHITECTURE (THIN CLIENT)

[Figure: clients #1, #2 and #3 exchange data requests and data responses with the server. Labels: PRESENTATION LOGIC, BUSINESS LOGIC, DATA LOGIC.]

DISTRIBUTED PROCESSING ARCHITECTURE

[Figure: groups of clients at the Leyton, Stratford, Barking and Leytonstone sites.]

DISTRIBUTED DATABASE SYSTEM

A distributed database system is a collection of logically related databases that co-operate in a transparent manner.

Transparent implies that each user within the system may access all of the data within all of the databases as if they were a single database.

There should be ‘location independence’: because the user is unaware of where the data is located, it is possible to move the data from one physical location to another without affecting the user.
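Location independence can be illustrated with a small sketch in which queries name only the table, and a catalog resolves the site. The class, its methods and the sample data are all assumptions invented for this example, not part of any real DDBMS.

```python
class DistributedDatabase:
    """Toy model: users name a table; the catalog decides which site holds it."""

    def __init__(self):
        self._sites = {}    # site name -> {table name -> rows}
        self._catalog = {}  # table name -> site name

    def store(self, site, table, rows):
        self._sites.setdefault(site, {})[table] = list(rows)
        self._catalog[table] = site

    def move(self, table, new_site):
        """Relocate a table; callers of query() are not affected."""
        old_site = self._catalog[table]
        rows = self._sites[old_site].pop(table)
        self.store(new_site, table, rows)

    def location_of(self, table):
        return self._catalog[table]

    def query(self, table):
        """The user supplies no site: location is resolved internally."""
        return list(self._sites[self._catalog[table]][table])


db = DistributedDatabase()
db.store("Stratford", "ACCOUNT", [(200, "JONES"), (456, "KHAN")])
before = db.query("ACCOUNT")
db.move("ACCOUNT", "Barking")
after = db.query("ACCOUNT")
```

The same `query("ACCOUNT")` call returns the same rows before and after the move, which is the sense in which the data's location is hidden from the user.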

DISTRIBUTED DATABASES: WHAT IS A DISTRIBUTED DATABASE?

DISTRIBUTED DATABASE ARCHITECTURE

[Figure: groups of clients at the Leytonstone, Stratford, Barking and Leyton sites.]

M:N CLIENT/SERVER DBMS ARCHITECTURE

[Figure: several clients connected to database servers #1 and #2.]

NOT TRANSPARENT!

COMPONENTS OF A DDBMS

[Figure: Site 1 and Site 2 connected by a computer network. Labelled components include DB, DC and LDBMS.]

LDBMS = Local DBMS
DC = Data Communications
GSC = Global Systems Catalog
DDBMS = Distributed DBMS

• Reduced Communication Overhead

Most data access is local, so it is less expensive and performs better.

• Improved Processing Power

Instead of one server handling the full database, we now have a collection of machines handling the same database.

• Removal of Reliance on a Central Site

If a server fails, then the only part of the system that is affected is the relevant local site. The rest of the system remains functional and available.

DISTRIBUTED DATABASES: ADVANTAGES

• Expandability

It is easier to accommodate increasing the size of the global (logical) database.

• Local autonomy

The database is brought nearer to its users. This can effect a cultural change as it allows potentially greater control over local data.

DISTRIBUTED DATABASES: ADVANTAGES

A distributed system looks exactly like a non-distributed system to the user!

1. Local autonomy
2. No reliance on a central site
3. Continuous operation
4. Location independence
5. Fragmentation independence
6. Replication independence
7. Distributed query processing
8. Distributed transaction processing
9. Hardware independence
10. Operating system independence
11. Network independence
12. Database independence

DISTRIBUTED DATABASES: DATE’S TWELVE RULES FOR A DDBMS

Data Allocation

Data Fragmentation

Distributed Catalogue Management

Distributed Transactions

Distributed Queries – (see chapter 20)

DISTRIBUTED DATABASES: ISSUES

1. Locality of reference: Is the data near to the sites that need it?

2. Reliability and availability: Does the strategy improve fault tolerance and accessibility?

3. Performance: Does the strategy result in bottlenecks or under-utilisation of resources?

4. Storage costs: How does the strategy affect the availability and cost of data storage?

5. Communication costs: How much network traffic will result from the strategy?

DISTRIBUTED DATABASES: DATA ALLOCATION METRICS

DISTRIBUTED DATABASES: DATA ALLOCATION STRATEGIES

[Table: each strategy is rated against Locality of Reference, Reliability/Availability, Storage Costs, Performance and Communication Costs. The ratings recoverable from the slides are listed per strategy.]

CENTRALISED: Lowest; Unsatisfactory; Highest

PARTITIONED/FRAGMENTED: Low (item) – High (system); Lowest; Satisfactory

COMPLETE REPLICATION: Highest; High (update) – Low (read)

SELECTIVE REPLICATION: Average; Satisfactory; Low (item) – High (system)

Usage: Applications are usually interested in ‘views’, not whole relations.

Efficiency: It’s more efficient if data is close to where it is frequently used.

Parallelism: It is possible to run several ‘sub-queries’ in parallel.

Security: Data not required by local applications is not stored at the local site.

DISTRIBUTED DATABASES: WHY FRAGMENT DATA?

DISTRIBUTED DATABASES: HORIZONTAL DATA FRAGMENTATION

ACCOUNT  CUSTOMER  BRANCH     BALANCE
200      JONES     STRATFORD  1000.00
324      GRAY      BARKING     200.00
345      SMITH     STRATFORD    23.17
350      GREEN     BARKING     340.14
400      ONO       BARKING     500.00
456      KHAN      STRATFORD   333.00

Horizontal Fragmentation: Consists of a Restriction on a Relation.

e.g., σ_{branch = ‘Stratford’}(Account)
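A minimal sketch of horizontal fragmentation, using the Account rows from the slide. The `restrict` helper and the dictionary layout are assumptions made for the illustration; the point is that each fragment is a restriction on the relation and that the union of the fragments gives back the original.

```python
ACCOUNT = [
    {"account": 200, "customer": "JONES", "branch": "STRATFORD", "balance": 1000.00},
    {"account": 324, "customer": "GRAY", "branch": "BARKING", "balance": 200.00},
    {"account": 345, "customer": "SMITH", "branch": "STRATFORD", "balance": 23.17},
    {"account": 350, "customer": "GREEN", "branch": "BARKING", "balance": 340.14},
    {"account": 400, "customer": "ONO", "branch": "BARKING", "balance": 500.00},
    {"account": 456, "customer": "KHAN", "branch": "STRATFORD", "balance": 333.00},
]


def restrict(relation, predicate):
    """Relational restriction (selection): keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


# One horizontal fragment per branch.
stratford = restrict(ACCOUNT, lambda row: row["branch"] == "STRATFORD")
barking = restrict(ACCOUNT, lambda row: row["branch"] == "BARKING")

# The original relation is recovered as the union of its fragments.
rebuilt = sorted(stratford + barking, key=lambda row: row["account"])
```

Each branch can then hold only its own fragment locally, while a global query over all accounts is answered from the union.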

DISTRIBUTED DATABASES: HORIZONTAL DATA FRAGMENTATION

STRATFORD BRANCH

ACCT NO.  CUSTOMER  BRANCH     BALANCE
200       JONES     STRATFORD  1000.00
345       SMITH     STRATFORD    23.17
456       KHAN      STRATFORD   333.00

BARKING BRANCH

ACCT NO.  CUSTOMER  BRANCH   BALANCE
324       GRAY      BARKING   200.00
350       GREEN     BARKING   340.14
400       ONO       BARKING   500.00

DISTRIBUTED DATABASES: VERTICAL DATA FRAGMENTATION

S#   NAME   SITE       PHONE NO       LOGIN    PASSWORD
200  JONES  STRATFORD  0208-500-9000  JON200T  XXYY22
324  GRAY   BARKING    0208-545-7528  GRA324S  ZZEE56
456  KHAN   STRATFORD  0208-500-5821  KHA456T  KJTR78

Vertical Fragmentation: Consists of a Projection on a Relation.

e.g., π_{S#, NAME, SITE, PHONE NO}(Student)
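Vertical fragmentation can be sketched the same way, using the Student rows from the slide. The `project` and `join_on_key` helpers are assumptions for the illustration. Both fragments keep the key S#, which is what allows the original relation to be rebuilt by a join.

```python
STUDENT = [
    {"S#": 200, "NAME": "JONES", "SITE": "STRATFORD", "PHONE NO": "0208-500-9000",
     "LOGIN": "JON200T", "PASSWORD": "XXYY22"},
    {"S#": 324, "NAME": "GRAY", "SITE": "BARKING", "PHONE NO": "0208-545-7528",
     "LOGIN": "GRA324S", "PASSWORD": "ZZEE56"},
    {"S#": 456, "NAME": "KHAN", "SITE": "STRATFORD", "PHONE NO": "0208-500-5821",
     "LOGIN": "KHA456T", "PASSWORD": "KJTR78"},
]


def project(relation, attributes):
    """Relational projection: keep only the named attributes of each row."""
    return [{name: row[name] for name in attributes} for row in relation]


def join_on_key(left, right, key):
    """Rebuild rows by matching the shared key attribute in both fragments."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left]


# Each fragment repeats the key S# so that the rows can be matched up again.
student_admin = project(STUDENT, ["S#", "NAME", "SITE", "PHONE NO"])
network_admin = project(STUDENT, ["S#", "LOGIN", "PASSWORD"])
rebuilt = join_on_key(student_admin, network_admin, "S#")
```

Student administration never sees logins or passwords, and network administration never sees phone numbers, yet no information is lost overall.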

DISTRIBUTED DATABASES: VERTICAL DATA FRAGMENTATION

STUDENT ADMINISTRATION

S#   NAME   SITE       PHONE NO.
200  JONES  STRATFORD  0208-500-9000
324  GRAY   BARKING    0208-545-7528
456  KHAN   STRATFORD  0208-500-5821

NETWORK ADMINISTRATION

S#   LOGIN-ID  PASSWORD
200  JON200T   XXYY22
324  GRA324S   ZZEE56
456  KHA456T   KJTR78

DISTRIBUTED DATABASES: DISTRIBUTED CATALOG MANAGEMENT

• Centralised Global Catalog

One site maintains the full global catalog. All changes to any local system catalog have to be propagated to the site maintaining the global catalog. This performs badly, creates a single point of failure and compromises site autonomy.

• Dispersed Catalog

There is no physical global catalog. Each time a remote data item is required, the catalogs from ALL other sites are examined for the item. This has severe performance penalties.

DISTRIBUTED DATABASES: DISTRIBUTED CATALOG MANAGEMENT

• Replicated Global Catalog

Each site maintains its own copy of the global catalog. Although this greatly speeds up remote data location, it is very inefficient to maintain. Details of every data item added, changed or deleted locally have to be propagated to ALL other sites.

• Local-Master Catalog

Each site maintains both its local system catalog and a catalog of all of its data items that are replicated at other sites. This avoids compromising site autonomy, is fairly efficient, and is not a single point of failure.
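The trade-off between the dispersed and replicated schemes can be made concrete by counting the work each one does. The sketch below is a toy model: the site contents, item names and function names are all hypothetical, and "cost" is reduced to a count of remote catalogs examined or updated.

```python
# Hypothetical local catalogs: site name -> names of the data items held there.
LOCAL_CATALOGS = {
    "Stratford": {"ACCOUNT_STRATFORD"},
    "Barking": {"ACCOUNT_BARKING"},
    "Leyton": {"STUDENT"},
}


def dispersed_lookup(local_site, item, local_catalogs):
    """Dispersed catalog: examine the catalog of EVERY other site for the item."""
    owner, examined = None, 0
    for site, catalog in local_catalogs.items():
        if site == local_site:
            continue
        examined += 1
        if item in catalog:
            owner = site
    return owner, examined


def replicated_lookup(item, global_catalog_copy):
    """Replicated global catalog: a local lookup, no remote catalogs examined."""
    return global_catalog_copy.get(item), 0


def replicated_add(item, owner, copies):
    """A new item is recorded in every copy; returns the remote updates needed."""
    for copy in copies.values():
        copy[item] = owner
    return len(copies) - 1


global_catalog = {item: site for site, items in LOCAL_CATALOGS.items() for item in items}
copies = {site: dict(global_catalog) for site in LOCAL_CATALOGS}
```

With three sites, a dispersed lookup examines two remote catalogs per request, while the replicated scheme answers locally but pays two remote updates for every catalog change.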

DISTRIBUTED DATABASES: DISTRIBUTED TRANSACTIONS

[Figure: the Stratford client, the Stratford, Barking and Leyton DBMSs, and their databases (Stratford DB, Barking DB, Leyton DB).]

Global Transaction

(a) Debit Stratford A/C £500
(b) Credit Barking A/C £350
(c) Credit Leyton A/C £150

[Figure: TWO-PHASE COMMIT (2PC) - OK]

[Figure: TWO-PHASE COMMIT (2PC) - ABORT, ending in ‘Global Abort’]
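The idea behind two-phase commit can be sketched as follows: in phase 1 every participating site votes on whether it can apply its part of the global transaction, and in phase 2 the change is committed everywhere only if all votes were positive. This is a simplified, single-process illustration. The class, the opening balances and the "no overdraft" voting rule are assumptions, and a real protocol also has to handle logging, timeouts and site failures.

```python
class Participant:
    """One site's DBMS, holding a single account balance for the sketch."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self._pending = None

    def prepare(self, delta):
        """Phase 1: vote yes only if the change can be applied (assumed rule: no overdraft)."""
        if self.balance + delta < 0:
            return False
        self._pending = delta
        return True

    def commit(self):
        """Phase 2 (commit): make the prepared change permanent."""
        if self._pending is not None:
            self.balance += self._pending
            self._pending = None

    def abort(self):
        """Phase 2 (abort): discard the prepared change."""
        self._pending = None


def two_phase_commit(changes):
    """changes: list of (participant, delta) pairs. The coordinator runs both phases."""
    votes = [site.prepare(delta) for site, delta in changes]  # phase 1: collect votes
    if all(votes):
        for site, _ in changes:                               # phase 2: global commit
            site.commit()
        return "GLOBAL COMMIT"
    for site, _ in changes:                                   # phase 2: global abort
        site.abort()
    return "GLOBAL ABORT"


# The global transaction from the slide, with hypothetical opening balances.
stratford = Participant("Stratford", 1000)
barking = Participant("Barking", 200)
leyton = Participant("Leyton", 0)
outcome = two_phase_commit([(stratford, -500), (barking, +350), (leyton, +150)])
```

If the Stratford account could not cover the £500 debit, its vote would be negative and no site would apply its change, so the credits at Barking and Leyton could never happen without the matching debit.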

Architectural complexity.

Security.

Integrity control more difficult.

Lack of standards.

Lack of experience.

Database design more complex.

DISTRIBUTED DATABASES: DISADVANTAGES OF DDBMSs
