ICS 421 Spring 2010: Parallel & Distributed Databases. Asst. Prof. Lipyeow Lim, Information & Computer Science Department, University of Hawaii at Manoa. 2/2/2010.


Page 1: ICS 421 Spring 2010 Parallel & Distributed Databases

ICS 421 Spring 2010
Parallel & Distributed Databases

Asst. Prof. Lipyeow Lim
Information & Computer Science Department
University of Hawaii at Manoa

2/2/2010 Lipyeow Lim -- University of Hawaii at Manoa

Page 2: Why Parallel Data Access?

SELECT * FROM mydata

1 Terabyte (= 1024 GB)

• At 10 MB/s: 1.2 days to scan
• At 1000 x parallel 10 MB/s: 1.5 mins to scan!
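The scan times on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, assuming binary units (1 TB = 1024 GB, 1 GB = 1024 MB), a perfectly even split of the data, and ideal linear scaling across disks; the function name `scan_seconds` is made up for illustration. Under these assumptions the serial scan matches the slide's 1.2 days, while the parallel scan comes out nearer 1.7 minutes than the slide's rounded 1.5.

```python
def scan_seconds(size_mb, mb_per_sec_per_disk, num_disks=1):
    """Time to scan size_mb when num_disks each read an equal share in parallel."""
    return size_mb / (mb_per_sec_per_disk * num_disks)


ONE_TB_IN_MB = 1024 * 1024  # assumption: 1 TB = 1024 GB and 1 GB = 1024 MB

serial_days = scan_seconds(ONE_TB_IN_MB, 10) / 86400
parallel_minutes = scan_seconds(ONE_TB_IN_MB, 10, num_disks=1000) / 60

print(f"1 disk at 10 MB/s:      {serial_days:.2f} days")        # about 1.21 days
print(f"1000 disks at 10 MB/s:  {parallel_minutes:.2f} minutes")  # about 1.75 minutes
```

Real systems rarely reach this ideal, since skewed partitions and coordination overhead both add time.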

Page 3: Multi-Petabyte Databases

How large is a petabyte?
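One way to get a feel for the question is to reuse the 10 MB/s scan rate from the previous slide. The sketch below assumes the same binary convention (1 PB = 1024 TB = 1024 x 1024 GB); the variable names are illustrative only.

```python
MB_PER_GB = 1024
GB_PER_TB = 1024
TB_PER_PB = 1024  # assumption: binary units, as in "1 Terabyte (= 1024 GB)"

pb_in_mb = TB_PER_PB * GB_PER_TB * MB_PER_GB

# Scan time for one petabyte at 10 MB/s, on one disk and on 1000 disks.
days_one_disk = pb_in_mb / 10 / 86400
days_1000_disks = days_one_disk / 1000

print(f"1 PB = {pb_in_mb:,} MB")
print(f"1 disk:     {days_one_disk:,.0f} days (about {days_one_disk / 365:.1f} years)")
print(f"1000 disks: {days_1000_disks:.2f} days")
```

Under these assumptions, even 1000 disks in parallel need more than a day to scan a single petabyte.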

Page 4: Parallel DBMS

• eBay’s main Teradata data warehouse (DW):
  – > 2 petabytes of user data
  – 10s of 1000s of users
  – Millions of queries per day
  – 72 nodes
  – >140 GB/sec of I/O, or 2 GB/node/sec
• eBay’s Greenplum DW
  – 6 1/2 petabytes of user data
  – 96 nodes
  – 200 MB/node/sec of I/O
• Walmart – 2.5 petabytes
• Bank of America – 1.5 petabytes

• Some parallel DBMSs besides the usual Oracle-IBM-MS trio:
  – Teradata
  – Netezza
  – Vertica
  – DATAllegro
  – Greenplum
  – Aster Data
  – Infobright
  – Kognitio, Kickfire, Dataupia, ParAccel, Exasol, ...

Page 5: Parallelism

Parallelism is natural to DBMS processing

Pipeline parallelism
• many machines each doing one step in a multi-step process.

Node 1      Node 2      Node 3
Q1:step1    Q1:step2    Q1:step3
Q2:step1    Q2:step2    Q2:step3
Q3:step1    Q3:step2    Q3:step3
Q4:step1    Q4:step2
Q5:step1

Partition parallelism
• many machines doing the same thing to different pieces of data.

[Figure: Node 1, Node 2 and Node 3 all processing Query 1]
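The two kinds of parallelism can be sketched in a few lines. This is an illustration only: `pipeline_schedule` and `partition_apply` are hypothetical helper names, and the "nodes" are simulated sequentially in one process rather than run on separate machines.

```python
def pipeline_schedule(num_queries, num_steps):
    """Map each time tick to the (node, work) pairs active in that tick.

    Node k always runs step k, so query q reaches node k at tick q + k - 1.
    """
    schedule = {}
    for query in range(1, num_queries + 1):
        for step in range(1, num_steps + 1):
            tick = query + step - 1
            schedule.setdefault(tick, []).append((step, f"Q{query}:step{step}"))
    return schedule


def partition_apply(data, num_nodes, work):
    """Give every node the same function but a different piece of the data."""
    pieces = [data[node::num_nodes] for node in range(num_nodes)]
    # A real system would run these calls on separate machines at the same
    # time; here they run one after another to keep the sketch simple.
    return [work(piece) for piece in pieces]


# Pipeline parallelism: 5 queries flowing through 3 single-step nodes.
schedule = pipeline_schedule(num_queries=5, num_steps=3)
for tick in sorted(schedule):
    print(tick, schedule[tick])

# Partition parallelism: 3 nodes each sum their own piece of the data.
partial_sums = partition_apply(list(range(10)), num_nodes=3, work=sum)
print(partial_sums, sum(partial_sums))  # [18, 12, 15] 45
```

At tick 3 all three nodes are busy at once (Q1 on step 3, Q2 on step 2, Q3 on step 1), which is the overlap the pipeline diagram shows.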

Page 6: Parallelism Terminology

• Speed-up
  – Same job + more resources = less time
• Scale-up
  – Bigger job + more resources = same time
• Transaction scale-up
  – More clients + more resources = same time

[Figure: Xact/sec. (throughput) vs. degree of parallelism, with the "Ideal" curve marked]

[Figure: sec./Xact (response time) vs. degree of parallelism, with the "Ideal" curve marked]
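One common way to quantify these terms is as ratios of elapsed times. The sketch below assumes that convention; the function names and the timings are invented for illustration and are not measurements from any system.

```python
def speed_up(time_small_system, time_big_system):
    """Same job on both systems; the ideal value equals the resource ratio."""
    return time_small_system / time_big_system


def scale_up(time_small_job_small_system, time_big_job_big_system):
    """Job and resources grow together; the ideal value is 1.0 (same time)."""
    return time_small_job_small_system / time_big_job_big_system


# Hypothetical elapsed times in seconds, for illustration only.
print(speed_up(1000.0, 125.0))   # 8.0: linear speed-up if resources grew 8x
print(scale_up(1000.0, 1250.0))  # 0.8: below the ideal of 1.0
```

Transaction scale-up can be read the same way, with "job size" replaced by the number of clients submitting transactions.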

Page 7: Parallel Architecture: Share What?

• Shared Memory (SMP)
  – Easy to program
  – Expensive to build
  – Difficult to scale up
  – Sequent, SGI, Sun
• Shared Disk
  – VMScluster, Sysplex
• Shared Nothing (network)
  – Hard to program
  – Cheap to build
  – Easy to scale up
  – Tandem, Teradata, SP2

[Figure: CLIENTS connected to Processors and Memory under each of the three architectures]

Page 8: Different Types of DBMS Parallelism

• Intra-operator parallelism
  – get all machines working to compute a given operation (scan, sort, join)
• Inter-operator parallelism
  – each operator may run concurrently on a different site (exploits pipelining)
• Inter-query parallelism
  – different queries run on different sites
• We’ll focus on intra-query parallelism
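Intra-operator parallelism can be illustrated with a single scan-and-count operator whose input is split across workers. This is a minimal sketch, assuming threads stand in for machines; `parallel_scan_count` is a hypothetical name, and because of Python's global interpreter lock the example shows the structure of the technique rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_scan_count(rows, predicate, num_workers=4):
    """Count the rows matching predicate, one chunk of the input per worker."""
    # Split the input so that every worker scans a different piece of the data.
    chunks = [rows[i::num_workers] for i in range(num_workers)]

    def scan(chunk):
        return sum(1 for row in chunk if predicate(row))

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(scan, chunks))

    # Combine the partial results into the answer for the whole operator.
    return sum(partial_counts)


even_count = parallel_scan_count(list(range(100)), lambda row: row % 2 == 0)
print(even_count)  # 50
```

The same split, work, combine shape applies to sorts and joins, although their combine steps are more involved than a sum.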

Page 9: Parallel vs Distributed DBMS

• A parallel database system
  – Improve performance via parallelization of various operations such as loading data, building indexes, evaluating queries
• A distributed database system
  – Data is physically stored across several (geographical) sites
  – Each site is managed by an independent DBMS
  – Distribution governed by factors like local ownership & increased availability
• The boundaries of these traditional definitions are blurring.

Page 10: Types of Distributed DBMS

• Homogeneous: Every site runs same type of DBMS.
  – Parallel DBMSs are usually homogeneous
• Heterogeneous: Different sites run different DBMSs (different RDBMSs or even non-relational DBMSs).

[Figure: a Gateway connecting DBMS 1, DBMS 2 and DBMS 3]

Page 11: Data Partitioning & Fragmentation

• Parallel DB
  – Data partitioning
• Distributed DB
  – Fragmentation
• Same basic problem: How do we break up the data (tables) and spread them amongst the “nodes”?
  – Horizontal vs Vertical
  – Range vs Hash
  – Replication
• DB user’s view should be one single table.

[Figure: a table with a TID column and tuples t1, t2, t3, t4]
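The difference between horizontal and vertical fragmentation can be sketched on a four-tuple table like the one in the figure. The columns `name` and `city` and all helper names below are hypothetical; only the tuple ids t1 to t4 come from the slide.

```python
# Hypothetical table; the TID column identifies each tuple.
rows = [
    {"tid": "t1", "name": "Ann", "city": "Hilo"},
    {"tid": "t2", "name": "Bob", "city": "Kona"},
    {"tid": "t3", "name": "Cy", "city": "Hilo"},
    {"tid": "t4", "name": "Di", "city": "Kona"},
]


def horizontal_fragments(rows, num_nodes):
    """Each fragment holds whole rows; the fragments differ in which rows."""
    return [rows[node::num_nodes] for node in range(num_nodes)]


def vertical_fragments(rows, column_groups, key="tid"):
    """Each fragment holds some columns of every row, plus the key."""
    return [
        [{key: row[key], **{col: row[col] for col in group}} for row in rows]
        for group in column_groups
    ]


def reassemble_vertical(fragments, key="tid"):
    """Join vertical fragments back on the key so users see one single table."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


h_frags = horizontal_fragments(rows, num_nodes=2)
v_frags = vertical_fragments(rows, column_groups=[["name"], ["city"]])
print([r["tid"] for r in h_frags[0]])  # ['t1', 't3']
print(v_frags[1][0])                   # {'tid': 't1', 'city': 'Hilo'}
print(reassemble_vertical(v_frags) == rows)  # True
```

Repeating the key in every vertical fragment is what makes the reassembly possible.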

Page 12: Automatic Data Partitioning

Partitioning a table:

• Range
  – Good for equijoins, range queries, group-by
• Hash
  – Good for equijoins
• Round Robin
  – Good to spread load

[Figure: under each scheme the table is spread over five partitions, labeled A...E, F...J, K...N, O...S, T...Z]

• Shared disk and memory less sensitive to partitioning
• Shared nothing benefits from "good" partitioning
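The three schemes can each be written as a function that maps a row to one of five partitions. This is a sketch under stated assumptions: keys are strings starting with a letter A to Z, the range boundaries follow the A...E, F...J, K...N, O...S, T...Z labels from the slide, and CRC-32 is used as the hash only because it is deterministic across runs. No particular DBMS is claimed to work this way.

```python
import zlib
from bisect import bisect_left

# Upper bound of each range partition: A...E, F...J, K...N, O...S, T...Z.
UPPER_BOUNDS = ["E", "J", "N", "S", "Z"]
NUM_PARTITIONS = len(UPPER_BOUNDS)


def range_partition(key):
    """Neighbouring keys land together, which helps range queries."""
    index = bisect_left(UPPER_BOUNDS, key[0].upper())
    return min(index, NUM_PARTITIONS - 1)  # clamp keys that sort after "Z"


def hash_partition(key, num_partitions=NUM_PARTITIONS):
    """Equal keys always land together, which helps equijoins."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def round_robin_partition(row_number, num_partitions=NUM_PARTITIONS):
    """Rows are dealt out in turn, which spreads the load evenly."""
    return row_number % num_partitions


names = ["Alice", "Frank", "Nina", "Oscar", "Zoe"]
print([range_partition(n) for n in names])              # [0, 1, 2, 3, 4]
print([hash_partition(n) for n in names])
print([round_robin_partition(i) for i in range(7)])     # [0, 1, 2, 3, 4, 0, 1]
```

Round robin ignores the key entirely, so it balances load well but gives the query processor no way to tell which partition holds a given value.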