Data Partitioning in VLDB Tal Olier [email protected]

Data Partitioning in VLDB


Page 1: Data Partitioning in VLDB

Data Partitioning in VLDB

Tal Olier – [email protected]

Page 2: Data Partitioning in VLDB

Why am I here?

Tal Olier – [email protected]

~15 years in various software development positions. All of them involved database practice.

I work at HP Software, I love working there, and I came to tell you this; the lecture is just an excuse to get me into the building :)

Page 3: Data Partitioning in VLDB

Agenda

• RDBMS in short (basic terms)
• SQL reminder
• A bit about (RDBMS) architecture
• Performance – access paths
• What is a table join
• VLDB – the size factor
• VLDB – industry practice
• How joins are executed
• Summary

Page 4: Data Partitioning in VLDB

Relational Database Management System

Page 5: Data Partitioning in VLDB

A little history

• The relational model was invented in 1970
  – By Edgar Frank "Ted" Codd
  – In IBM labs
  – Oracle was first to market

Page 6: Data Partitioning in VLDB

Basics – a table

• Rows
• Columns
• Primary key

Emp_id  Emp_name  Salary
1       Dany      10,000
2       Yosi      20,000
3       Moshe     30,000
4       Eli       40,000

Page 7: Data Partitioning in VLDB

Basics – a relation

• A foreign key (constraint)
• A reference
  – Source table
  – Source column/s
  – Target table
  – Target column/s
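A minimal sketch of a foreign key in action, using Python's built-in sqlite3 module. The table and column names are assumptions for illustration; they are not taken from the slides. The source column is employee.dept_id and the target column is department.dept_id. Note that SQLite enforces foreign keys only after the PRAGMA below is issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Target table and target column: department.dept_id
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT)")
# Source table and source column: employee.dept_id
conn.execute("""
    CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,
        emp_name TEXT,
        dept_id  INTEGER REFERENCES department (dept_id)
    )
""")

conn.execute("INSERT INTO department VALUES (1, 'Sales')")
conn.execute("INSERT INTO employee VALUES (1, 'Dany', 1)")  # department 1 exists

try:
    # Department 99 does not exist, so the constraint rejects this row.
    conn.execute("INSERT INTO employee VALUES (2, 'Yosi', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

employee_count = conn.execute("SELECT COUNT(*) FROM employee").fetchall()[0][0]
print(rejected, employee_count)
```

The second insert fails because the reference has no matching row in the target table, so only the first employee is stored.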

Page 8: Data Partitioning in VLDB

People example

• People: name, height, smoking, father
• Books read: title, author
• Schedule details: from, to, activity
• Resume details: from, to, salary

Page 9: Data Partitioning in VLDB

People example

Page 10: Data Partitioning in VLDB

Structured Query Language

Page 11: Data Partitioning in VLDB

Query language

• SQL – Structured Query Language
• Declarative (vs. procedural)
• Requires internal optimization

Page 12: Data Partitioning in VLDB

SELECT query structure

• SELECT
• FROM … JOIN
• WHERE
• GROUP BY
• HAVING
• ORDER BY
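The clauses above can be seen together in one query. This is a sketch run through Python's sqlite3 module; the employee names and salaries follow the table slide, while the department table and the assignment of employees to departments are assumptions added only to give the JOIN and GROUP BY something to work on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, emp_name TEXT,
                           salary INTEGER, dept_id INTEGER);
    -- Department membership below is hypothetical sample data.
    INSERT INTO department VALUES (1, 'Sales'), (2, 'Engineering');
    INSERT INTO employee VALUES
        (1, 'Dany', 10000, 1), (2, 'Yosi', 20000, 1),
        (3, 'Moshe', 30000, 2), (4, 'Eli', 40000, 2);
""")

rows = conn.execute("""
    SELECT   d.dept_name, SUM(e.salary) AS total_salary   -- what to return
    FROM     employee e JOIN department d                 -- which tables
             ON e.dept_id = d.dept_id
    WHERE    e.salary >= 20000                            -- filter rows
    GROUP BY d.dept_name                                  -- form groups
    HAVING   SUM(e.salary) > 20000                        -- filter groups
    ORDER BY total_salary DESC                            -- sort the result
""").fetchall()
print(rows)
```

WHERE drops Dany before grouping, and HAVING then drops the Sales group (total 20000), so only Engineering remains.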

Page 13: Data Partitioning in VLDB

SQL modules

• DML (+Select) – Data manipulation language
• DDL – Data definition language
• TC – Transaction controls (commit/rollback)
• DCL – Data control language (grant/revoke)
• PE – Procedural extensions
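A small sketch of three of these modules (DDL, DML, transaction control) using Python's sqlite3 module. SQLite has no GRANT/REVOKE, so DCL is not shown; the table is hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure.
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, emp_name TEXT)")

# DML: change the data, then make the change permanent (TC: commit).
conn.execute("INSERT INTO employee VALUES (1, 'Dany')")
conn.commit()

# DML again, but this time undo it (TC: rollback).
conn.execute("INSERT INTO employee VALUES (2, 'Yosi')")
conn.rollback()

# DML (+Select): read what survived.
names = [row[0] for row in conn.execute("SELECT emp_name FROM employee").fetchall()]
print(names)
```

Only the committed row is visible after the rollback.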

Page 14: Data Partitioning in VLDB

A bit about architecture

Page 15: Data Partitioning in VLDB

Database server

[Diagram labels: Client Process, Server Process; Memory: buffer cache, log cache, other cache; I/O System: data files, log files]

Everything is blocks

Page 16: Data Partitioning in VLDB

IO bound vs. CPU bound

• CPU – what is it consumed for?
• IO – what is it consumed for?

Page 17: Data Partitioning in VLDB

Performance?

Page 18: Data Partitioning in VLDB

FTS – full table scan

• Scan the whole table – from top to bottom

[Figure: table rows for the years 2007, 2008, 2009, 2010, repeating]

Page 19: Data Partitioning in VLDB

B Tree index

• B-tree – a large fan-out (many keys per node) keeps the tree height small

Page 20: Data Partitioning in VLDB

B+ tree

• The leaves are organized in a doubly linked list
• B+ tree – allows scanning all values by traversing the leaf level only

Page 21: Data Partitioning in VLDB

Database index

• Data is sorted according to the index columns
• The leaves contain pointers to rows in the table
• Search for one value in the tree – O(log n)
• Smaller index height in B+ trees
• Index (database) operations:
  – Add/remove values
  – Index seek
  – Index scan
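To make seek and scan concrete, here is a rough sketch in Python. A sorted list of (key, row position) pairs stands in for the leaf level of a B+ tree, and binary search from the standard bisect module stands in for the descent from the root. The sample rows and function names are assumptions for illustration only.

```python
import bisect

# Table rows in arrival order: (order_year, order_no). Hypothetical sample data.
table = [(2009, "o1"), (2007, "o2"), (2010, "o3"),
         (2008, "o4"), (2007, "o5"), (2009, "o6")]

# Index: entries sorted by key, each pointing at a row position in the table.
index = sorted((year, pos) for pos, (year, _) in enumerate(table))

def full_table_scan(year):
    """Read every row and keep the matching ones: O(n)."""
    return [row for row in table if row[0] == year]

def index_seek(year):
    """Binary-search to the first entry for the key (O(log n)), then follow pointers."""
    i = bisect.bisect_left(index, (year, -1))
    rows = []
    while i < len(index) and index[i][0] == year:
        rows.append(table[index[i][1]])
        i += 1
    return rows

def index_range_scan(first, last):
    """Seek to the first key, then walk the sorted leaf entries up to the last key."""
    lo = bisect.bisect_left(index, (first, -1))
    hi = bisect.bisect_right(index, (last, len(table)))
    return [table[pos] for _, pos in index[lo:hi]]

print(index_seek(2007))
print(index_range_scan(2008, 2009))
```

The seek returns the same rows as the full scan but touches only the matching index entries; the range scan returns rows in key order because the index is sorted.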

Page 22: Data Partitioning in VLDB

Index seek/scan

[Figure: rows for the years 2007, 2008, 2009, 2010, repeating, …]

Page 23: Data Partitioning in VLDB

Join (logical)

Page 24: Data Partitioning in VLDB

Inner join

• Uses a join predicate to match rows from two tables, A and B

• Each row in table A is compared to each row in table B to find the pairs of rows that satisfy the join predicate

• Then the column values of each matched pair are combined into a result row

Page 25: Data Partitioning in VLDB

department

dept_id  dept_name
1        Sales
2        Engineering
3        Marketing

employee

emp_name  dept_id
Rina      1
Moshe     2
Shira     2
Yossi     null

Cartesian product

emp_dept_id  emp_name  dept_dept_id  dept_name
1            Rina      1             Sales
1            Rina      2             Engineering
1            Rina      3             Marketing
2            Moshe     1             Sales
2            Moshe     2             Engineering
2            Moshe     3             Marketing
2            Shira     1             Sales
2            Shira     2             Engineering
2            Shira     3             Marketing
null         Yossi     1             Sales
null         Yossi     2             Engineering
null         Yossi     3             Marketing
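The Cartesian product on this slide can be reproduced in a few lines of Python. This is only a sketch of the logical definition (pair every row with every row, then keep the pairs that satisfy the predicate); it is not how a database actually executes a join. Variable names are chosen for illustration.

```python
from itertools import product

department = [(1, "Sales"), (2, "Engineering"), (3, "Marketing")]
employee = [("Rina", 1), ("Moshe", 2), ("Shira", 2), ("Yossi", None)]  # None plays SQL null

# Every employee row paired with every department row: 4 x 3 = 12 rows.
cartesian = [(emp_dept, emp_name, dept_id, dept_name)
             for (emp_name, emp_dept), (dept_id, dept_name) in product(employee, department)]

# Logical inner join: keep only the pairs that satisfy the join predicate.
# In SQL a null never equals anything, so rows with a null key are dropped.
inner = [row for row in cartesian if row[0] is not None and row[0] == row[2]]

print(len(cartesian))
for row in inner:
    print(row)
```

Three of the twelve pairs survive the predicate; Yossi, whose dept_id is null, matches no department.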

Page 26: Data Partitioning in VLDB

Equi join

• An inner join that uses an equality comparison in the join predicate

• Example:
    select *
    from employee emp join department dept
    on emp.dept_id = dept.dept_id
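The example query can be tried against the sample department and employee rows with Python's sqlite3 module. As a small deviation from the slide, this sketch selects two named columns and adds an ORDER BY so the output is deterministic; the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER, dept_name TEXT);
    CREATE TABLE employee (emp_name TEXT, dept_id INTEGER);
    INSERT INTO department VALUES (1, 'Sales'), (2, 'Engineering'), (3, 'Marketing');
    INSERT INTO employee VALUES ('Rina', 1), ('Moshe', 2), ('Shira', 2), ('Yossi', NULL);
""")

rows = conn.execute("""
    select emp.emp_name, dept.dept_name
    from employee emp join department dept
    on emp.dept_id = dept.dept_id
    order by emp.emp_name
""").fetchall()
print(rows)
```

Yossi is absent from the result because a null dept_id is not equal to any department id, and Marketing is absent because no employee references it.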

Page 27: Data Partitioning in VLDB

Equi join

emp_dept_id  emp_name  dept_dept_id  dept_name
1            Rina      1             Sales        OK
1            Rina      2             Engineering
1            Rina      3             Marketing
2            Moshe     1             Sales
2            Moshe     2             Engineering  OK
2            Moshe     3             Marketing
2            Shira     1             Sales
2            Shira     2             Engineering  OK
2            Shira     3             Marketing
null         Yossi     1             Sales
null         Yossi     2             Engineering
null         Yossi     3             Marketing

Page 28: Data Partitioning in VLDB

RDBMS – summary in a nutshell

• Tables
• References
• Joins
• Indexes
• Blocks
• I/O

Page 29: Data Partitioning in VLDB

Very Large Data Base

Page 30: Data Partitioning in VLDB

RDBMS – summary in a nutshell

• Tables
• References
• Indexes
• Blocks
• I/O

Page 31: Data Partitioning in VLDB

VLDB – a table – size factor

Page 32: Data Partitioning in VLDB

Use case: Sales Information

• Table:
  – Customer name
  – Order number
  – Order date and time
  – List of items, amount and prices

Page 33: Data Partitioning in VLDB

Order details (2007-2010)

[Figure: one table holding order rows for 2007, 2008, 2009, 2010, repeating]

Page 34: Data Partitioning in VLDB

Remove 2007’s orders

[Figure: the same single table, with order rows for 2007, 2008, 2009, 2010, repeating, …]

Page 35: Data Partitioning in VLDB

Order details kept in 4 tables

[Figure: four tables, one per year: 2007, 2008, 2009, 2010]

Page 36: Data Partitioning in VLDB

… 4 tables – remove 2007’s data

[Figure: the four per-year tables: 2007, 2008, 2009, 2010]

Page 37: Data Partitioning in VLDB

Union view

Select * from t2007
Union all
Select * from t2008
Union all
Select * from t2009
Union all
Select * from t2010
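A sketch of the four-tables-plus-view approach using Python's sqlite3 module. The table names t2007 to t2010 come from the slide; the view name orders, the columns, and the sample rows are assumptions. The sketch also shows how a year is removed: the table is dropped and the view is redefined without it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
years = (2007, 2008, 2009, 2010)

# One physical table per year, each with two hypothetical order rows.
for year in years:
    conn.execute(f"CREATE TABLE t{year} (order_no INTEGER, order_year INTEGER)")
    conn.executemany(f"INSERT INTO t{year} VALUES (?, ?)", [(1, year), (2, year)])
conn.commit()

def create_union_view(table_years):
    """(Re)create a view that glues the per-year tables into one logical table."""
    body = " UNION ALL ".join(f"SELECT * FROM t{y}" for y in table_years)
    conn.execute(f"CREATE VIEW orders AS {body}")

create_union_view(years)
rows_before = len(conn.execute("SELECT * FROM orders").fetchall())

# Removing 2007 is now a metadata operation: drop one table, redefine the view.
conn.execute("DROP VIEW orders")
conn.execute("DROP TABLE t2007")
create_union_view(years[1:])
years_after = [row[0] for row in conn.execute(
    "SELECT DISTINCT order_year FROM orders ORDER BY order_year").fetchall()]

print(rows_before, years_after)
```

No row-by-row delete is needed to get rid of a year, which is the motivation for the partitioned table that follows.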

Page 38: Data Partitioning in VLDB

Order details kept in 4 tables and a view

[Figure: the four per-year tables (2007, 2008, 2009, 2010) combined under one view]

Page 39: Data Partitioning in VLDB

Partitioned table

[Figure: one partitioned table with four partitions: 2007, 2008, 2009, 2010]

Page 40: Data Partitioning in VLDB

Get back to: Remove 2007’s data?

[Figure: the partitioned table with partitions 2007, 2008, 2009, 2010]

Page 41: Data Partitioning in VLDB

Impact on index behavior

[Figure: the single table with rows for 2007 to 2010 repeating, next to the four partitions 2007, 2008, 2009, 2010]

Page 42: Data Partitioning in VLDB

Partitioned index (local index)

[Figure: the single table with rows for 2007 to 2010 repeating, next to the four partitions 2007, 2008, 2009, 2010, each with its own index]

Page 43: Data Partitioning in VLDB

Local indexes

• Index is bound to its partition
• Dropping a partition drops its index
• Smaller index heights
• Index is always usable
• Harder to maintain uniqueness with it
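Why uniqueness is harder with local indexes can be sketched in a few lines of Python. Here each partition owns a small set of order numbers as its local index; the data and function names are assumptions for illustration. A uniqueness check on a column that is not the partition key has to probe every partition's index, whereas dropping a partition removes its index in the same step.

```python
# One local index per partition: order year -> set of order numbers in that partition.
local_indexes = {
    2007: {101, 102},
    2008: {201},
    2009: {301, 302},
}

def is_unique(order_no):
    """Check a new order number against every local index; report how many were probed."""
    probed = 0
    for index in local_indexes.values():
        probed += 1
        if order_no in index:
            return False, probed
    return True, probed

def drop_partition(year):
    """Dropping the partition takes its local index with it; other indexes are untouched."""
    local_indexes.pop(year)

print(is_unique(999))   # a new value has to be checked in all three partitions
drop_partition(2007)
print(is_unique(101))   # 101 lived only in the dropped 2007 partition
```

A single global index would answer the uniqueness question with one probe, but it would have to be maintained when a partition is dropped.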

Page 44: Data Partitioning in VLDB

Partitioned table - concepts

• Partition column is the key for dividing the data
• Performance – only the relevant partitions are accessed
• Add/drop partition – DDL
• Local index – index is bound to a partition
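These concepts can be sketched with a plain Python dictionary keyed by the partition column. This is an illustration under simplifying assumptions (one partition per order year, rows as tuples, hypothetical function names), not a description of any particular database's implementation.

```python
from collections import defaultdict

# Partition key (order year) -> rows stored in that partition.
partitions = defaultdict(list)

def insert(order_year, order_no):
    """The partition column decides where the row is stored."""
    partitions[order_year].append((order_year, order_no))

def select_by_year(first, last):
    """Read only the partitions whose key falls in the requested range."""
    touched = [year for year in sorted(partitions) if first <= year <= last]
    rows = [row for year in touched for row in partitions[year]]
    return rows, touched

def drop_partition(year):
    """Removing a year is one operation on the partition, not a row-by-row delete."""
    partitions.pop(year, None)

# Twelve hypothetical orders, three per year.
for order_no, year in enumerate([2007, 2008, 2009, 2010] * 3):
    insert(year, order_no)

rows, touched = select_by_year(2009, 2010)
print(len(rows), touched)
```

The query for 2009 to 2010 reads two of the four partitions and never touches the 2007 and 2008 data.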

Page 45: Data Partitioning in VLDB

Star schema

Page 46: Data Partitioning in VLDB

Data tables

[Figure: a grid of data blocks belonging to Table-A, Table-B, and Table-C]

Page 47: Data Partitioning in VLDB

Let’s get back to our partitioned table

[Figure: the partitioned table with partitions 2007, 2008, 2009, 2010]

Page 48: Data Partitioning in VLDB

Dimension referencing

Fact table row:

Year  Cust_id  …
2007  1715

Dimension table row:

Cust_id  Name                  Serial signature  …
1715     Bank of America Ltd/  FFAA23472394-     …
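A sketch of dimension referencing using Python's sqlite3 module. The customer values are copied from the slide as they appear; the table names, the amount column, and its values are assumptions. The fact rows carry only the small cust_id key, and the wide descriptive columns live once in the dimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: one wide row per customer.
    CREATE TABLE customer_dim (
        cust_id          INTEGER PRIMARY KEY,
        name             TEXT,
        serial_signature TEXT
    );
    -- Fact: many thin rows, each referencing the dimension by key.
    CREATE TABLE order_fact (
        year    INTEGER,
        cust_id INTEGER REFERENCES customer_dim (cust_id),
        amount  INTEGER   -- hypothetical measure column
    );
    INSERT INTO customer_dim VALUES (1715, 'Bank of America Ltd/', 'FFAA23472394-');
    INSERT INTO order_fact VALUES (2007, 1715, 100), (2007, 1715, 250);
""")

rows = conn.execute("""
    SELECT f.year, d.name, SUM(f.amount)
    FROM order_fact f JOIN customer_dim d ON f.cust_id = d.cust_id
    GROUP BY f.year, d.name
""").fetchall()
print(rows)
```

The customer name is stored once no matter how many orders reference it, which is what keeps the fact table thin.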

Page 49: Data Partitioning in VLDB

Making fact tables thin

[Figure: the partitioned fact table (2007, 2008, 2009, 2010) surrounded by eight dimension tables]

Page 50: Data Partitioning in VLDB

Join (physical)

Page 51: Data Partitioning in VLDB

Join

To perform a join, the optimizer needs to make the following decisions:

• Access path – how to access each table
• Join order – if more than 2 tables/views are joined, which join to do first
• Join method – for each pair of row sources, how to perform the join

Page 52: Data Partitioning in VLDB

Join methods – nested loop

• One input is the outer loop, the other input is the inner loop
• The inner loop is executed for each row in the outer loop
• Effective when
  – The outer loop is small
  – The inner loop is pre-indexed
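A sketch of the nested loop method in Python, using the employee and department rows from the logical join slides. The first function is the plain double loop; the second assumes the inner input is pre-indexed (here a dictionary keyed by dept_id, which assumes dept_id is unique) so each outer row needs one lookup instead of a full inner pass. Function names are chosen for illustration.

```python
department = [(1, "Sales"), (2, "Engineering"), (3, "Marketing")]
employee = [("Rina", 1), ("Moshe", 2), ("Shira", 2), ("Yossi", None)]  # None plays SQL null

def nested_loop_join(outer, inner):
    """For each outer row, run the whole inner loop: len(outer) * len(inner) comparisons."""
    result = []
    for emp_name, emp_dept in outer:
        for dept_id, dept_name in inner:
            if emp_dept is not None and emp_dept == dept_id:
                result.append((emp_name, dept_name))
    return result

def indexed_nested_loop_join(outer, inner_index):
    """For each outer row, probe an index on the inner input instead of scanning it."""
    result = []
    for emp_name, emp_dept in outer:
        dept_name = inner_index.get(emp_dept)
        if dept_name is not None:
            result.append((emp_name, dept_name))
    return result

dept_index = dict(department)  # the "pre-indexed" inner input: dept_id -> dept_name
joined = nested_loop_join(employee, department)
print(joined)
print(indexed_nested_loop_join(employee, dept_index) == joined)
```

Both variants return the same rows; the indexed one does four lookups where the plain one does twelve comparisons.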

Page 53: Data Partitioning in VLDB

Join methods – hash

• The smaller of the 2 inputs is named the build input
• The second is the probe input
• A hash table is built from the build input
• Each row in the build input is put in the appropriate bucket
• The entire probe input is scanned

Page 54: Data Partitioning in VLDB

Join methods – hash cont’

• For each probe row the hash value is calculated
• The corresponding hash bucket is scanned to find matched rows in the build input
• Good for joining large amounts of data
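The build and probe phases can be sketched in Python, again with the employee and department rows from the earlier slides. A dictionary of lists stands in for the hash table and its buckets; this is a simplified in-memory illustration and ignores what a real database does when the build input does not fit in memory.

```python
from collections import defaultdict

department = [(1, "Sales"), (2, "Engineering"), (3, "Marketing")]     # smaller: build input
employee = [("Rina", 1), ("Moshe", 2), ("Shira", 2), ("Yossi", None)]  # larger: probe input

def hash_join(build, probe):
    # Build phase: put each build row into the bucket for its join key.
    buckets = defaultdict(list)
    for dept_id, dept_name in build:
        buckets[dept_id].append(dept_name)

    # Probe phase: scan the entire probe input once; for each row,
    # hash its key and look only inside the corresponding bucket.
    result = []
    for emp_name, emp_dept in probe:
        if emp_dept is None:          # a null key never matches
            continue
        for dept_name in buckets.get(emp_dept, ()):
            result.append((emp_name, dept_name))
    return result

joined = hash_join(department, employee)
print(joined)
```

Each input is read exactly once, which is why this method suits large inputs that have no useful index.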

Page 55: Data Partitioning in VLDB

Join methods – merge

• There is no concept of a driving table
• Both input sources are sorted according to the join key (or a sorted source such as an index is used)
• The sorted lists are merged together
• The merge itself is very fast, but it can be expensive to sort the sources
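A sketch of the merge method in Python with the same sample rows. Both inputs are sorted on the join key first (the potentially expensive part), then two cursors advance through the sorted lists in a single pass. The handling of duplicate keys on both sides is one possible approach, chosen here for clarity.

```python
department = [(1, "Sales"), (2, "Engineering"), (3, "Marketing")]
employee = [("Rina", 1), ("Moshe", 2), ("Shira", 2), ("Yossi", None)]  # None plays SQL null

def merge_join(emps, depts):
    # Sort phase: order both inputs by the join key; null keys cannot match, so drop them.
    left = sorted((e for e in emps if e[1] is not None), key=lambda e: e[1])
    right = sorted(depts, key=lambda d: d[0])

    # Merge phase: advance whichever cursor points at the smaller key.
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        key_l, key_r = left[i][1], right[j][0]
        if key_l < key_r:
            i += 1
        elif key_l > key_r:
            j += 1
        else:
            # Find the run of right rows sharing this key, then pair it with
            # every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == key_l:
                j_end += 1
            while i < len(left) and left[i][1] == key_l:
                for _, dept_name in right[j:j_end]:
                    result.append((left[i][0], dept_name))
                i += 1
            j = j_end
    return result

joined = merge_join(employee, department)
print(joined)
```

If an input already arrives sorted, for example from an index on the join key, the sort step can be skipped and only the single merge pass remains.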

Page 56: Data Partitioning in VLDB

Summary

• It’s all about I/O
• Star schema – facts and dimensions
• Partitions + local indexes
• SQL joins (probably hash)

Page 57: Data Partitioning in VLDB

Q & A