39
Introduction to Data Management CSE 344 Lecture 24: Transactions (wrap up) Parallel Databases CSE 344 - Winter 2015 1

Lecture 24: Transactions (wrap up) Parallel Databases

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Lecture 24: Transactions (wrap up) Parallel Databases

Introduction to Data Management CSE 344

Lecture 24: Transactions (wrap up) Parallel Databases

CSE 344 - Winter 2015 1

Page 2: Lecture 24: Transactions (wrap up) Parallel Databases

Announcements

•  HW7 due today •  HW8 (last one!) will be posted by tomorrow /

Wednesday – Will take more hours than other HWs (complex

setup, queries run for many hours) – Plan ahead! – Details in sections this week

•  Today: wrap up transaction implementations and start parallel databases – Traditional, MapReduce+PigLatin

CSE 344 - Winter 2015 2

Page 3: Lecture 24: Transactions (wrap up) Parallel Databases

Last Friday

•  Using locks to implement scheduler

CSE 344 - Winter 2015 3

None READ LOCK

RESERVED LOCK

PENDING LOCK

EXCLUSIVE LOCK

commit executed

begin transaction first write no more read locks commit requested

commit

Page 4: Lecture 24: Transactions (wrap up) Parallel Databases

Last Friday •  Problems with using locks:

– Serializability of schedules – Recoverable schedules – Deadlocks

•  Solutions: –  2PL, strict 2PL – Deadlock detection

•  Exclusive and shared locks

CSE 344 - Winter 2015 4

Page 5: Lecture 24: Transactions (wrap up) Parallel Databases

5

Lock Granularity

•  Fine granularity locking (e.g., tuples) –  High concurrency –  High overhead in managing locks –  E.g. SQL Server

•  Coarse grain locking (e.g., tables, entire database) –  Many false conflicts –  Less overhead in managing locks –  E.g. SQL Lite

CSE 344 - Winter 2015

Page 6: Lecture 24: Transactions (wrap up) Parallel Databases

Lock Performance

CSE 344 - Winter 2015 6

Thro

ughp

ut (T

PS

)

# Active Transactions

thrashing

Why ?

TPS = Transactions per second

Page 7: Lecture 24: Transactions (wrap up) Parallel Databases

7

Phantom Problem

•  So far we have assumed the database to be a static collection of elements (=tuples)

•  If tuples are inserted/deleted then the phantom problem appears

CSE 344 - Winter 2015

Page 8: Lecture 24: Transactions (wrap up) Parallel Databases

Phantom Problem

Is this schedule serializable ?

T1 T2 SELECT * FROM Product WHERE color=‘blue’

INSERT INTO Product(name, color) VALUES (‘A3’,’blue’)

SELECT * FROM Product WHERE color=‘blue’

Suppose there are two blue products, A1, A2:

CSE 344 - Winter 2015 8

Page 9: Lecture 24: Transactions (wrap up) Parallel Databases

Phantom Problem

9

R1(A1),R1(A2),W2(A3),R1(A1),R1(A2),R1(A3)

T1 T2 SELECT * FROM Product WHERE color=‘blue’

INSERT INTO Product(name, color) VALUES (‘A3’,’blue’)

SELECT * FROM Product WHERE color=‘blue’

CSE 344 - Winter 2015

Suppose there are two blue products, A1, A2:

Page 10: Lecture 24: Transactions (wrap up) Parallel Databases

Phantom Problem

10

R1(A1),R1(A2),W2(A3),R1(A1),R1(A2),R1(A3)

T1 T2 SELECT * FROM Product WHERE color=‘blue’

INSERT INTO Product(name, color) VALUES (‘A3’,’blue’)

SELECT * FROM Product WHERE color=‘blue’

W2(A3),R1(A1),R1(A2),R1(A1),R1(A2),R1(A3)

Suppose there are two blue products, A1, A2:

Page 11: Lecture 24: Transactions (wrap up) Parallel Databases

11

Phantom Problem

•  A “phantom” is a tuple that is invisible during part of a transaction execution but not invisible during the entire execution

•  In our example: – T1: reads list of products – T2: inserts a new product – T1: re-reads: Whoa! a new product appears !

CSE 344 - Winter 2015

Page 12: Lecture 24: Transactions (wrap up) Parallel Databases

Dealing With Phantoms

•  Lock the entire table •  Lock the index entry for ‘blue’

–  If index is available •  Or use predicate locks

– A lock on an arbitrary predicate

Dealing with phantoms is expensive ! CSE 344 - Winter 2015 12

Page 13: Lecture 24: Transactions (wrap up) Parallel Databases

13

Isolation Levels in SQL

1.  “Dirty reads” SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

2.  “Committed reads” SET TRANSACTION ISOLATION LEVEL READ COMMITTED

3.  “Repeatable reads” SET TRANSACTION ISOLATION LEVEL REPEATABLE READ

4.  Serializable transactions SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

ACID

CSE 344 - Winter 2015

Page 14: Lecture 24: Transactions (wrap up) Parallel Databases

1. Isolation Level: Dirty Reads

•  “Long duration” WRITE locks – Strict 2PL

•  No READ locks – Read-only transactions are never delayed

14

Possible problems: dirty and inconsistent reads

CSE 344 - Winter 2015

Page 15: Lecture 24: Transactions (wrap up) Parallel Databases

2. Isolation Level: Read Committed

•  “Long duration” WRITE locks – Strict 2PL

•  “Short duration” READ locks – Only acquire lock while reading (not 2PL)

15

Unrepeatable reads When reading same element twice, may get two different values

CSE 344 - Winter 2015

Page 16: Lecture 24: Transactions (wrap up) Parallel Databases

3. Isolation Level: Repeatable Read

•  “Long duration” WRITE locks – Strict 2PL

•  “Long duration” READ locks – Strict 2PL

16

This is not serializable yet !!!

Why ?

CSE 344 - Winter 2015

Page 17: Lecture 24: Transactions (wrap up) Parallel Databases

4. Isolation Level Serializable

•  “Long duration” WRITE locks – Strict 2PL

•  “Long duration” READ locks – Strict 2PL

•  Predicate locking – To deal with phantoms

17 CSE 344 - Winter 2015

Page 18: Lecture 24: Transactions (wrap up) Parallel Databases

Beware!

In commercial DBMSs: •  Default level is often NOT serializable •  Default level differs between DBMSs •  Some engines support subset of levels! •  Serializable may not be exactly ACID

–  Locking ensures isolation, not atomicity

•  Also, some DBMSs do NOT use locking and different isolation levels can lead to different pbs

•  Bottom line: Read the doc for your DBMS!

CSE 344 - Winter 2015 18

Page 19: Lecture 24: Transactions (wrap up) Parallel Databases

Parallel Databases

CSE 344 - Winter 2015 19

Page 20: Lecture 24: Transactions (wrap up) Parallel Databases

Why compute in parallel?

•  Most processors have multiple cores – Can run multiple jobs simultaneously

•  Natural extension of txn processing

•  Nice to get computation results earlier J

CSE 344 - Winter 2015 20

Page 21: Lecture 24: Transactions (wrap up) Parallel Databases

Cloud computing

Cloud computing commoditizes access to large clusters

–  Ten years ago, only Google could afford 1000 servers;

–  Today you can rent this from Amazon Web Services (AWS) Cheap!

21 CSE 344 - Winter 2015

Page 22: Lecture 24: Transactions (wrap up) Parallel Databases

Science is Facing a Data Deluge!

•  Astronomy: Large Synoptic Survey Telescope LSST: 30TB/night (high-resolution, high-frequency sky surveys)

•  Physics: Large Hadron Collider 25PB/year •  Biology: lab automation, high-throughput

sequencing •  Oceanography: high-resolution models, cheap

sensors, satellites •  Medicine: ubiquitous digital records, MRI,

ultrasound

22 CSE 344 - Winter 2015

Page 23: Lecture 24: Transactions (wrap up) Parallel Databases

Industry is Facing a Data Deluge!

Clickstreams, search logs, network logs, social networking data, RFID data, etc. •  Facebook:

–  15PB of data in 2010 –  60TB of new data every day

•  Google: –  In May 2010 processed 946PB of data using

MapReduce •  Twitter, Google, Microsoft, Amazon, Walmart,

etc.

23 CSE 344 - Winter 2015

Page 24: Lecture 24: Transactions (wrap up) Parallel Databases

Big Data

•  Companies, organizations, scientists have data that is too big, too fast, and too complex to be managed without changing tools and processes

•  Relational algebra and SQL are easy to parallelize and parallel DBMSs have already been studied in the 80's!

CSE 344 - Winter 2015 24

Page 25: Lecture 24: Transactions (wrap up) Parallel Databases

Data Analytics Companies As a result, we are seeing an explosion of and a huge success of db analytics companies

•  Greenplum founded in 2003 acquired by EMC in 2010; A parallel shared-nothing DBMS (this lecture)

•  Vertica founded in 2005 and acquired by HP in 2011; A parallel, column-store shared-nothing DBMS (see 444 for discussion of column-stores)

•  DATAllegro founded in 2003 acquired by Microsoft in 2008; A parallel, shared-nothing DBMS

•  Aster Data Systems founded in 2005 acquired by Teradata in 2011; A parallel, shared-nothing, MapReduce-based data processing system (next lecture). SQL on top of MapReduce

•  Netezza founded in 2000 and acquired by IBM in 2010. A parallel, shared-nothing DBMS.

Great time to be in the data management, data mining/statistics, or machine learning! 25

Page 26: Lecture 24: Transactions (wrap up) Parallel Databases

Two Kinds to Parallel Data Processing

•  Parallel databases, developed starting with the 80s (this lecture) – OLTP (Online Transaction Processing) – OLAP (Online Analytic Processing, or

Decision Support) •  MapReduce, first developed by Google,

published in 2004 (next lecture) – Mostly for Decision Support Queries

Today we see convergence of the two approaches (Greenplum,Dremmel) 26 CSE 344 - Winter 2015

Page 27: Lecture 24: Transactions (wrap up) Parallel Databases

Performance Metrics for Parallel DBMSs

P = the number of nodes (processors, computers) •  Speedup:

– More nodes, same data è higher speed •  Scaleup:

– More nodes, more data è same speed

•  OLTP: “Speed” = transactions per second (TPS) •  Decision Support: “Speed” = query time

CSE 344 - Winter 2015 27

Page 28: Lecture 24: Transactions (wrap up) Parallel Databases

Linear v.s. Non-linear Speedup

CSE 344 - Winter 2015

# nodes (=P)

Speedup

28

×1 ×5 ×10 ×15

Page 29: Lecture 24: Transactions (wrap up) Parallel Databases

Linear v.s. Non-linear Scaleup

# nodes (=P) AND data size

Batch Scaleup

×1 ×5 ×10 ×15

CSE 344 - Winter 2015 29

Ideal

Page 30: Lecture 24: Transactions (wrap up) Parallel Databases

Challenges to Linear Speedup and Scaleup

•  Startup cost – Cost of starting an operation on many nodes

•  Interference – Contention for resources between nodes

•  Skew – Slowest node becomes the bottleneck

CSE 344 - Winter 2015 30

Page 31: Lecture 24: Transactions (wrap up) Parallel Databases

Architectures for Parallel Databases

•  Shared memory

•  Shared disk

•  Shared nothing

CSE 344 - Winter 2015 31

Page 32: Lecture 24: Transactions (wrap up) Parallel Databases

Shared Memory

Interconnection Network

P P P

Global Shared Memory

D D D 32

Page 33: Lecture 24: Transactions (wrap up) Parallel Databases

Shared Disk

Interconnection Network

P P P

M M M

D D D 33

Page 34: Lecture 24: Transactions (wrap up) Parallel Databases

Shared Nothing

Interconnection Network

P P P

M M M

D D D 34

Page 35: Lecture 24: Transactions (wrap up) Parallel Databases

A Professional Picture…

35

From: Greenplum Database Whitepaper

SAN = “Storage Area Network”

CSE 344 - Winter 2015

Page 36: Lecture 24: Transactions (wrap up) Parallel Databases

Shared Memory •  Nodes share both RAM and disk •  Dozens to hundreds of processors

Example: SQL Server runs on a single machine and can leverage many threads to get a query to run faster (see query plans)

•  Easy to use and program •  But very expensive to scale: last remaining

cash cows in the hardware industry

CSE 344 - Winter 2015 36

Page 37: Lecture 24: Transactions (wrap up) Parallel Databases

Shared Disk •  All nodes access the same disks •  Found in the largest "single-box" (non-

cluster) multiprocessors

Oracle dominates this class of systems.

Characteristics: •  Also hard to scale past a certain point:

existing deployments typically have fewer than 10 machines

CSE 344 - Winter 2015 37

Page 38: Lecture 24: Transactions (wrap up) Parallel Databases

Shared Nothing •  Cluster of machines on high-speed network •  Called "clusters" or "blade servers” •  Each machine has its own memory and disk: lowest

contention. NOTE: Because all machines today have many cores and many disks, then shared-nothing systems typically run many "nodes” on a single physical machine.

Characteristics: •  Today, this is the most scalable architecture. •  Most difficult to administer and tune.

We discuss only Shared Nothing in class 38

Page 39: Lecture 24: Transactions (wrap up) Parallel Databases

Purchase

pid=pid

cid=cid

Customer

Product Purchase

pid=pid

cid=cid

Customer

Product

Approaches to Parallel Query Evaluation

•  Inter-query parallelism –  Transaction per node –  OLTP

•  Inter-operator parallelism –  Operator per node –  Both OLTP and Decision Support

•  Intra-operator parallelism –  Operator on multiple nodes –  Decision Support

CSE 344 - Winter 2015 We study only intra-operator parallelism: most scalable

Purchase

pid=pid

cid=cid

Customer

Product

Purchase

pid=pid

cid=cid

Customer

Product

Purchase

pid=pid

cid=cid

Customer

Product

39