Embarrassingly Scalable Database Systems
Anastasia Ailamaki
Data-Intensive Applications and Systems (DIAS)
Computer and Communication Sciences, EPFL
2
From Wikipedia—An embarrassingly parallel workload is one for which little or no effort is required to separate the problem into a number of parallel tasks. This is often the case where there exists no dependency (or communication) between those parallel tasks.
Parallelism = the way forward • Implicit parallelism
– Simple cores offer multiprogramming, pipelining – Sophisticated cores are superscalar, multithreaded
• Explicit parallelism – Many-chip machines – Many-core chips in many-chip machines
3 *core = processor
Where is parallelism?
[Figure: timeline 1970–2020: pipelining and ILP within a single core, then multi-core chips, then clusters and datacenters]
Objective: all processing smoothly exploits available parallelism
4
5
Scalability takes a LOT of effort
Contention-free workload!
Our systems should be future-proof
[Figure: throughput (TPS) and throughput per thread (TPS/thread) vs. concurrent threads (0–32) for Shore, BerkeleyDB, MySQL, PostgreSQL, and a commercial DBMS]
One-slide summary • New hardware: implicit AND explicit parallelism
– 1990: "parallelize as you go" – 2011: "parallelize as you go" × #ctx on chip × #chips
• Communication is no longer simple – 1990: "local" and "remote" – Today: "local", "not so local", "somewhat remote", …
• 1D philosophy (shared-nothing only, or shared-everything only) will no longer work
• Must adapt to available parallelism
6 Answer: embarrassingly scalable DBMS
7
An embarrassingly scalable system is one for which little or no effort is required to
perform proportionally on very small to very large numbers of hardware contexts.
Outline • Hardware evolution
– New hardware = new form of parallelism
• Efficient use of memory hierarchy • Keeping hardware contexts busy • Lessons for the future
8
Multiprocessor platforms
9
[Figure: 1970, a single machine (core, memory, disk); 1980, many such machines side by side]
Shared-nothing parallelism natural to database processing!
• Moore's law single-core performance – 2x faster cores every 18 months
• Instruction-Level Parallelism (ILP) – Pipelines, superscalar, OOO, branch prediction, overlapping cache misses
• Simultaneous multithreading – Implements threads in a superscalar processor
DB code: >60% read/write instructions, tight instruction dependencies [Ail99] 10
[Diagram: core, L2 cache, memory, disk]
90's: fine-grain, implicit parallelism
Implicit = parallelize-as-you-go
Gloomy news for database workloads: • Not much ILP opportunity • Hurt by growing processor/memory speed gap
11
12
[Figure: hardware contexts per chip (0–16) vs. year (1990–2010) for Pentium, Itanium, Intel Core2, UltraSparc, IBM Power, and AMD processors]
Chip multiprocessors (multicore) • Single-processor performance has stalled…
– Power, heat, design/verification complexity – Diminishing returns (esp. for DBMS!)
• … Moore's law has not – 2x transistors per 18-24 mo.
• Now: multicore – Slower, power-saving – Lots of cores, big caches – Throughput-oriented
[Diagram: two chips, each with multiple CPUs (private L1I/L1D caches) and a shared L2 cache, plus a shared L3 cache, main memory, and disk]
Today’s picture
Non-uniform cache access; exponentially many available hardware contexts 13
Multi-core technology trends
• Fat Camp (FC) – wide-issue, OOO, e.g., IBM Power5
• Lean Camp (LC) – in-order, multi-threaded, e.g., Sun UltraSparc T1
FC: parallelism within thread (ILP); LC: parallelism across threads
14
[Har07]
So how much do we use those cores?
15
[Figure: execution time breakdowns with some vs. all contexts busy: about 25% useful work and 65–70% waiting for the cache, versus a goal of 75% useful work and 10% waiting for the cache]
Efficient use of cache = maximize sharing; all contexts busy = parallelize
Outline • Hardware evolution • Efficient use of memory hierarchy
– Maximizing sharing potential
• Keeping hardware contexts busy • Lessons for the future
16
Cache-conscious algorithms • Minimize unnecessary trips to slow memory
– Data layout optimizations – Bunch-of-tuples-at-a-time query execution (sketch below)
• Hide impact of cache misses – New algorithms that trade accuracy for prefetching – Make common case (sorting, hashing, etc.) efficient
• Reduce dependencies/help prediction – Compiler-based techniques
17 Very important but not enough
[e.g. Ail01, Sto05, Bon05]
[e.g. Che04, Gho05]
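To make the bunch-of-tuples-at-a-time point concrete, here is a minimal generic sketch (not any particular engine's executor; names are illustrative): each operator consumes and produces whole blocks, so per-tuple call overhead and cache misses are amortized across the block.

    #include <vector>

    // A block holds one attribute's values for a few hundred/thousand tuples,
    // small enough to stay cache-resident while it is processed.
    using Block = std::vector<int>;

    // Block-at-a-time selection: one call filters a whole block instead of being
    // invoked once per tuple, keeping the instruction stream tight and predictable.
    Block select_greater_than(const Block& in, int threshold) {
        Block out;
        out.reserve(in.size());
        for (int v : in)
            if (v > threshold) out.push_back(v);
        return out;
    }

    // The next operator (e.g., an aggregate) again consumes the block in one call.
    long sum(const Block& in) {
        long s = 0;
        for (int v : in) s += v;
        return s;
    }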
• Queries handled by independent threads • Threads have large instruction/data footprint • Lots of interference at the memory/cache level
[Diagram: a database system with a thread pool; each query plan (scans, joins) runs in its own thread with no coordination]
Eliminate interference and expose locality
Running data analysis queries
Service-Oriented Architecture
Conventional: one monolithic server – request-level parallelism, very large footprint
vs.
SOA-style (staged) server: queries flow through stages (Stage 1, Stage 2, Stage 3) – many services, operator-level parallelism, much smaller footprint
Orthogonal to algorithmic optimizations
Longest anyone ever took to earn a PhD?
Average time to finish a PhD in CS?
= processing thread
20
“Classic” DB query engine
[Diagram: two query plans, each run by its own thread: scan(Student) and scan(Dept) feeding a join and an average, and scan(Student) feeding a max]
70% of execution time is data cache stalls
[Diagram: a dispatcher routes queries to per-operator stages (scan, join, average), each with its own queue of pending work]
Service-oriented approach
21 Maximum opportunity for sharing!
[Har05]
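A minimal sketch of the staged idea (illustrative only, under my own simplifications; QPipe's actual packet and dispatcher machinery [Har05] is richer): each relational operator becomes a stage with its own work queue and worker thread, so queries are decomposed into packets that meet, and can share, at the operator they need.

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // One stage per operator (scan, join, aggregate, ...): a queue of work packets
    // served by the stage's own worker thread. Queries are split into packets and
    // routed to the stage they need, which localizes each operator's code and data.
    class Stage {
        std::string name_;                                  // e.g., "scan", "join"
        std::queue<std::function<void()>> packets_;
        std::mutex m_;
        std::condition_variable cv_;
        std::thread worker_;
        bool stop_ = false;
    public:
        explicit Stage(std::string name) : name_(std::move(name)) {
            worker_ = std::thread([this] {
                std::unique_lock<std::mutex> lk(m_);
                for (;;) {
                    cv_.wait(lk, [this] { return stop_ || !packets_.empty(); });
                    if (packets_.empty()) return;           // stopped and drained
                    auto work = std::move(packets_.front()); packets_.pop();
                    lk.unlock(); work(); lk.lock();         // run the packet outside the latch
                }
            });
        }
        void enqueue(std::function<void()> packet) {        // called by the dispatcher
            { std::lock_guard<std::mutex> g(m_); packets_.push(std::move(packet)); }
            cv_.notify_one();
        }
        ~Stage() {
            { std::lock_guard<std::mutex> g(m_); stop_ = true; }
            cv_.notify_all();
            worker_.join();
        }
    };

A dispatcher (not shown) would create one Stage per operator and route each query's scan, join, and aggregate packets to the corresponding stage.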
Work sharing example
[Diagram: the same two query plans as before, now with the scan of Student shared between them]
I/O bound on uniprocessor: >2x speedup 22
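What sharing the Student scan might look like, as a heavily simplified sketch under my own assumptions (a real engine must also deal with consumers that attach mid-scan): queries that arrive while a table scan is in flight attach as extra consumers, so the table is read once on behalf of all of them.

    #include <functional>
    #include <mutex>
    #include <vector>

    struct Tuple { /* attribute values */ };

    // One shared scan per table: late-arriving queries attach a consumer callback
    // instead of starting their own scan, so the (I/O-bound) pass over the data
    // is paid once for the whole group of queries.
    class SharedScan {
        std::mutex m_;
        std::vector<std::function<void(const Tuple&)>> consumers_;
    public:
        void attach(std::function<void(const Tuple&)> consumer) {
            std::lock_guard<std::mutex> g(m_);
            consumers_.push_back(std::move(consumer));
            // (A real engine must also replay or re-scan the pages this consumer missed.)
        }
        void run(const std::vector<Tuple>& table) {
            for (const Tuple& t : table) {
                std::lock_guard<std::mutex> g(m_);
                for (auto& c : consumers_) c(t);   // every attached query sees the tuple
            }
        }
    };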
23
To share, or not to share?
[Figure: speedup from work sharing (read-only queries), 0.0–2.0, vs. number of shared queries (0–45), on 1 CPU and on 8 CPUs]
Great.
Now let’s run on 8 processors.
Ouch.
How can sharing destroy parallelism?
[Jon07]
24
Work sharing in the critical path
[Diagram: operator schedules (Scan, Join, Aggregate) for Query 1 and Query 2 executed independently; each query's critical path determines its response time; available parallelism P = 4.33]
25
Work sharing lengthens the critical path
[Diagram: with work shared between Query 1 and Query 2, the critical path grows and Query 2's response time pays a penalty; available parallelism drops from P = 4.33 to P = 2.75]
Work sharing eliminated 60% of work but reduced available parallelism by 1.6x
26
Predicting critical paths • Work sharing trade-off
Perf = f(1 / |Work|, 1 / |CPath|)
• Model-guided sharing – Predict impact of sharing – Identify bad combinations – Inform work sharing policy (see the sketch below)
[Figure: queries/min (0–250) vs. query mix ratio, from share-haters through 1:1 to share-lovers, for always-share, never-share, and balanced policies]
Balance between sharing and parallelism
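One concrete way to read the trade-off (a hedged sketch of the intuition only, not the model of [Jon07]): estimate a plan's response time as roughly max(critical path, total work / contexts), and share only when the work saved outweighs the critical path it adds.

    #include <algorithm>

    // Crude response-time estimate for a plan on n hardware contexts: it can finish
    // no faster than its critical path, and no faster than its total work spread
    // perfectly over n contexts.
    double estimated_time(double total_work, double critical_path, int n) {
        return std::max(critical_path, total_work / n);
    }

    // Share two queries only if the combined plan is predicted to finish sooner
    // than running them without sharing on the same contexts.
    bool should_share(double work_solo, double cpath_solo,
                      double work_shared, double cpath_shared, int n) {
        return estimated_time(work_shared, cpath_shared, n)
             < estimated_time(work_solo, cpath_solo, n);
    }

With one context the estimate reduces to total work, so removing work always helps; with many contexts the critical path dominates, which is consistent with sharing helping on 1 CPU but hurting on 8.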
Summary: Implicit parallelism
• WE NEED cache-conscious query processing – To exploit instruction-level parallelism
• Create sharing opportunities – Share data, instructions, and work
• But NEVER lengthen the critical path – Trade sharing for parallelism
• Program with scalability in mind – Think global / act local
27
Outline
• Hardware evolution • Efficient use of memory hierarchy • Keeping hardware contexts busy
– Turn concurrency into parallelism
• Lessons for the future
28
29
On-line transaction processing
Concurrency != parallelism
[Figure: throughput per thread (tps/thread, log scale 0.1–10) vs. concurrent threads (0–32) for Shore, BerkeleyDB, MySQL, PostgreSQL, and a commercial DBMS]
Contention-free workload!
[Jon09a]
30
Amdahl’s Law
Bitten by Amdahl's law
The maximum benefit from a parallel system is given by
scaleup = time_old / time_new ≤ 1 / ((1 - p) + p/N)
where p = parallel fraction of work, N = hardware parallelism
p = 92%: scaleup = 1 / ((1 - 0.92) + 0.92/32) = 9.19
[Figure: scaleup for N = 32 vs. degree of serialization (1 - p), 0%–100%, with PostgreSQL, MySQL, and BerkeleyDB marked (measured serial fractions of 8%, 59%, and 80%)]
Even a little serial code hurts a lot!
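To make the arithmetic explicit, a tiny sketch that evaluates the same bound for the serial fractions quoted above:

    #include <cstdio>
    #include <initializer_list>

    // Amdahl's law: scaleup <= 1 / ((1 - p) + p/N), where p is the parallel fraction.
    double amdahl_scaleup(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

    int main() {
        const double N = 32;                        // hardware contexts, as on the slide
        for (double serial : {0.08, 0.59, 0.80})    // serial fractions (1 - p) from the slide
            std::printf("1-p = %2.0f%% -> max scaleup on %.0f contexts = %.2f\n",
                        serial * 100, N, amdahl_scaleup(1.0 - serial, N));
        // prints 9.19, 1.66, and 1.24: even a little serial code hurts a lot
    }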
Shared-everything: lots of critical sections protect shared data
[Diagram: a multicore, multi-CPU machine]
[Figure: critical sections per transaction (0–80) for Shared-Everything, DORA, and PLP, broken down into lock manager, page latches, log manager, transaction manager, message passing, and other]
[Jon08, Jon09]
Transaction processing engine
Locking logical entities (e.g., records)
[Diagram: shared data holding records John, Anne, Chris, Niki, accessed by concurrent threads]
Locking = serial code 32
Typical Lock Manager
[Diagram: a lock hash table of lock heads (L1, L2), each with a queue of lock requests in EX mode; each transaction (T1) also keeps its own list of lock requests]
33
Time inside the lock manager (Sun Niagara T2, TPC-B)
34
[Figure: time breakdown (%) inside the lock manager vs. number of hardware contexts (1–64): lock acquire, contended acquire, release, and contended release]
Higher HW parallelism → longer request queues → longer CSs → higher contention
Unpredictable access patterns
35 Data partitioning?
Transaction Processing
[Diagram: database records physically split across nodes]
Shared-Nothing: Physical partitioning
Pros: explicit contention control; no logging, locking, latching
Cons: physically separated data – distributed transactions, high repartitioning cost, very sensitive to skew, redundancy (memory pressure)
• Partitioning 1024-way??
• Concurrency control? Multicore multisocket machine
[Diagram: two multicore CPUs]
36 Can we make shared-everything scale?
[Sto07, Dew90]
[Cur10]
[Jones10]
Shared-everything – Logical partitioning
[Diagram: the data logically split into four partitions (John, Anne, Chris, Niki)]
Move contention away from the critical path
37
Data-Oriented Architecture (DORA) • Shared-everything – Logically Partitioned
– Got rid of the centralized lock manager (routing sketch below)
– Very fast repartitioning against load imbalances
– Still contention at the physical layers
[Figure: a 16-core chip, and critical sections per transaction (0–80) for Shared-Everything, DORA, and PLP; DORA removes the lock manager component]
[Pan10]
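A minimal sketch of the logical-partitioning idea (my own simplification, not the DORA implementation of [Pan10]): each worker thread owns a range of the key space, and actions on a key are routed to the owner's queue instead of competing in a centralized lock manager; repartitioning is just an update of the routing table, since the data itself stays shared.

    #include <cstdint>
    #include <vector>

    // Each logical partition (a key range) is owned by exactly one worker thread.
    // A transaction touching key k queues that action at owner_of(k) instead of
    // taking a lock in a centralized lock manager.
    struct Partition { int owner_thread; std::uint64_t lo_key, hi_key; };

    class PartitionMap {
        std::vector<Partition> parts_;             // non-overlapping key ranges
    public:
        explicit PartitionMap(std::vector<Partition> p) : parts_(std::move(p)) {}

        int owner_of(std::uint64_t key) const {
            for (const Partition& p : parts_)
                if (key >= p.lo_key && key <= p.hi_key) return p.owner_thread;
            return -1;                             // unmapped: would trigger repartitioning
        }
        // Repartitioning against load imbalance is just an update of this table;
        // the records themselves stay in place (shared-everything underneath).
    };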
Predictable access patterns
39
[Figure: throughput (kTps, 0–120) vs. real CPU load (%, 0–100)]
Looming problem: physical page latches
Page latch contention • Shared-everything – Physiological Partitioning (PLP)
– Eliminates most of the contention at the physical layers – Fast repartitioning
[Figure: a 16-core chip, and critical sections per transaction (0–80) for Shared-Everything, DORA, and PLP; PLP also eliminates most of the page latch component]
[Pan11]
41
Logging is crucial for OLTP
• Transactions must – Write a log record describing every update – When ready to commit, write the log to disk!
• Great for single-thread performance – But not scalable! – Compromise performance or recoverability
42 * http://www.datacenterknowledge.com/archives/2010/05/13/car-crash-triggers-amazon-power-outage/
(e.g., Amazon outage*)
$$$
Need efficient and scalable logging solution
Why logging hurts scalability
• Working around the bottlenecks: – Asynchronous commit – Replace logging with replication and fail-over
43
(1) At commit, must yield for log flush: synchronous I/O on the critical path, locks held for a long time, two context switches per commit
(2) Must insert records into the log buffer: a centralized main-memory structure and a source of contention
[Diagram: CPU-1 … CPU-N with private caches; data and log in RAM; the log flushed to HDD]
Workarounds compromise durability
44
Attempts to scale Shore’s logmgr
[Figure: throughput (ktps, 0–12) vs. concurrent threads (0–32) for the baseline and for T&T&S-mutex and MCS-mutex variants of Shore's log manager]
Cannot scale by improving single-thread performance
[Jon10a]
Does “correct” logging have to be so slow?
• Locks held for a long time – Not actually used during the flush – Indirect way to enforce isolation (Early Lock Release)
• Two context switches per commit – Transactions nearly stateless at commit time – Easy to migrate transactions between threads
• Log buffer is a source of contention – Log orders incoming requests, not threads – Log records can be combined
45
Compose scalability by solving each problem
[Jon10]
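A minimal sketch of the commit path these observations suggest (hedged; the interfaces below are hypothetical stand-ins, not Shore or Aether APIs): locks are released as soon as the commit record is in the log buffer, and the acknowledgment to the client is held back until that record is durable.

    #include <cstdint>

    // Hypothetical interfaces, for illustration only.
    struct LogBuffer {
        std::uint64_t next_lsn = 0, durable_lsn = 0;
        std::uint64_t append_commit_record(int /*xct_id*/) { return ++next_lsn; }  // buffered, not yet durable
        void flush_up_to(std::uint64_t lsn) { if (durable_lsn < lsn) durable_lsn = lsn; }  // stand-in for the I/O
    };
    struct LockTable { void release_all(int /*xct_id*/) {} };
    struct Client    { void acknowledge_commit() {} };

    // Early Lock Release: the log flush is taken off the lock-holding critical path.
    void commit(int xct_id, LogBuffer& log, LockTable& locks, Client& client) {
        std::uint64_t commit_lsn = log.append_commit_record(xct_id); // (1) commit record buffered
        locks.release_all(xct_id);      // (2) locks released *before* the flush; dependent
                                        //     transactions proceed but are ordered after us in the log
        log.flush_up_to(commit_lsn);    // (3) wait (or hand off) until the record is durable
        client.acknowledge_commit();    // (4) only now does the user see the commit
    }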
[Legend: mutex held, start/finish, copy into buffer, waiting]
Alleviating Contention
46
(B) Baseline
(C) Consolidation array
(D) Decoupled buffer insert
(CD) All together
contention(work) = O(1); contention(# threads) = O(1)
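A heavily simplified sketch of the consolidation idea (assumptions and names are mine; the real consolidation array in Aether is lock-free and more subtle): threads that find a group open join it, and the group's leader pays for the log-buffer mutex once on behalf of everyone, so contention stops growing with the number of threads.

    #include <condition_variable>
    #include <memory>
    #include <mutex>

    // Threads that arrive while a group is open join it; the group's leader acquires
    // the contended log-buffer mutex once and reserves space for the whole group.
    class ConsolidatedLogReserve {
        struct Group {
            long total_bytes = 0;
            long base_offset = -1;                 // published by the leader
            std::condition_variable published;
        };
        std::mutex group_mutex_;                   // cheap: protects the open group
        std::shared_ptr<Group> open_;              // group currently accepting joiners
        std::mutex buffer_mutex_;                  // expensive: the contended log buffer
        long buffer_tail_ = 0;

    public:
        // Returns the buffer offset where the caller may copy its log record.
        long reserve(long bytes) {
            std::unique_lock<std::mutex> lk(group_mutex_);
            if (open_) {                           // follower: join the group and wait
                std::shared_ptr<Group> g = open_;
                long my_offset = g->total_bytes;
                g->total_bytes += bytes;
                g->published.wait(lk, [&] { return g->base_offset >= 0; });
                return g->base_offset + my_offset;
            }
            // Leader: open a group, let others join while we queue for the buffer,
            // then reserve space for the whole group in one critical section.
            auto g = std::make_shared<Group>();
            g->total_bytes = bytes;
            open_ = g;
            lk.unlock();
            std::lock_guard<std::mutex> buf(buffer_mutex_);
            lk.lock();
            open_.reset();                         // close the group to new joiners
            g->base_offset = buffer_tail_;
            buffer_tail_ += g->total_bytes;
            g->published.notify_all();
            return g->base_offset;                 // leader's record goes first
        }
    };

Each thread then copies its record into its reserved slot outside the mutex, which corresponds to the decoupled-insert half of the hybrid (CD) scheme.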
Performance as contention increases
47
[Figure: log insert rate (GB/s, log scale 0.01–10) vs. number of threads (1–64) for Baseline, Decoupled (D), Consolidation (C), and Hybrid (CD)]
Hybrid solution combines benefits of both
48
Log redesign = scalability
[Figure: throughput (ktps, 0–12) vs. concurrent threads (0–32) for Baseline, T&T&S mutex, MCS mutex, and Aether]
Scalability >> performance
How far can we go?
49
Scalability implies performance!
Sun Niagara T1, insert-only workload
[Figure: throughput per thread (tps/thread, log scale 0.1–10) vs. concurrent threads (0–32) for Shore-MT*, Shore, and a commercial DBMS, on a two-chip multicore machine]
*Shore-MT available at dias.epfl.ch
[Jon09]
50
Summary: Explicit parallelism • Keeping hardware contexts busy
– There's no escaping from Amdahl – Make it scale if you want it to run fast
• Partitioning eliminates contention – Shared-nothing carries overhead – Shared-everything made fast with logical partitioning – Employ shared-everything on shared-nothing "islands"
• Concurrency ≠ parallelism – Find the right dimension / decouple logically unrelated operations
51
Future: The rise of the power wall • ILP era (ca. 1990)
• Multicore era (ca. 2000+) • Heterogeneous era (e.g., AMD Fusion)
Think global, act local
[Diagram: heterogeneous chips combining CPUs, NPUs, GPUs, and FPGAs, each with its own cache]
[Mul09] [He08] [Gol05]
Thank you!
Mike Carey, Alkis Polyzotis
Divesh Srivastava
Special thanks to…
for their comments!
References - I
[Ail99] A. Ailamaki, D. J. DeWitt, M. D. Hill, D. A. Wood: DBMSs on a Modern Processor: Where Does Time Go? VLDB 1999
[Ail01] A. Ailamaki, D. J. DeWitt, M. D. Hill, M. Skounakis: Weaving Relations for Cache Performance. VLDB 2001
[Bon05] P. Boncz, M. Zukowski, N. Nes: MonetDB/X100: Hyper-Pipelining Query Execution. CIDR 2005
[Che04] S. Chen, A. Ailamaki, P. B. Gibbons, T. C. Mowry: Improving Hash Join Performance through Prefetching. ICDE 2004
[Dew90] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H. Hsiao, R. Rasmussen: The Gamma Database Machine Project. IEEE TKDE 1990
[Gho05] A. Ghoting, G. Buehrer, S. Parthasarathy, D. Kim, A. Nguyen, Y. Chen, P. Dubey: Cache-conscious Frequent Pattern Mining on a Modern Processor. VLDB 2005
[Gol05] B. T. Gold, A. Ailamaki, L. Huston, B. Falsafi: Accelerating Database Operations Using a Network Processor. DaMoN 2005
[Har07] N. Hardavellas, I. Pandis, R. Johnson, N. Mancheril, A. Ailamaki, B. Falsafi: Database Servers on Chip Multiprocessors: Limitations and Opportunities. CIDR 2007
53
References - II
[Har08] S. Harizopoulos, D. J. Abadi, S. Madden, M. Stonebraker: OLTP Through the Looking Glass, and What We Found There. SIGMOD 2008
[Har05] S. Harizopoulos, V. Shkapenyuk, A. Ailamaki: QPipe: A Simultaneously Pipelined Relational Query Engine. SIGMOD 2005
[He08] B. He, K. Yang, R. Fang, M. Lu, N. K. Govindaraju, Q. Luo, P. V. Sander: Relational Joins on Graphics Processors. SIGMOD 2008
[Jon07] R. Johnson, N. Hardavellas, I. Pandis, N. Mancheril, S. Harizopoulos, K. Sabirli, A. Ailamaki, B. Falsafi: To Share or Not To Share? VLDB 2007
[Jon08] R. Johnson, I. Pandis, A. Ailamaki: Critical Sections: Re-emerging Scalability Concerns for Database Storage Engines. DaMoN 2008
[Jon09] R. Johnson, I. Pandis, A. Ailamaki: Improving OLTP Scalability Using Speculative Lock Inheritance. VLDB 2009
[Jon09a] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, B. Falsafi: Shore-MT: A Scalable Storage Manager for the Multicore Era. EDBT 2009
[Jon10] R. Johnson, I. Pandis, R. Stoica, M. Athanassoulis, A. Ailamaki: Aether: A Scalable Approach to Logging. PVLDB 2010
54
References - III
[Jon10a] R. Johnson, R. Stoica, A. Ailamaki, T. C. Mowry: Decoupling Contention Management from Scheduling. ASPLOS 2010
[Mul09] R. Müller, J. Teubner, G. Alonso: Data Processing on FPGAs. PVLDB 2009
[Pan10] I. Pandis, R. Johnson, N. Hardavellas, A. Ailamaki: Data-Oriented Transaction Execution. PVLDB 2010
[Pan11] I. Pandis, P. Tözün, R. Johnson, A. Ailamaki: PLP: Page Latch-free Shared-everything OLTP. Technical Report, EPFL DIAS, 2011 (available upon request)
[Sto05] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, S. B. Zdonik: C-Store: A Column-oriented DBMS. VLDB 2005
[Sto07] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, P. Helland: The End of an Architectural Era (It's Time for a Complete Rewrite). VLDB 2007
[Cur10] C. Curino, Y. Zhang, E. P. C. Jones, S. Madden: Schism: A Workload-Driven Approach to Database Replication and Partitioning. PVLDB 2010
[Jones10] E. P. C. Jones, D. J. Abadi, S. Madden: Low Overhead Concurrency Control for Partitioned Main Memory Databases. SIGMOD 2010
55