Embarrassingly Scalable Database Systems
Anastasia Ailamaki
Data-Intensive Applications and Systems (DIAS)
Computer and Communication Sciences, EPFL
2
From Wikipedia—An embarrassingly parallel workload is one for which little or no effort is required to separate the problem into a number of parallel tasks. This is often the case where there exists no dependency (or communication) between those parallel tasks.
Parallelism = the way forward • Implicit parallelism
– Simple cores offer multiprogramming, pipelining – Sophisticated cores are superscalar, multithreaded
• Explicit parallelism – Many-chip machines – Many-core chips in many-chip machines
3 *core = processor
Where is parallelism?
[Figure: timeline 1970–2020: pipelining and ILP within a single core, then multi-core chips, then clusters and datacenters]
Objective: all processing smoothly exploits available parallelism
4
5
Scalability takes a LOT of effort
Contention-free workload!
Our systems should be future-proof
[Figure: throughput (TPS) and throughput per thread (TPS/thread) vs. concurrent threads (0–32) for Shore, BerkeleyDB, MySQL, PostgreSQL, and a commercial DBMS]
One-slide summary • New hardware: implicit AND explicit parallelism
– 1990: "parallelize as you go" – 2011: "parallelize as you go" × #ctx on chip × #chips
• Communication is no longer simple – 1990: "local" and "remote" – Today: "local", "not so local", "somewhat remote", …
• 1D philosophy (shared-nothing only, or shared-everything only) will no longer work
• Must adapt to available parallelism
6 Answer: embarrassingly scalable DBMS
7
An embarrassingly scalable system is one for which little or no effort is required to
perform proportionally on very small to very large numbers of hardware contexts.
Outline • Hardware evolution
– New hardware = new form of parallelism
• Efficient use of memory hierarchy • Keeping hardware contexts busy • Lessons for the future
8
Multiprocessor platforms
9
[Figure: 1970, a single machine (core, memory, disk); 1980, many such machines side by side]
Shared-nothing parallelism natural to database processing!
• Moore's law single-core performance – 2x faster cores every 18 months
• Instruction-Level Parallelism (ILP) – Pipelines, superscalar, OOO, branch prediction, overlapping cache misses
• Simultaneous multithreading – Implements threads in a superscalar processor
DB code: >60% read/write instructions, tight instruction dependencies [Ail99] 10
[Diagram: core, L2 cache, memory, disk]
90's: fine-grain, implicit parallelism
Implicit = parallelize-as-you-go
Gloomy news for database workloads: • Not much ILP opportunity • Hurt by growing processor/memory speed gap
11
12
[Figure: hardware contexts per chip (0–16) vs. year (1990–2010) for Pentium, Itanium, Intel Core2, UltraSparc, IBM Power, and AMD processors]
Chip multiprocessors (multicore) • Single-processor performance has stalled…
– Power, heat, design/verification complexity – Diminishing returns (esp. for DBMS!)
• … Moore's law has not – 2x transistors per 18-24 mo.
• Now: multicore – Slower, power-saving – Lots of cores, big caches – Throughput-oriented
[Diagram: two chips, each with multiple CPUs (private L1I/L1D caches) and a shared L2 cache, plus a shared L3 cache, main memory, and disk]
Today’s picture
Non-uniform cache access; exponentially many available hardware contexts 13
Multi-core technology trends
• Fat Camp (FC) – wide-issue, OOO, e.g., IBM Power5
• Lean Camp (LC) – in-order, multi-threaded, e.g., Sun UltraSparc T1
FC: parallelism within thread (ILP); LC: parallelism across threads
14
[Har07]
So how much do we use those cores?
15
[Figure: execution time breakdowns with some vs. all contexts busy: about 25% useful work and 65–70% waiting for the cache, versus a goal of 75% useful work and 10% waiting for the cache]
Efficient use of cache = maximize sharing; all contexts busy = parallelize
Outline • Hardware evolution • Efficient use of memory hierarchy
– Maximizing sharing potential
• Keeping hardware contexts busy • Lessons for the future
16
Cache-conscious algorithms • Minimize unnecessary trips to slow memory
– Data layout optimizations – Bunch-of-tuples-at-a-time query execution (sketch below)
• Hide impact of cache misses – New algorithms that trade accuracy for prefetching – Make common case (sorting, hashing, etc.) efficient
• Reduce dependencies/help prediction – Compiler-based techniques
17 Very important but not enough
[e.g. Ail01, Sto05, Bon05]
[e.g. Che04, Gho05]
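To make the bunch-of-tuples-at-a-time point concrete, here is a minimal generic sketch (not any particular engine's executor; names are illustrative): each operator consumes and produces whole blocks, so per-tuple call overhead and cache misses are amortized across the block.

    #include <vector>

    // A block holds one attribute's values for a few hundred/thousand tuples,
    // small enough to stay cache-resident while it is processed.
    using Block = std::vector<int>;

    // Block-at-a-time selection: one call filters a whole block instead of being
    // invoked once per tuple, keeping the instruction stream tight and predictable.
    Block select_greater_than(const Block& in, int threshold) {
        Block out;
        out.reserve(in.size());
        for (int v : in)
            if (v > threshold) out.push_back(v);
        return out;
    }

    // The next operator (e.g., an aggregate) again consumes the block in one call.
    long sum(const Block& in) {
        long s = 0;
        for (int v : in) s += v;
        return s;
    }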
• Queries handled by independent threads • Threads have large instruction/data footprint • Lots of interference at the memory/cache level
[Diagram: a database system with a thread pool; each query plan (scans, joins) runs in its own thread with no coordination]
Eliminate interference and expose locality
Running data analysis queries
Service-Oriented Architecture
Conventional: one monolithic server – request-level parallelism, very large footprint
vs.
SOA-style (staged) server: queries flow through stages (Stage 1, Stage 2, Stage 3) – many services, operator-level parallelism, much smaller footprint
Orthogonal to algorithmic optimizations
Longest anyone ever took to earn a PhD?
Average time to finish a PhD in CS?
= processing thread
20
“Classic” DB query engine
[Diagram: two query plans, each run by its own thread: scan(Student) and scan(Dept) feeding a join and an average, and scan(Student) feeding a max]
70% of execution time is data cache stalls
[Diagram: a dispatcher routes queries to per-operator stages (scan, join, average), each with its own queue of pending work]
Service-oriented approach
21 Maximum opportunity for sharing!
[Har05]
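A minimal sketch of the staged idea (illustrative only, under my own simplifications; QPipe's actual packet and dispatcher machinery [Har05] is richer): each relational operator becomes a stage with its own work queue and worker thread, so queries are decomposed into packets that meet, and can share, at the operator they need.

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // One stage per operator (scan, join, aggregate, ...): a queue of work packets
    // served by the stage's own worker thread. Queries are split into packets and
    // routed to the stage they need, which localizes each operator's code and data.
    class Stage {
        std::string name_;                                  // e.g., "scan", "join"
        std::queue<std::function<void()>> packets_;
        std::mutex m_;
        std::condition_variable cv_;
        std::thread worker_;
        bool stop_ = false;
    public:
        explicit Stage(std::string name) : name_(std::move(name)) {
            worker_ = std::thread([this] {
                std::unique_lock<std::mutex> lk(m_);
                for (;;) {
                    cv_.wait(lk, [this] { return stop_ || !packets_.empty(); });
                    if (packets_.empty()) return;           // stopped and drained
                    auto work = std::move(packets_.front()); packets_.pop();
                    lk.unlock(); work(); lk.lock();         // run the packet outside the latch
                }
            });
        }
        void enqueue(std::function<void()> packet) {        // called by the dispatcher
            { std::lock_guard<std::mutex> g(m_); packets_.push(std::move(packet)); }
            cv_.notify_one();
        }
        ~Stage() {
            { std::lock_guard<std::mutex> g(m_); stop_ = true; }
            cv_.notify_all();
            worker_.join();
        }
    };

A dispatcher (not shown) would create one Stage per operator and route each query's scan, join, and aggregate packets to the corresponding stage.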
Work sharing example
[Diagram: the same two query plans as before, now with the scan of Student shared between them]
I/O bound on uniprocessor: >2x speedup 22
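What sharing the Student scan might look like, as a heavily simplified sketch under my own assumptions (a real engine must also deal with consumers that attach mid-scan): queries that arrive while a table scan is in flight attach as extra consumers, so the table is read once on behalf of all of them.

    #include <functional>
    #include <mutex>
    #include <vector>

    struct Tuple { /* attribute values */ };

    // One shared scan per table: late-arriving queries attach a consumer callback
    // instead of starting their own scan, so the (I/O-bound) pass over the data
    // is paid once for the whole group of queries.
    class SharedScan {
        std::mutex m_;
        std::vector<std::function<void(const Tuple&)>> consumers_;
    public:
        void attach(std::function<void(const Tuple&)> consumer) {
            std::lock_guard<std::mutex> g(m_);
            consumers_.push_back(std::move(consumer));
            // (A real engine must also replay or re-scan the pages this consumer missed.)
        }
        void run(const std::vector<Tuple>& table) {
            for (const Tuple& t : table) {
                std::lock_guard<std::mutex> g(m_);
                for (auto& c : consumers_) c(t);   // every attached query sees the tuple
            }
        }
    };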
23
To share, or not to share?
[Figure: speedup from work sharing (read-only queries), 0.0–2.0, vs. number of shared queries (0–45), on 1 CPU and on 8 CPUs]
Great.
Now let’s run on 8 processors.
Ouch.
How can sharing destroy parallelism?
[Jon07]
24
Work sharing in the critical path
[Diagram: operator schedules (Scan, Join, Aggregate) for Query 1 and Query 2 executed independently; each query's critical path determines its response time; available parallelism P = 4.33]
25
Work sharing lengthens the critical path
[Diagram: with work shared between Query 1 and Query 2, the critical path grows and Query 2's response time pays a penalty; available parallelism drops from P = 4.33 to P = 2.75]
Work sharing eliminated 60% of work but reduced available parallelism by 1.6x
26
Predicting critical paths • Work sharing trade-off
Perf = f(1 / |Work|, 1 / |CPath|)
• Model-guided sharing – Predict impact of sharing – Identify bad combinations – Inform work sharing policy (see the sketch below)
[Figure: queries/min (0–250) vs. query mix ratio, from share-haters through 1:1 to share-lovers, for always-share, never-share, and balanced policies]
Balance between sharing and parallelism
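One concrete way to read the trade-off (a hedged sketch of the intuition only, not the model of [Jon07]): estimate a plan's response time as roughly max(critical path, total work / contexts), and share only when the work saved outweighs the critical path it adds.

    #include <algorithm>

    // Crude response-time estimate for a plan on n hardware contexts: it can finish
    // no faster than its critical path, and no faster than its total work spread
    // perfectly over n contexts.
    double estimated_time(double total_work, double critical_path, int n) {
        return std::max(critical_path, total_work / n);
    }

    // Share two queries only if the combined plan is predicted to finish sooner
    // than running them without sharing on the same contexts.
    bool should_share(double work_solo, double cpath_solo,
                      double work_shared, double cpath_shared, int n) {
        return estimated_time(work_shared, cpath_shared, n)
             < estimated_time(work_solo, cpath_solo, n);
    }

With one context the estimate reduces to total work, so removing work always helps; with many contexts the critical path dominates, which is consistent with sharing helping on 1 CPU but hurting on 8.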
Summary: Implicit parallelism
• WE NEED cache-conscious query processing – To exploit instruction-level parallelism
• Create sharing opportunities – Share data, instructions, and work
• But NEVER lengthen the critical path – Trade sharing for parallelism
• Program with scalability in mind – Think global / act local
27
Outline
• Hardware evolution • Efficient use of memory hierarchy • Keeping hardware contexts busy
– Turn concurrency into parallelism
• Lessons for the future
28
29
On-line transaction processing
Concurrency != parallelism
[Figure: throughput per thread (tps/thread, log scale 0.1–10) vs. concurrent threads (0–32) for Shore, BerkeleyDB, MySQL, PostgreSQL, and a commercial DBMS]
Contention-free workload!
[Jon09a]
30
Amdahl’s Law
Bitten by Amdahl's law
The maximum benefit from a parallel system is given by
scaleup = time_old / time_new ≤ 1 / ((1 - p) + p/N)
where p = parallel fraction of work, N = hardware parallelism
p = 92%: scaleup = 1 / ((1 - 0.92) + 0.92/32) = 9.19
[Figure: scaleup for N = 32 vs. degree of serialization (1 - p), 0%–100%, with PostgreSQL, MySQL, and BerkeleyDB marked (measured serial fractions of 8%, 59%, and 80%)]
Even a little serial code hurts a lot!
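To make the arithmetic explicit, a tiny sketch that evaluates the same bound for the serial fractions quoted above:

    #include <cstdio>
    #include <initializer_list>

    // Amdahl's law: scaleup <= 1 / ((1 - p) + p/N), where p is the parallel fraction.
    double amdahl_scaleup(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

    int main() {
        const double N = 32;                        // hardware contexts, as on the slide
        for (double serial : {0.08, 0.59, 0.80})    // serial fractions (1 - p) from the slide
            std::printf("1-p = %2.0f%% -> max scaleup on %.0f contexts = %.2f\n",
                        serial * 100, N, amdahl_scaleup(1.0 - serial, N));
        // prints 9.19, 1.66, and 1.24: even a little serial code hurts a lot
    }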
Shared-everything: lots of critical sections protect shared data
[Diagram: a multicore, multi-CPU machine]
[Figure: critical sections per transaction (0–80) for Shared-Everything, DORA, and PLP, broken down into lock manager, page latches, log manager, transaction manager, message passing, and other]
[Jon08, Jon09]
Transaction processing engine
Locking logical entities (e.g., records)
[Diagram: shared data holding records John, Anne, Chris, Niki, accessed by concurrent threads]
Locking = serial code 32
Typical Lock Manager
[Diagram: a lock hash table of lock heads (L1, L2), each with a queue of lock requests in EX mode; each transaction (T1) also keeps its own list of lock requests]
33
Time inside the lock manager (Sun Niagara T2, TPC-B)
34
[Figure: time breakdown (%) inside the lock manager vs. number of hardware contexts (1–64): lock acquire, contended acquire, release, and contended release]
Higher HW parallelism → longer request queues → longer CSs → higher contention
Unpredictable access patterns
35 Data partitioning?
Transaction Processing
[Diagram: database records physically split across nodes]
Shared-Nothing: Physical partitioning
Pros: explicit contention control; no logging, locking, latching
Cons: physically separated data – distributed transactions, high repartitioning cost, very sensitive to skew, redundancy (memory pressure)
• Partitioning 1024-way??
• Concurrency control? Multicore multisocket machine
[Diagram: two multicore CPUs]
36 Can we make shared-everything scale?
[Sto07, Dew90]
[Cur10]
[Jones10]
Shared-everything – Logical partitioning
[Diagram: the data logically split into four partitions (John, Anne, Chris, Niki)]
Move contention away from the critical path
37
Data-Oriented Architecture (DORA) • Shared-everything – Logically Partitioned
– Got rid of the centralized lock manager (routing sketch below)
– Very fast repartitioning against load imbalances
– Still contention at the physical layers
[Figure: a 16-core chip, and critical sections per transaction (0–80) for Shared-Everything, DORA, and PLP; DORA removes the lock manager component]
[Pan10]
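A minimal sketch of the logical-partitioning idea (my own simplification, not the DORA implementation of [Pan10]): each worker thread owns a range of the key space, and actions on a key are routed to the owner's queue instead of competing in a centralized lock manager; repartitioning is just an update of the routing table, since the data itself stays shared.

    #include <cstdint>
    #include <vector>

    // Each logical partition (a key range) is owned by exactly one worker thread.
    // A transaction touching key k queues that action at owner_of(k) instead of
    // taking a lock in a centralized lock manager.
    struct Partition { int owner_thread; std::uint64_t lo_key, hi_key; };

    class PartitionMap {
        std::vector<Partition> parts_;             // non-overlapping key ranges
    public:
        explicit PartitionMap(std::vector<Partition> p) : parts_(std::move(p)) {}

        int owner_of(std::uint64_t key) const {
            for (const Partition& p : parts_)
                if (key >= p.lo_key && key <= p.hi_key) return p.owner_thread;
            return -1;                             // unmapped: would trigger repartitioning
        }
        // Repartitioning against load imbalance is just an update of this table;
        // the records themselves stay in place (shared-everything underneath).
    };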
Predictable access patterns
39
[Figure: throughput (kTps, 0–120) vs. real CPU load (%, 0–100)]
Looming problem: physical page latches
Page latch contention • Shared-everything – Physiological Partitioning (PLP)
– Eliminates most of the contention at the physical layers – Fast repartitioning
[Figure: a 16-core chip, and critical sections per transaction (0–80) for Shared-Everything, DORA, and PLP; PLP also eliminates most of the page latch component]
[Pan11]
41
Logging is crucial for OLTP
• Transactions must – Write a log record describing every update – When ready to commit, write the log to disk!
• Great for single-thread performance – But not scalable! – Compromise performance or recoverability
42 * http://www.datacenterknowledge.com/archives/2010/05/13/car-crash-triggers-amazon-power-outage/
(e.g., Amazon outage*)
$$$
Need efficient and scalable logging solution
Why logging hurts scalability
• Working around the bottlenecks: – Asynchronous commit – Replace logging with replication and fail-over
43
(1) At commit, must yield for log flush: synchronous I/O on the critical path, locks held for a long time, two context switches per commit
(2) Must insert records into the log buffer: a centralized main-memory structure and a source of contention
[Diagram: CPU-1 … CPU-N with private caches; data and log in RAM; the log flushed to HDD]
Workarounds compromise durability
44
Attempts to scale Shore’s logmgr
[Figure: throughput (ktps, 0–12) vs. concurrent threads (0–32) for the baseline and for T&T&S-mutex and MCS-mutex variants of Shore's log manager]
Cannot scale by improving single-thread performance
[Jon10a]
Does “correct” logging have to be so slow?
• Locks held for a long time – Not actually used during the flush – Indirect way to enforce isolation (Early Lock Release)
• Two context switches per commit – Transactions nearly stateless at commit time – Easy to migrate transactions between threads
• Log buffer is a source of contention – Log orders incoming requests, not threads – Log records can be combined
45
Compose scalability by solving each problem
[Jon10]
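A minimal sketch of the commit path these observations suggest (hedged; the interfaces below are hypothetical stand-ins, not Shore or Aether APIs): locks are released as soon as the commit record is in the log buffer, and the acknowledgment to the client is held back until that record is durable.

    #include <cstdint>

    // Hypothetical interfaces, for illustration only.
    struct LogBuffer {
        std::uint64_t next_lsn = 0, durable_lsn = 0;
        std::uint64_t append_commit_record(int /*xct_id*/) { return ++next_lsn; }  // buffered, not yet durable
        void flush_up_to(std::uint64_t lsn) { if (durable_lsn < lsn) durable_lsn = lsn; }  // stand-in for the I/O
    };
    struct LockTable { void release_all(int /*xct_id*/) {} };
    struct Client    { void acknowledge_commit() {} };

    // Early Lock Release: the log flush is taken off the lock-holding critical path.
    void commit(int xct_id, LogBuffer& log, LockTable& locks, Client& client) {
        std::uint64_t commit_lsn = log.append_commit_record(xct_id); // (1) commit record buffered
        locks.release_all(xct_id);      // (2) locks released *before* the flush; dependent
                                        //     transactions proceed but are ordered after us in the log
        log.flush_up_to(commit_lsn);    // (3) wait (or hand off) until the record is durable
        client.acknowledge_commit();    // (4) only now does the user see the commit
    }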
[Legend: mutex held, start/finish, copy into buffer, waiting]
Alleviating Contention
46
(B) Baseline
(C) Consolidation array
(D) Decoupled buffer insert
(CD) All together
contention(work) = O(1); contention(# threads) = O(1)
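A heavily simplified sketch of the consolidation idea (assumptions and names are mine; the real consolidation array in Aether is lock-free and more subtle): threads that find a group open join it, and the group's leader pays for the log-buffer mutex once on behalf of everyone, so contention stops growing with the number of threads.

    #include <condition_variable>
    #include <memory>
    #include <mutex>

    // Threads that arrive while a group is open join it; the group's leader acquires
    // the contended log-buffer mutex once and reserves space for the whole group.
    class ConsolidatedLogReserve {
        struct Group {
            long total_bytes = 0;
            long base_offset = -1;                 // published by the leader
            std::condition_variable published;
        };
        std::mutex group_mutex_;                   // cheap: protects the open group
        std::shared_ptr<Group> open_;              // group currently accepting joiners
        std::mutex buffer_mutex_;                  // expensive: the contended log buffer
        long buffer_tail_ = 0;

    public:
        // Returns the buffer offset where the caller may copy its log record.
        long reserve(long bytes) {
            std::unique_lock<std::mutex> lk(group_mutex_);
            if (open_) {                           // follower: join the group and wait
                std::shared_ptr<Group> g = open_;
                long my_offset = g->total_bytes;
                g->total_bytes += bytes;
                g->published.wait(lk, [&] { return g->base_offset >= 0; });
                return g->base_offset + my_offset;
            }
            // Leader: open a group, let others join while we queue for the buffer,
            // then reserve space for the whole group in one critical section.
            auto g = std::make_shared<Group>();
            g->total_bytes = bytes;
            open_ = g;
            lk.unlock();
            std::lock_guard<std::mutex> buf(buffer_mutex_);
            lk.lock();
            open_.reset();                         // close the group to new joiners
            g->base_offset = buffer_tail_;
            buffer_tail_ += g->total_bytes;
            g->published.notify_all();
            return g->base_offset;                 // leader's record goes first
        }
    };

Each thread then copies its record into its reserved slot outside the mutex, which corresponds to the decoupled-insert half of the hybrid (CD) scheme.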
Performance as contention increases
47
[Figure: log insert rate (GB/s, log scale 0.01–10) vs. number of threads (1–64) for Baseline, Decoupled (D), Consolidation (C), and Hybrid (CD)]
Hybrid solution combines benefits of both
48
Log redesign = scalability
[Figure: throughput (ktps, 0–12) vs. concurrent threads (0–32) for Baseline, T&T&S mutex, MCS mutex, and Aether]
Scalability >> performance
How far can we go?
49
Scalability implies performance!
Sun Niagara T1, insert-only workload
[Figure: throughput per thread (tps/thread, log scale 0.1–10) vs. concurrent threads (0–32) for Shore-MT*, Shore, and a commercial DBMS, on a two-chip multicore machine]
*Shore-MT available at dias.epfl.ch
[Jon09]
50
Summary: Explicit parallelism • Keeping hardware contexts busy
– There's no escaping from Amdahl – Make it scale if you want it to run fast
• Partitioning eliminates contention – Shared-nothing carries overhead – Shared-everything made fast with logical partitioning – Employ shared-everything on shared-nothing "islands"
• Concurrency ≠ parallelism – Find the right dimension / decouple logically unrelated operations
51
Future: The rise of the power wall • ILP era (ca. 1990)
• Multicore era (ca. 2000+) • Heterogeneous era (e.g., AMD Fusion)
Think global, act local
[Diagram: heterogeneous chips combining CPUs, NPUs, GPUs, and FPGAs, each with its own cache]
[Mul09] [He08] [Gol05]
Thank you!
Mike Carey, Alkis Polyzotis
Divesh Srivastava
Special thanks to…
for their comments!
References - I
[Ail99] A. Ailamaki, D. J. DeWitt, M. D. Hill, D. A. Wood: DBMSs on a Modern Processor: Where Does Time Go? VLDB 1999
[Ail01] A. Ailamaki, D. J. DeWitt, M. D. Hill, M. Skounakis: Weaving Relations for Cache Performance. VLDB 2001
[Bon05] P. Boncz, M. Zukowski, N. Nes: MonetDB/X100: Hyper-Pipelining Query Execution. CIDR 2005
[Che04] S. Chen, A. Ailamaki, P. B. Gibbons, T. C. Mowry: Improving Hash Join Performance through Prefetching. ICDE 2004
[Dew90] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H. Hsiao, R. Rasmussen: The Gamma Database Machine Project. IEEE TKDE 1990
[Gho05] A. Ghoting, G. Buehrer, S. Parthasarathy, D. Kim, A. Nguyen, Y. Chen, P. Dubey: Cache-conscious Frequent Pattern Mining on a Modern Processor. VLDB 2005
[Gol05] B. T. Gold, A. Ailamaki, L. Huston, B. Falsafi: Accelerating Database Operations Using a Network Processor. DaMoN 2005
[Har07] N. Hardavellas, I. Pandis, R. Johnson, N. Mancheril, A. Ailamaki, B. Falsafi: Database Servers on Chip Multiprocessors: Limitations and Opportunities. CIDR 2007
53
References - II
[Har08] S. Harizopoulos, D. J. Abadi, S. Madden, M. Stonebraker: OLTP Through the Looking Glass, and What We Found There. SIGMOD 2008
[Har05] S. Harizopoulos, V. Shkapenyuk, A. Ailamaki: QPipe: A Simultaneously Pipelined Relational Query Engine. SIGMOD 2005
[He08] B. He, K. Yang, R. Fang, M. Lu, N. K. Govindaraju, Q. Luo, P. V. Sander: Relational Joins on Graphics Processors. SIGMOD 2008
[Jon07] R. Johnson, N. Hardavellas, I. Pandis, N. Mancheril, S. Harizopoulos, K. Sabirli, A. Ailamaki, B. Falsafi: To Share or Not To Share? VLDB 2007
[Jon08] R. Johnson, I. Pandis, A. Ailamaki: Critical Sections: Re-emerging Scalability Concerns for Database Storage Engines. DaMoN 2008
[Jon09] R. Johnson, I. Pandis, A. Ailamaki: Improving OLTP Scalability Using Speculative Lock Inheritance. VLDB 2009
[Jon09a] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, B. Falsafi: Shore-MT: A Scalable Storage Manager for the Multicore Era. EDBT 2009
[Jon10] R. Johnson, I. Pandis, R. Stoica, M. Athanassoulis, A. Ailamaki: Aether: A Scalable Approach to Logging. PVLDB 2010
54
References - III
[Jon10a] R. Johnson, R. Stoica, A. Ailamaki, T. C. Mowry: Decoupling Contention Management from Scheduling. ASPLOS 2010
[Mul09] R. Müller, J. Teubner, G. Alonso: Data Processing on FPGAs. PVLDB 2009
[Pan10] I. Pandis, R. Johnson, N. Hardavellas, A. Ailamaki: Data-Oriented Transaction Execution. PVLDB 2010
[Pan11] I. Pandis, P. Tözün, R. Johnson, A. Ailamaki: PLP: Page Latch-free Shared-everything OLTP. Technical Report, EPFL DIAS, 2011 (available upon request)
[Sto05] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, S. B. Zdonik: C-Store: A Column-oriented DBMS. VLDB 2005
[Sto07] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, P. Helland: The End of an Architectural Era (It's Time for a Complete Rewrite). VLDB 2007
[Cur10] C. Curino, Y. Zhang, E. P. C. Jones, S. Madden: Schism: A Workload-Driven Approach to Database Replication and Partitioning. PVLDB 2010
[Jones10] E. P. C. Jones, D. J. Abadi, S. Madden: Low Overhead Concurrency Control for Partitioned Main Memory Databases. SIGMOD 2010
55