OLTP on Hardware Islands
Danica Porobic, Ippokratis Pandis*, Miguel Branco, Pınar Tözün, Anastasia Ailamaki
Data-Intensive Application and Systems Lab, EPFL; *IBM Research - Almaden


Page 1: OLTP on Hardware Islands

OLTP on Hardware Islands

Danica Porobic, Ippokratis Pandis*, Miguel Branco, Pınar Tözün, Anastasia Ailamaki

Data-Intensive Application and Systems Lab, EPFL; *IBM Research - Almaden

Page 2: OLTP on Hardware Islands

Hardware topologies have changed

Topology      Core-to-core latency
CMP           low (~10ns)
SMP of CMPs   variable (10-100ns)
SMP           high (~100ns)

[Diagram: cores of a multisocket multicore grouped into Islands]

Variable latencies affect performance & predictability

Should we favor shared-nothing software configurations?

Page 3: OLTP on Hardware Islands

The Dreaded Distributed Transactions

[Chart: throughput vs. % multisite transactions; shared-nothing throughput drops steeply while shared-everything stays flat]

• Shared-nothing OLTP configurations need to execute distributed transactions
  – Extra lock holding time
  – Extra system calls for communication
  – Extra logging
  – Extra 2PC logic/instructions
  – ...
• At least shared-everything (SE) provides robust performance
• Some opportunities for constructive interference (e.g. Aether logging)

Shared-nothing configurations are not a panacea!
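The extra costs listed above can be made concrete with a small sketch of the two-phase commit (2PC) protocol the slide refers to. The `Participant` class and its `prepare`/`commit`/`abort` methods are hypothetical in-memory stand-ins for remote database instances, not Shore-MT's actual interface:

```python
class Participant:
    """Toy in-memory 2PC participant; a real one is a remote instance."""
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Message round 1 (vote request): participant holds its locks
        # until the decision arrives -> extra lock holding time.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        # Message round 2 (decision broadcast).
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def run_2pc(participants):
    """Coordinator side: two message rounds plus a forced log write,
    none of which a single-site transaction pays."""
    if all(p.prepare() for p in participants):
        # A real coordinator force-logs the commit decision here
        # before answering the client -> extra logging.
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

Every call above maps to a system call and a network round trip in a real deployment, which is why distributed transactions dominate the cost breakdowns later in the deck.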

Page 4: OLTP on Hardware Islands

Deploying OLTP on Hardware Islands

HW + Workload -> Optimal OLTP configuration

Shared-everything
✓ Single address space, easy to program
✗ Random read-write accesses
✗ Many synchronization points
✗ Sensitive to latency

Shared-nothing
✓ Thread-local accesses
✗ Distributed transactions
✗ Sensitive to skew

[Chart: performance, best to worst, as a function of thread-to-core assignment]

Page 5: OLTP on Hardware Islands

Outline
• Introduction
• Hardware Islands
• OLTP on Hardware Islands
• Conclusions and future work

Page 6: OLTP on Hardware Islands

Multisocket multicores

[Diagram: two sockets, each with cores with private L1/L2 caches, a shared L3, and a memory controller, connected by inter-socket links]

Communication latency varies significantly:
• within a core's caches: <10 cycles
• within a socket (shared L3): ~50 cycles
• across sockets (inter-socket links): ~500 cycles

Page 7: OLTP on Hardware Islands

Placement of application threads

[Charts: counter microbenchmark throughput (Mtps) and TPC-C Payment throughput (Ktps) under OS, Spread, and Island thread placements, on the 8-socket x 10-core and 4-socket x 6-core machines; OS placement is unpredictable, and placements differ by 39-47%]

Islands-aware placement matters
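An islands-aware placement like the one measured here can be sketched as follows. The contiguous socket-to-core numbering and the `plan_island_placement` helper are illustrative assumptions (real OS core maps vary); on Linux, each worker could then pin itself with `os.sched_setaffinity`:

```python
def island_cores(socket_id, cores_per_socket):
    """Core IDs belonging to one hardware island (socket), assuming the
    common contiguous numbering: socket 0 -> cores 0..k-1, and so on."""
    start = socket_id * cores_per_socket
    return list(range(start, start + cores_per_socket))

def plan_island_placement(n_threads, n_sockets, cores_per_socket):
    """'Island' placement: fill one socket before spilling to the next,
    so communicating threads stay behind the same L3 and memory
    controller instead of crossing inter-socket links."""
    cores = [c for s in range(n_sockets)
               for c in island_cores(s, cores_per_socket)]
    return cores[:n_threads]

# On Linux, thread/process i would then pin itself to its planned core:
#   os.sched_setaffinity(0, {plan[i]})
```

A "Spread" placement would instead round-robin threads across sockets; the slide's point is that the OS default lands somewhere unpredictable in between.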

Page 8: OLTP on Hardware Islands

Impact of sharing data among threads

[Charts, log scale: counter microbenchmark throughput (Mtps) with a single counter, a counter per socket, and a counter per core; per-socket and per-core counters run 18.7x and 516.8x faster than a single counter (vs. the 8x and 80x expected from parallelism alone). TPC-C Payment, local-only: shared-nothing outperforms shared-everything by 4.5x. 8-socket x 10-core and 4-socket x 6-core machines]

Fewer sharers lead to higher performance
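The counter-per-core idea can be illustrated with a toy sharded counter. This is a Python sketch, not the benchmark's actual implementation: each thread increments only its own shard, so the locks are uncontended and there is no single hot cache line; reads aggregate across shards:

```python
import threading

class ShardedCounter:
    """Counter-per-thread sketch: increments touch a private shard,
    avoiding the cache-line ping-pong a single shared counter suffers."""
    def __init__(self, n_shards):
        self.shards = [0] * n_shards
        self.locks = [threading.Lock() for _ in range(n_shards)]

    def increment(self, shard_id):
        # Uncontended when each thread owns exactly one shard.
        with self.locks[shard_id]:
            self.shards[shard_id] += 1

    def value(self):
        # Reads pay the aggregation cost instead of the writers.
        return sum(self.shards)
```

The "counter per socket" variant is the same idea with one shard per island: writers within a socket still share, but no increment ever crosses an inter-socket link.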

Page 9: OLTP on Hardware Islands

Outline
• Introduction
• Hardware Islands
• OLTP on Hardware Islands
  – Experimental setup
  – Read-only workloads
  – Update workloads
  – Impact of skew
• Conclusions and future work

Page 10: OLTP on Hardware Islands

Experimental setup
• Shore-MT
  – Top-of-the-line open source storage manager
  – Enabled shared-nothing capability
• Multisocket servers
  – 4-socket, 6-core Intel Xeon E7530, 64GB RAM
  – 8-socket, 10-core Intel Xeon E7-L8867, 192GB RAM
• Disabled hyper-threading
• OS: Red Hat Enterprise Linux 6.2, kernel 2.6.32
• Profiler: Intel VTune Amplifier XE 2011
• Workloads: TPC-C, microbenchmarks

Page 11: OLTP on Hardware Islands

Microbenchmark workload
• Single-site version
  – Probe/update N rows from the local partition
• Multi-site version
  – Probe/update 1 row from the local partition
  – Probe/update N-1 rows uniformly from any partition
  – Partitions may reside on the same instance
• Input size: 10,000 rows/core
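The multi-site access pattern described above could be generated like this. `make_multisite_txn` and `is_distributed` are hypothetical helpers written to match the slide's description, not the paper's actual driver code:

```python
import random

def make_multisite_txn(local_partition, n_partitions, n_rows,
                       rows_per_partition=10_000, rng=random):
    """Row set for one multi-site microbenchmark transaction:
    1 row from the local partition, then N-1 rows from uniformly
    chosen partitions (which may again be the local one)."""
    rows = [(local_partition, rng.randrange(rows_per_partition))]
    for _ in range(n_rows - 1):
        part = rng.randrange(n_partitions)
        rows.append((part, rng.randrange(rows_per_partition)))
    return rows

def is_distributed(txn, instance_of_partition):
    """A transaction is distributed iff its rows span more than one
    instance; with coarse islands, several partitions share an instance,
    so fewer multi-site transactions actually need 2PC."""
    return len({instance_of_partition(p) for p, _ in txn}) > 1
```

Note how `is_distributed` captures the slide's point that "partitions may reside on the same instance": the same row set that forces 2PC under 24 islands can be entirely local under 4.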

Page 12: OLTP on Hardware Islands

Software System Configurations

• 1 Island: shared-everything
• 4 Islands: intermediate (e.g. one instance per socket)
• 24 Islands: shared-nothing (one instance per core)

Page 13: OLTP on Hardware Islands

Increasing % of multisite xcts: reads

[Chart: throughput (Ktps) vs. % multisite transactions (0-100) for 1, 4, and 24 Islands. At 0% multisite, 24 Islands win (no locks or latches) while 1 Island suffers contention for shared data; as the multisite fraction grows, 24 Islands pay the most messaging overhead while 4 Islands need fewer messages per xct]

Finer-grained configurations are more sensitive to distributed transactions

Page 14: OLTP on Hardware Islands

Where are the bottlenecks? Read case

[Chart: time per transaction (µs), 4 Islands, 10 rows, at 0%, 50%, and 100% multisite transactions, broken down into logging, locking, communication, xct management, and xct execution]

Communication overhead dominates

Page 15: OLTP on Hardware Islands

Increasing size of multisite xct: read case

[Chart: time per transaction (µs) vs. number of rows retrieved (0-100) for 1, 4, 12, and 24 Islands. Cost grows as more instances take part in each transaction, flattens once all instances are involved, then physical contention sets in]

Communication costs rise until all instances are involved in every transaction

Page 16: OLTP on Hardware Islands

Increasing % of multisite xcts: updates

[Chart: throughput (KTps) vs. % multisite transactions (0-100) for 1, 4, and 24 Islands. At 0% multisite, shared-nothing wins (no latches); multisite updates add two rounds of messages, extra logging, and longer-held locks]

Distributed update transactions are more expensive

Page 17: OLTP on Hardware Islands

Where are the bottlenecks? Update case

[Chart: time per transaction (µs), 4 Islands, 10 rows, at 0%, 50%, and 100% multisite transactions, broken down into logging, locking, communication, xct management, and xct execution]

Communication overhead dominates

Page 18: OLTP on Hardware Islands

Increasing size of multisite xct: update case

[Chart: time per transaction (µs) vs. number of rows updated (0-100) for 1, 4, 12, and 24 Islands. Cost grows with more instances per transaction and increased contention; 1 Island benefits from efficient logging with Aether*]

*R. Johnson, et al: Aether: a scalable approach to logging, VLDB 2010

Shared-everything exposes constructive interference

Page 19: OLTP on Hardware Islands

Effects of skewed input

[Charts: throughput (KTps) vs. skew factor (0-1) for 1, 4, and 24 Islands, local-only and 50% multisite. Under high skew, 24 Islands leave a few instances highly loaded; 1 Island can balance the load across cores but suffers contention for hot data]

4 Islands effectively balance skew and contention
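Skewed inputs of this kind are commonly drawn from a Zipf-like distribution. A minimal sketch follows, assuming the slide's skew factor is a Zipf exponent (skew 0 is uniform; the paper's exact generator may differ):

```python
import random

def zipf_weights(n, skew):
    """Zipf-like access weights over n partitions: skew=0 is uniform,
    larger skew concentrates accesses on the first (hot) partitions."""
    return [1.0 / (rank ** skew) for rank in range(1, n + 1)]

def pick_partition(n, skew, rng):
    """Draw the partition one transaction will hit."""
    return rng.choices(range(n), weights=zipf_weights(n, skew), k=1)[0]

def load_share_of_hottest(n, skew):
    """Fraction of all accesses absorbed by the hottest partition;
    at skew=0 this is 1/n, and it grows with the skew factor."""
    w = zipf_weights(n, skew)
    return w[0] / sum(w)
```

With 24 single-core instances, `load_share_of_hottest` is the load on one core under high skew; a 1-Island deployment can spread that same hot partition across all its threads, at the price of contention on the hot rows.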

Page 20: OLTP on Hardware Islands

OLTP systems on Hardware Islands
• Shared-everything: stable, but non-optimal
• Shared-nothing: fast, but sensitive to workload
• OLTP Islands: a robust middle ground
  – Runs on close cores
  – Small instances limit contention between threads
  – Few instances simplify partitioning
• Future work:
  – Automatically choose and set up the optimal configuration
  – Dynamically adjust to workload changes

Thank you!