Transcript
Page 1: Applying HTM to an OLTP System: No Free Lunchevent.cwi.nl/damon2015/slides/slides-cervini.pdfTransactional Synchronization eXtensions •On Intel’s Haswell •RTM (Restricted Transactional

Applying HTM to an OLTP System: No Free Lunch

David Cervini, Danica Porobic, Pınar Tözün, Anastasia Ailamaki

Page 2:

Why Hardware Transactional Memory?

• Multicores are here to stay

• Lock-based and lock-free programming are hard

• Transactional memory should ease programming

• Software transactional memory is not fast enough

Very promising for synchronization-heavy software

Page 3:

Why HTM and OLTP?

Many critical sections even for a simple transaction

[Figure: stacked bar chart. Title: Shore-MT, retrieving 1 row. Y-axis: number of critical sections (CSs) per transaction, 0 to 60. Labels: locking, latching, catalog, buffer pool, logging, xct manager, other.]

Page 4:

Promise:

– HTM simplifies lock-free programming

– Shore-MT relies on fine-grained locking

– Expect performance improvement

A match made in heaven?


Page 5:

Transactional Synchronization eXtensions

• On Intel’s Haswell

• RTM (Restricted Transactional Memory):

– Directly uses TM, more flexible

– _xbegin, _xabort, _xend, _xtest

– Requires new implementation

• HLE (Hardware Lock Elision):

– Speculative execution of existing locking code

– In GCC: add __ATOMIC_HLE_ACQUIRE or __ATOMIC_HLE_RELEASE to the memory order of an atomic builtin
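To illustrate the HLE bullet, here is a minimal sketch of a test-and-set spinlock whose acquire and release carry the HLE hints. The class `hle_spinlock` and the macros `ELIDE_ACQUIRE` and `ELIDE_RELEASE` are hypothetical names, not Shore-MT code. The sketch assumes GCC on x86; on other compilers or targets the hints are defined away and the lock behaves as an ordinary spinlock.

```cpp
#include <cassert>

// The HLE hint bits are a GCC extension on x86. Elsewhere, fall back to 0 so
// the same code builds as a plain spinlock.
#if defined(__ATOMIC_HLE_ACQUIRE) && defined(__ATOMIC_HLE_RELEASE)
#  define ELIDE_ACQUIRE __ATOMIC_HLE_ACQUIRE
#  define ELIDE_RELEASE __ATOMIC_HLE_RELEASE
#else
#  define ELIDE_ACQUIRE 0
#  define ELIDE_RELEASE 0
#endif

// Hypothetical test-and-set spinlock, for illustration only.
class hle_spinlock {
public:
    void acquire() {
        // With the hint, the CPU may elide the lock write and run the critical
        // section speculatively. Without HLE support the hint is ignored.
        while (__atomic_exchange_n(&locked_, 1, __ATOMIC_ACQUIRE | ELIDE_ACQUIRE)) {
            // Spin on a plain load until the lock looks free, then retry.
            while (__atomic_load_n(&locked_, __ATOMIC_RELAXED)) cpu_relax();
        }
    }

    void release() {
        assert(is_locked());  // releasing a free lock is a caller bug
        __atomic_store_n(&locked_, 0, __ATOMIC_RELEASE | ELIDE_RELEASE);
    }

    bool is_locked() const {
        return __atomic_load_n(&locked_, __ATOMIC_RELAXED) != 0;
    }

private:
    static void cpu_relax() {
#if defined(__x86_64__) || defined(__i386__)
        __builtin_ia32_pause();  // same effect as _mm_pause()
#endif
    }

    int locked_ = 0;
};
```

The existing locking code keeps its shape; only the memory-order arguments change, which is why the slide describes HLE as speculative execution of existing locking code.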


Page 6:

TSX in a nutshell

• Uses cache coherency of L1 cache

• Tracks data at cache line granularity

[Diagram labels: conflict aborts, plus capacity and misc aborts]
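Because tracking happens per cache line, two unrelated variables that share a line can cause a conflict abort. The sketch below contrasts a packed layout with a padded one. The struct names, `kCacheLine`, and `cache_line_of` are hypothetical, and the 64-byte line size is an assumption that holds for Haswell.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache-line size (64 bytes on Haswell).
constexpr std::size_t kCacheLine = 64;

// Both counters fit in one 16-byte aligned block, so they always share a
// cache line: a transactional write to `a` conflicts with access to `b`.
struct alignas(16) packed_counters {
    std::uint64_t a;
    std::uint64_t b;
};

// Each counter starts its own cache line, so transactions that touch only
// `a` do not conflict with transactions that touch only `b`.
struct padded_counters {
    alignas(kCacheLine) std::uint64_t a;
    alignas(kCacheLine) std::uint64_t b;
};

// Index of the cache line that holds address p.
inline std::uintptr_t cache_line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

static_assert(sizeof(padded_counters) == 2 * kCacheLine,
              "padded layout occupies two full cache lines");
```

Padding trades memory for fewer false conflicts, and every extra line also counts against the transaction's capacity.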

Page 7:

Experimental platform

• Software:

– Shore-MT

– TM-1 benchmark: GetSubscriberData

– From 1 to 8 workers

– SLI enabled

– 80,000-row dataset

• Hardware:

– Intel i7-4770 3.4 GHz 4-core processor, hyper-threading on

– 16 GB RAM

Page 8:

Which lock types are used?

Many lock types for different use-cases

[Figure: stacked bar chart. Title: Shore-MT, GetSubscriberData. Y-axis: number of critical sections (CSs) per transaction, 0 to 60. Component labels: locking, latching, catalog, buffer pool, logging, xct manager, other. Lock types: occ_rwlock, tatas, srwlock, mcs.]

Page 9:

HLE-enabled pthread rwlock instead of occ_rwlock

[Figure: line chart. X-axis: number of threads, 1 to 8. Y-axis: throughput (KTps), 0 to 800. Series: baseline, pthread_rwlock.]

No impact: pthread implementation is limited

Page 10:

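RTM-enabled lock example: acquire

    #include <immintrin.h>  // _xbegin, _xabort, _xend, _xtest, _mm_pause (GCC: build with -mrtm)
    #include <cstdlib>      // random, RAND_MAX

    void occ_rwlock::acquire() {
    #ifdef OCC_RWLOCK_RTM_WRAPPER
        unsigned int status;
        // Try the transactional path at most twice, then take the real lock.
        for (int i = 0; i < 2; i++) {
            if ((status = _xbegin()) == _XBEGIN_STARTED) {
                // Lock is in use: abort explicitly with code 0xff.
                if (has_reader()) { _xabort(0xff); }
                return;  // lock elided; the critical section runs as a transaction
            } else if ((status & _XABORT_EXPLICIT) && _XABORT_CODE(status) == 0xff) {
                // Our own abort: wait until the lock is free before retrying.
                while (__atomic_load_n(&_active_count, __ATOMIC_ACQUIRE))
                    _mm_pause();
            } else if (status & _XABORT_CONFLICT) {
                // Data conflict: back off for a short random time.
                long int backoff = 10 * random() / RAND_MAX;
                while (backoff--) _mm_pause();
            } else if (status & _XABORT_RETRY) {
                // Hardware reports that a retry may succeed.
                _mm_pause();
            } else {
                break;  // capacity or other abort: stop retrying
            }
        }
    #endif
        /** original acquire code here **/
    }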

Retry transaction multiple times

Tune retry policies

Avoid Lemming Effect

RTM is preferable for new code because it is more flexible
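The retry policy in acquire() can be read as a mapping from abort status to next action. The sketch below isolates that mapping so it can be examined without RTM hardware. The constants mirror the status bit layout that `<immintrin.h>` defines for `_xbegin()`, but under hypothetical names. `retry_action`, `classify_abort`, and `kLockBusyCode` are illustrative and not part of Shore-MT.

```cpp
// Status bits returned by _xbegin() after an abort, mirrored under
// hypothetical names so the sketch compiles without RTM support.
constexpr unsigned kAbortExplicit = 1u << 0;  // _XABORT_EXPLICIT
constexpr unsigned kAbortRetry    = 1u << 1;  // _XABORT_RETRY
constexpr unsigned kAbortConflict = 1u << 2;  // _XABORT_CONFLICT
constexpr unsigned kAbortCapacity = 1u << 3;  // _XABORT_CAPACITY

// Equivalent of _XABORT_CODE: the 8-bit value passed to _xabort().
constexpr unsigned abort_code(unsigned status) { return (status >> 24) & 0xffu; }

// The code that acquire() passes to _xabort when the lock is in use.
constexpr unsigned kLockBusyCode = 0xff;

enum class retry_action {
    wait_for_lock,    // spin until the lock is free, then retry
    random_backoff,   // pause for a short random time, then retry
    pause_and_retry,  // single pause, then retry
    fall_back         // give up and take the real lock
};

// Same branch order as the slide's acquire(): explicit abort first, then
// conflict, then the retry hint, otherwise fall back.
inline retry_action classify_abort(unsigned status) {
    if ((status & kAbortExplicit) && abort_code(status) == kLockBusyCode)
        return retry_action::wait_for_lock;
    if (status & kAbortConflict) return retry_action::random_backoff;
    if (status & kAbortRetry) return retry_action::pause_and_retry;
    return retry_action::fall_back;
}
```

Under this policy a capacity abort falls straight through to the lock, since repeating a transaction that does not fit is unlikely to help.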

Page 11:

RTM-enabled lock example: release

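Ending a transaction requires no tuning

    void occ_rwlock::release() {
    #ifdef OCC_RWLOCK_RTM_WRAPPER
        // _xend() is only valid inside a transaction, so test for one first.
        if (_xtest() && !has_reader()) {
            _xend();  // commit the elided critical section
            return;
        }
    #endif
        /** original release code here **/
    }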

Must be inside a transaction

Page 12:

RTM locks: good & bad news

[Figure: line chart. X-axis: number of threads, 1 to 8. Y-axis: throughput (KTps), 0 to 800. Series: baseline; occ, tatas, mcs.]

HTM improves throughput 13-18%

[Figure: line chart. X-axis: number of threads, 1 to 8. Y-axis: throughput (KTps), 0 to 800. Series: baseline; occ, tatas, mcs, srwlock.]

Improvement is not guaranteed

Page 13:

Reason: aborts

[Figure: two line charts. X-axis: number of threads, 1 to 8. Y-axis: percent of aborts, 0 to 25. Series: total, capacity, conflict.]

Hyper-threading causes capacity aborts

Real data conflicts cause high abort rates
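A rough calculation suggests why hyper-threading raises capacity aborts. The sketch assumes the Haswell L1 data cache geometry (32 KiB, 8-way, 64-byte lines), that a transaction's write set must fit in L1, and that two hardware threads split the cache evenly. The last point is a simplification, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed Haswell L1 data cache geometry.
constexpr std::size_t kL1Bytes   = 32 * 1024;
constexpr std::size_t kLineBytes = 64;
constexpr std::size_t kWays      = 8;

constexpr std::size_t kL1Lines = kL1Bytes / kLineBytes;  // 512 lines
constexpr std::size_t kL1Sets  = kL1Lines / kWays;       // 64 sets

// Upper bound on the cache lines one transaction can write when
// `threads_per_core` hardware threads share the L1 evenly.
// Real limits are lower: one full set is enough to force an eviction.
inline std::size_t max_write_lines(std::size_t threads_per_core) {
    assert(threads_per_core > 0);
    return kL1Lines / threads_per_core;
}
```

Under these assumptions the bound halves from 512 to 256 lines once both hardware threads of a core are active, which on the 4-core test machine happens beyond 4 worker threads.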

Page 14:

Coarse-grained B-tree lock

Throughput drops by up to 73%

Large critical sections make aborts very expensive

[Figure: line chart. X-axis: number of threads, 1 to 8. Y-axis: throughput (KTps), 0 to 800. Series: baseline, coarse-grained lock.]

[Figure: line chart. X-axis: number of threads, 1 to 8. Y-axis: percent of aborts, 0 to 25. Series: total, capacity, conflict.]
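A toy cost model shows why long critical sections make aborts expensive. It is not taken from the slides: the function name, its parameters, and the assumption that every attempt aborts independently with the same probability are all hypothetical.

```cpp
#include <cassert>

// Toy model with hypothetical parameters:
//   section_time  : time to run the critical section once
//   abort_prob    : probability that one transactional attempt aborts
//   max_attempts  : transactional attempts before falling back to the lock
//   fallback_time : time for the lock-based path (section plus lock overhead)
//   wasted_frac   : fraction of the section executed before an abort hits
inline double expected_section_time(double section_time, double abort_prob,
                                    int max_attempts, double fallback_time,
                                    double wasted_frac = 1.0) {
    assert(abort_prob >= 0.0 && abort_prob <= 1.0);
    double expected = 0.0;
    double prob_all_aborted = 1.0;  // probability that attempts 0..k-1 aborted
    for (int k = 0; k < max_attempts; ++k) {
        // Attempt k commits after k wasted attempts.
        expected += prob_all_aborted * (1.0 - abort_prob) *
                    (k * wasted_frac * section_time + section_time);
        prob_all_aborted *= abort_prob;
    }
    // Every attempt aborted: pay for all of them, then take the lock.
    expected += prob_all_aborted *
                (max_attempts * wasted_frac * section_time + fallback_time);
    return expected;
}
```

In this model the time lost to aborts scales with the section length, so at the same abort rate a section ten times longer loses ten times as much work.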

Page 15:

Applying HTM to an OLTP system

• Promise

– TSX democratizes lock-free programming

– Shore-MT relies on fine-grained locking

– Possible match made in heaven

• Reality

– Low-hanging fruit: TSX is great for short critical sections

– Requires tuning and is not always beneficial

– Cannot be used for large code sections

– Realizing full benefits requires system redesign

Thank you!

