
Hardware Transactional Memory

Royi Maimon, Merav Havuv

27/5/2007


References

M. Herlihy and J. E. B. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures", ISCA 1993.

C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, Charles E. Leiserson, and Sean Lie, "Unbounded Transactional Memory", HPCA 2005.

Hammond, Wong, Chen, Carlstrom, Davis, et al., "Transactional Memory Coherence and Consistency", ISCA, June 2004.


Today

What are transactions?

What is Hardware Transactional Memory?

Various implementations of HTM


Outline

Lock-free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions


Outline

Lock-free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions


Lock-free

A shared data structure is lock-free if its operations do not require mutual exclusion.

If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object.


Lock-free (cont.)

Lock-free data structures avoid common problems associated with conventional locking techniques in highly concurrent systems:

– Priority inversion

– Convoying: a process holding a lock is descheduled, and other runnable processes are then unable to make progress.

– Deadlock


Priority inversion

Priority inversion occurs when a lower-priority process is preempted while holding a lock needed by higher-priority processes.


Deadlock

Deadlock: two or more processes wait indefinitely for an event that can be caused by only one of the waiting processes.

Let S and Q be two resources:

    P0: Lock(S); Lock(Q)
    P1: Lock(Q); Lock(S)
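A minimal sketch of this interleaving using POSIX threads, with the mutexes S and Q standing in for the two resources (illustrative code, not from the talk). If P0 takes S and P1 takes Q before either takes its second lock, both block forever:

    #include <pthread.h>
    #include <unistd.h>

    pthread_mutex_t S = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t Q = PTHREAD_MUTEX_INITIALIZER;

    void *p0(void *arg) {                  /* P0: Lock(S), then Lock(Q) */
        (void)arg;
        pthread_mutex_lock(&S);
        sleep(1);                          /* give P1 time to take Q */
        pthread_mutex_lock(&Q);            /* blocks forever if P1 holds Q */
        pthread_mutex_unlock(&Q);
        pthread_mutex_unlock(&S);
        return 0;
    }

    void *p1(void *arg) {                  /* P1: Lock(Q), then Lock(S) */
        (void)arg;
        pthread_mutex_lock(&Q);
        sleep(1);                          /* give P0 time to take S */
        pthread_mutex_lock(&S);            /* blocks forever if P0 holds S */
        pthread_mutex_unlock(&S);
        pthread_mutex_unlock(&Q);
        return 0;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, 0, p0, 0);
        pthread_create(&t1, 0, p1, 0);
        pthread_join(t0, 0);               /* never returns: deadlock */
        pthread_join(t1, 0);
        return 0;
    }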


Outline

Lock-free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions


What is a transaction?

A transaction is a sequence of memory loads and stores executed by a single process that either commits or aborts

If a transaction commits, all the loads and stores appear to have executed atomically

If a transaction aborts, none of its stores take effect.

A transaction's operations are not visible to other processes until it commits or aborts.


Transactions properties:

A transaction satisfies the following properties:
– Serializability: transactions appear to execute in some serial order, never interleaved with one another.
– Atomicity: a transaction either commits, making all of its changes visible at once, or aborts, discarding them all.

This is a simplified version of the traditional ACID properties of database transactions (Atomicity, Consistency, Isolation, and Durability).


Transactional Memory

A new multiprocessor architecture.

The goal: implement lock-free synchronization that is
– efficient
– easy to use
compared with conventional techniques based on mutual exclusion.

Implemented by straightforward extensions to multiprocessor cache-coherence protocols.


An Example

With locks:

    if (i < j) { a = i; b = j; }
    else       { a = j; b = i; }
    Lock(L[a]);
    Lock(L[b]);
    Flow[i] = Flow[i] - X;
    Flow[j] = Flow[j] + X;
    Unlock(L[b]);
    Unlock(L[a]);

With transactional memory:

    StartTransaction;
    Flow[i] = Flow[i] - X;
    Flow[j] = Flow[j] + X;
    EndTransaction;


Transactional Memory

Transactions execute in commit order

[Diagram: Transaction A (ld 0xdddd, st 0xbeef) commits first; Transaction B (ld 0xdddd, ld 0xbbbb) commits next with no conflict; Transaction C has already done ld 0xbeef, so A's committed store to 0xbeef is a violation, and C re-executes with the new data.]


Outline

Lock-free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions


Cache-Coherence Protocol

A protocol for managing the caches of a multiprocessor system so that:
– no data is lost, and
– no data is overwritten before it is transferred from a cache to the target memory.

In a multiprocessor, each processor may have its own memory cache that is separate from the shared memory.


The Problem (Cache-Coherence)

The problem can be solved in either of two ways:
– a directory-based protocol
– a snooping (bus-based) protocol


Snoopy Cache

All caches watch (snoop on) the activity on a global bus to determine whether they hold a copy of the block of data being requested on the bus.


Directory-based

The data being shared is placed in a common directory that maintains the coherence between caches.

The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache.

When an entry is changed, the directory either updates or invalidates the other caches that hold that entry.


Outline

Lock-free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions


How Does It Work?

The following primitive instructions for accessing memory are provided:

Load-transactional (LT): reads the value of a shared memory location into a private register.

Load-transactional-exclusive (LTX): like LT, but "hints" that the location is likely to be modified.

Store-transactional (ST): tentatively writes a value from a private register to a shared memory location.

Commit (COMMIT): attempts to make the transaction's tentative changes permanent.

Abort (ABORT): discards the transaction's tentative changes.

Validate (VALIDATE): tests the current transaction status.


Some definitions

Read set: the set of locations read by a transaction using LT.

Write set: the set of locations accessed by a transaction using LTX or ST.

Data set (footprints): the union of the read and write sets.

A set of values in memory is inconsistent if it couldn’t have been produced by any serial execution of transactions


Intended Use

Instead of acquiring a lock, executing the critical section, and releasing the lock, a process:

1. uses LT or LTX to read from a set of locations,
2. uses VALIDATE to check that the values read are consistent,
3. uses ST to modify a set of locations,
4. uses COMMIT to make the changes permanent.

If either VALIDATE or COMMIT fails, the process returns to step 1.
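A sketch of this recipe applied to the earlier Flow[] transfer example. The LT/LTX/ST/VALIDATE/COMMIT primitives are assumed here to be exposed to C as intrinsics; the declarations below are illustrative, not a real API:

    /* Assumed intrinsic wrappers for the primitives above (illustrative). */
    extern unsigned LT (unsigned *addr);              /* load-transactional           */
    extern unsigned LTX(unsigned *addr);              /* load-transactional-exclusive */
    extern void     ST (unsigned *addr, unsigned v);  /* store-transactional          */
    extern int      VALIDATE(void);                   /* values read still consistent? */
    extern int      COMMIT(void);                     /* try to commit; 1 on success  */

    void transfer(unsigned *Flow, int i, int j, unsigned X) {
        for (;;) {
            unsigned fi = LTX(&Flow[i]);   /* 1. read the locations (LTX: we   */
            unsigned fj = LTX(&Flow[j]);   /*    intend to modify them)        */
            if (!VALIDATE())               /* 2. are the values consistent?    */
                continue;                  /*    no: start over                */
            ST(&Flow[i], fi - X);          /* 3. tentative writes              */
            ST(&Flow[j], fj + X);
            if (COMMIT())                  /* 4. try to make them permanent    */
                return;
            /* commit failed: retry from step 1 (backoff omitted) */
        }
    }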


Implementation

Transactional memory is implemented by modifying standard multiprocessor cache coherence protocols

We describe here how to extend “snoopy” cache protocol for a shared bus to support transactional memory

Our transactions are short-lived activities with relatively small data sets.


The basic idea

Any protocol capable of detecting accessibility conflicts can also detect transaction conflicts at no extra cost.

Once a transaction conflict is detected, it can be resolved in a variety of ways


Implementation

Each processor maintains two caches:
– a regular cache for non-transactional operations,
– a transactional cache for transactional operations. It holds all the tentative writes, without propagating them to other processors or to main memory until commit.

Why use two caches?


Cache line states

Each cache line (regular or transactional) has one of the usual coherence-protocol states (in the original proposal: INVALID, VALID, DIRTY, RESERVED).

The transactional cache extends these with a transactional tag per entry: EMPTY (no data), NORMAL (committed data), XCOMMIT (discard on commit), and XABORT (discard on abort).


Cleanup

When the transactional cache needs space for a new entry, it searches for:
– an EMPTY entry,
– if none is found, a NORMAL entry,
– and finally an XCOMMIT entry.
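A small sketch of this search order (the entry states and the array below are illustrative, not the actual hardware structure):

    enum tstate { EMPTY, NORMAL, XCOMMIT, XABORT };

    struct tentry { enum tstate state; unsigned tag; unsigned data; };

    /* Return the index of the entry to reuse, or -1 if none qualifies. */
    int pick_victim(const struct tentry *tcache, int n) {
        const enum tstate order[3] = { EMPTY, NORMAL, XCOMMIT };
        for (int pass = 0; pass < 3; pass++)
            for (int i = 0; i < n; i++)
                if (tcache[i].state == order[pass])
                    return i;
        return -1;               /* only XABORT entries remain */
    }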


Processor actions

Each processor maintains two flags:
– The transaction active (TACTIVE) flag indicates whether a transaction is in progress.
– The transaction status (TSTATUS) flag indicates whether that transaction is active (True) or aborted (False).

Non-transactional operations behave exactly as in the original cache-coherence protocol.


Example – LT operation:

1. Look for an XABORT entry in the transactional cache. If found, return its value.

2. Otherwise, look for a NORMAL entry. If found, change it to XABORT, allocate a second copy marked XCOMMIT, and return the value.

3. Otherwise (cache miss), ask to read the block from shared memory:
– On a successful read, create two entries, XABORT and XCOMMIT, and return the value.
– On an unsuccessful read, abort the transaction: set TSTATUS = FALSE, drop all XABORT entries, and set all XCOMMIT entries to NORMAL.
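The same flow as a compact C sketch. The tcache array, mem_read() and the simplified allocation below are illustrative stand-ins for the hardware, not the paper's actual interface:

    #include <stdbool.h>

    enum tcstate { EMPTY, NORMAL, XCOMMIT, XABORT };
    struct tline { enum tcstate st; unsigned tag; unsigned data; };

    #define TC_SIZE 64
    static struct tline tc[TC_SIZE];
    static bool tstatus = true;                       /* TSTATUS flag */

    static struct tline *find(unsigned tag, enum tcstate st) {
        for (int i = 0; i < TC_SIZE; i++)
            if (tc[i].st == st && tc[i].tag == tag) return &tc[i];
        return 0;
    }

    static void alloc(unsigned tag, unsigned data, enum tcstate st) {
        for (int i = 0; i < TC_SIZE; i++)
            if (tc[i].st == EMPTY) {                  /* victim search simplified */
                tc[i].tag = tag; tc[i].data = data; tc[i].st = st;
                return;
            }
    }

    /* Stand-in for the bus read; a real system could get a BUSY response
       from another processor, modeled here by returning false. */
    static bool mem_read(unsigned tag, unsigned *val) { (void)tag; *val = 0; return true; }

    /* Caller must check tstatus afterwards: 0 is also returned on abort. */
    unsigned lt_access(unsigned tag) {
        struct tline *e;
        unsigned v;
        if ((e = find(tag, XABORT)))                  /* 1. tentative copy present  */
            return e->data;
        if ((e = find(tag, NORMAL))) {                /* 2. committed copy present  */
            e->st = XABORT;
            alloc(tag, e->data, XCOMMIT);             /*    keep a backup copy      */
            return e->data;
        }
        if (mem_read(tag, &v)) {                      /* 3. miss: read from memory  */
            alloc(tag, v, XABORT);
            alloc(tag, v, XCOMMIT);
            return v;
        }
        tstatus = false;                              /* BUSY: abort the transaction */
        for (int i = 0; i < TC_SIZE; i++) {
            if (tc[i].st == XABORT)  tc[i].st = EMPTY;
            if (tc[i].st == XCOMMIT) tc[i].st = NORMAL;
        }
        return 0;
    }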


Snoopy cache actions:

Both the regular cache and the transactional cache snoop on the bus.

A cache ignores any bus cycles for lines not in that cache.

The transactional cache's behavior:
– If TSTATUS = False, or if the operation isn't transactional, the cache acts just like the regular cache, but ignores entries with a state other than NORMAL.
– On an LT from another CPU, if the state is VALID the cache returns the value; for all other transactional operations it returns BUSY.
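A sketch of that snoop response. The bus-cycle names T_READ/T_RFO loosely follow the paper's protocol, and the types here are illustrative, not the real hardware interface:

    enum cstate { INVALID, VALID, DIRTY, RESERVED };   /* line coherence states */
    enum busop  { READ, RFO, T_READ, T_RFO };          /* T_* = transactional cycles */
    enum reply  { IGNORE, SUPPLY_DATA, BUSY };

    /* Response of the transactional cache to a snooped bus cycle for a line it
       holds (non-transactional cycles and the TSTATUS=False case are omitted). */
    enum reply snoop_reply(enum busop op, enum cstate line_state) {
        if (op == T_READ && line_state == VALID)
            return SUPPLY_DATA;            /* another CPU's LT: give it the data */
        if (op == T_READ || op == T_RFO)
            return BUSY;                   /* any other transactional request */
        return IGNORE;
    }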


Outline

Lock-free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions


Simulation

We'll see example code for the producer/consumer algorithm using the transactional memory architecture.

The simulation runs on both cache coherence protocols: snoopy cache and directory-based.

The simulation uses 32 processors and finishes when 2^16 operations have completed.


Part Of Producer/Consumer Code

    typedef struct {
        Word deqs;                 // Holds the head's index
        Word enqs;                 // Holds the tail's index
        Word items[QUEUE_SIZE];
    } queue;

    unsigned queue_deq(queue *q) {
        unsigned head, tail, result;
        unsigned backoff = BACKOFF_MIN;
        unsigned wait;
        while (1) {
            result = QUEUE_EMPTY;
            tail = LTX(&q->enqs);
            head = LTX(&q->deqs);
            if (head != tail) {                            /* queue not empty? */
                result = LT(&q->items[head % QUEUE_SIZE]);
                ST(&q->deqs, head + 1);                    /* advance counter */
            }
            if (COMMIT()) break;
            /* abort => backoff */
            wait = random() % (1 << backoff);
            while (wait--);
            if (backoff < BACKOFF_MAX) backoff++;
        }
        return result;
    }


The results:


So Far:

In both HTM and STM, transactions shouldn't touch many memory locations.

There is a (small) bound on a transaction's footprint.

In addition, there is a duration limit.


Outline

Lock-free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions


Unbounded Transactional Memory (UTM)

UTM is a newer proposal that supports transactions of arbitrary footprint and duration.

The UTM architecture allows:
– transactions as large as virtual memory,
– transactions of unlimited duration,
– transactions that can migrate between processors.

UTM supports a semantics for nested transactions.

In contrast to the previous HTM implementation, UTM is optimized for transactions below a certain size but still operates correctly for larger transactions.


The Goal of UTM

The primary goals:
– make concurrent programming easier,
– reduce implementation overhead.

Why do we want unbounded TM? Neither programmers nor compilers can easily cope with an imposed hard limit on transaction size.


UTM architecture

The transaction log: a data structure that maintains bookkeeping information for a transaction.

Why is it needed?
– It enables a transaction to survive time-slice interrupts.
– It enables a transaction to migrate from one processor to another.


Two new instructions

All the programmer must specify is where a transaction begins and ends.

XBEGIN pc
– Begin a new transaction. The entry point to an abort handler is specified by pc.
– If the transaction must fail, roll back the processor and memory state to what it was when XBEGIN was executed, and jump to pc.
– We can think of an XBEGIN instruction as a conditional branch to the abort handler.

XEND
– End the current transaction. If XEND completes, the transaction is committed and appears atomic.
– Nested transactions are subsumed into the outer transaction.


Transaction Semantics

Transaction A:

    L1: XBEGIN L1
        ADD R1, R1, R1
        ST  1000, R1
        XEND

Transaction B:

    L2: XBEGIN L2
        ADD R1, R1, R1
        ST  2000, R1
        XEND

Two transactions:
– "A" has an abort handler at L1
– "B" has an abort handler at L2

Here the abort handler is a very simplistic retry: it simply restarts the transaction.


Register renaming

A name dependence occurs when two instructions Inst1 and Inst2 use the same register (or memory location), but no data is transmitted between them.

If the register is renamed so that Inst1 and Inst2 do not conflict, the two instructions can execute simultaneously or be reordered.

This technique of eliminating name dependences in registers is called register renaming.

Register renaming can be done statically (by the compiler) or dynamically (by the hardware).


Rolling back processor state

After an XBEGIN instruction, we take a snapshot of the register-rename table.

To keep track of busy registers, we maintain an S (saved) bit for each physical register, indicating which registers are part of the active transaction; the S bits are included with every rename-table snapshot.

An active transaction's abort handler address, nesting depth, and snapshot are part of its transactional state.
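An illustrative sketch of that per-transaction processor state as C structures; the field names and sizes are assumptions, not the paper's exact format:

    #include <stdint.h>

    #define ARCH_REGS 32
    #define PHYS_REGS 128

    struct rename_snapshot {
        uint8_t  map[ARCH_REGS];              /* architectural -> physical register map */
        uint64_t s_bits[PHYS_REGS / 64];      /* S (saved) bit per physical register */
    };

    struct xact_cpu_state {
        uintptr_t              abort_pc;      /* handler address from XBEGIN pc */
        unsigned               nesting_depth; /* nested XBEGINs are subsumed */
        struct rename_snapshot snapshot;      /* taken when XBEGIN executes */
    };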


Memory State

UTM represents the set of active transactions with a single data structure held in system memory, the x-state (short for “transaction state”).


Xstate Implementation

The x-state contains a transaction log for each active transaction in the system.

Each log consists of:
– A commit record, which maintains the transaction's status: PENDING, COMMITTED, or ABORTED.
– A vector of log entries, one for each memory block the transaction has read or written. Each entry provides:
  - a pointer to the block,
  - the block's old value (for rollback),
  - a pointer to the commit record,
  - pointers that form a linked list of all entries, in all transaction logs, that refer to the same block (the reader list).
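A rough sketch of this bookkeeping as C structures; the types and field names are illustrative, not the UTM hardware layout:

    #include <stddef.h>
    #include <stdint.h>

    enum xstatus { PENDING, COMMITTED, ABORTED };

    struct log_entry {
        void             *block;          /* pointer to the memory block */
        uint8_t           old_value[64];  /* old contents of the block (for rollback) */
        enum xstatus     *commit_rec;     /* pointer to the owning commit record */
        struct log_entry *next_reader;    /* reader list: next entry, in any */
                                          /* transaction log, for the same block */
    };

    struct xact_log {
        enum xstatus      commit_record;  /* PENDING / COMMITTED / ABORTED */
        struct log_entry *entries;        /* vector of log entries */
        size_t            n_entries;
    };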


Xstate Implementation (Cont)

The final part of the x-state consists of, for each memory block:
– a log pointer
– a read-write (RW) bit


X-state Data Structure

[Diagram: two transaction logs, both with PENDING commit records. Each log entry holds the block's old value, a block pointer, a reader list, and a commit-record pointer. Every block of application memory carries a log pointer and an RW bit (R or W); in the example, a block written by transaction 1 holds the new value 43 while its log entry records the old value 42, and a block read by both transactions (value 32) is marked R with its reader list linking the two logs.]


More on x-state

When a processor references a block that is already part of a pending transaction, the system checks the RW bit and the log pointer to determine the correct action:
– use the old value,
– use the new value, or
– abort the transaction.


Commit action

[Diagram: to commit, transaction 1 atomically changes its commit record from PENDING to COMMITTED; the new value it has already written to application memory (43) thereby becomes permanent.]


Cleanup action

[Diagram: cleanup after commit. Transaction 1's log, now COMMITTED, is cleaned up and its entries released, while transaction 2's log remains PENDING.]


Abort action

[Diagram: to abort, transaction 1 changes its commit record from PENDING to ABORTED, and each block it wrote is rolled back to the old value (42) stored in its log entry.]


Transactions Conflict

A conflict occurs when two or more pending transactions have accessed the same block and at least one of the accesses is a write.

Performing a transactional load:
– check that the log pointer refers to an entry in the current transaction's log, or that the RW bit is R.

Performing a transactional store:
– check that the log pointer references no other transaction.

In case of a conflict, some of the conflicting transactions are aborted.
– Which transaction should be aborted?
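A sketch of those two checks, with the per-block log pointer abstracted to an owner id (illustrative only; the real x-state uses pointers into the logs and reader lists):

    enum rw_bit { RW_R, RW_W };

    struct block_desc {
        int         owner;   /* id of a transaction whose log covers the block,
                                or -1 if no pending transaction has touched it */
        enum rw_bit rw;
    };

    /* Transactional load: ok if the block is uncovered, covered by us,
       or only being read by other transactions. */
    int load_ok(const struct block_desc *b, int me) {
        return b->owner == -1 || b->owner == me || b->rw == RW_R;
    }

    /* Transactional store: ok only if no other transaction covers the block. */
    int store_ok(const struct block_desc *b, int me) {
        return b->owner == -1 || b->owner == me;
    }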


Caching

For small transactions that fit in the cache, UTM, like earlier proposed HTM systems, uses the cache coherence protocol to identify conflicts.

For transactions too big to fit in the cache, the x-state for the transaction overflows into the ordinary memory hierarchy.

– Most log entries never need to be created.

– A transaction log need only be created when the transaction overflows the cache.


UTM’s Goal

Support transactions that:
– run for an indefinite length of time,
– migrate from one processor to another,
– have footprints bigger than physical memory.

The main technique proposed is to treat the x-state as a system-wide data structure that uses global virtual addresses.


Benefits and Limits of UTM

Limits:
– complicated implementation

Benefits:
– unlimited footprint
– unlimited duration
– migration possible
– good performance in the common case (small transactions)


Outline

Lock-free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions


LTM: Visible, Large, Frequent, Scalable

"Large Transactional Memory"
– Not truly unbounded, but simple and cheap.

Minimal architectural changes, high performance:
– small modifications to the cache and processor core,
– no changes to main memory or the cache coherence protocol,
– can be pin-compatible with conventional processors.


LTM's Restrictions:

A transaction's footprint is limited to (nearly) the size of physical memory.

A transaction's duration must be less than a time slice.

Transactions cannot migrate between processors.

With these restrictions, LTM can be implemented by modifying only the cache and the processor core.


LTM vs UTM

Like UTM, LTM maintains data about pending transactions in the cache and detects conflicts using the cache coherency protocol

Unlike UTM, LTM does not treat the transaction as a data structure. Instead, it binds a transaction to a particular cache.

– Transactional data overflows from the cache into a hash table in main memory

LTM and UTM have similar semantics: XBEGIN and XEND instructions are the same

In LTM, the cache plays a major part…


Addition to Cache

LTM adds a bit (T) per cache line to indicate that the data has been accessed as part of a pending transaction.

An additional bit (O) is added per cache set to indicate that it has overflowed.
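An illustrative sketch of the added metadata, one T bit per line and one O bit per set; the sizes and names below are assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS       4
    #define LINE_BYTES 64

    struct ltm_line {
        bool     t;                     /* touched by the pending transaction */
        bool     valid;
        uint64_t tag;
        uint8_t  data[LINE_BYTES];
    };

    struct ltm_set {
        bool            o;              /* some transactional line of this set
                                           has overflowed into main memory */
        struct ltm_line way[WAYS];
    };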


Cache overflow mechanism

Example instruction sequence:

    ST 1000, 55
    XBEGIN L1
    LD R1, 1000
    ST 2000, 66
    ST 3000, 77
    LD R1, 1000
    XEND

[Diagrams: each cache line has O/T/tag/data fields; an overflow hashtable in main memory holds spilled (key, data) pairs.]

– ST 1000, 55 (before the transaction): line 1000 = 55 is cached with its T bit clear; the overflow hashtable is empty.
– Recording reads (LD R1, 1000): the load hits line 1000 and sets its T bit.
– Recording writes (ST 2000, 66): line 2000 = 66 is allocated with its T bit set.
– Spilling (ST 3000, 77): the set is full of transactional lines, so the set's O bit is set, line 1000 = 55 is spilled to the overflow hashtable, and line 3000 = 77 is installed with its T bit set.
– Miss handling (second LD R1, 1000): the access misses in the cache; because the O bit is set, the overflow hashtable is searched and entry 1000 = 55 is swapped back into the cache (line 3000 = 77 moves to the hashtable).
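A sketch of the spill and miss-handling steps walked through above, reusing the illustrative ltm_line/ltm_set types from the earlier sketch; ovf_insert and ovf_remove stand in for operations on the in-memory overflow hashtable and are assumptions, not LTM's real interface:

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed helpers over the overflow hashtable in main memory. */
    extern void ovf_insert(uint64_t tag, const uint8_t *data);
    extern bool ovf_remove(uint64_t tag, uint8_t *data_out);

    /* Evict one transactional line from a full set to make room. */
    struct ltm_line *spill_one(struct ltm_set *set) {
        struct ltm_line *victim = &set->way[0];   /* replacement policy simplified */
        set->o = true;                            /* remember that the set overflowed */
        ovf_insert(victim->tag, victim->data);
        victim->valid = false;
        return victim;                            /* caller refills this way */
    }

    /* On a cache miss, consult the overflow hashtable if the set overflowed. */
    bool overflow_lookup(struct ltm_set *set, uint64_t tag, uint8_t *data_out) {
        if (!set->o)
            return false;                  /* ordinary miss: fetch from memory */
        return ovf_remove(tag, data_out);  /* found in overflow: swap it back in */
    }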


LTM - Summary

Transactions as large as physical memory

Scalable overflow and commit

Easy to implement!

Low overhead


Outline

Lock-free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions


Transactional Memory Coherence and Consistency (TCC)

Hammond, Wong, Chen, Carlstrom, Davis, et al. (June 2004), "Transactional Memory Coherence and Consistency"

All transactions, all the time! Code is partitioned into transactions by the programmer or by tools.
– Possibly at run time, for legacy code!

All writes are buffered in caches; CPUs arbitrate system-wide for which one gets to commit.

Updates are broadcast to all CPUs. CPUs detect conflicts with their own transactions and abort.


TCC Implementation

[Diagram: each node has a CPU core issuing loads and stores to its local cache hierarchy (lines tagged with read/modified/valid bits), a write buffer holding the transaction's stores, and commit control logic; nodes are connected by a broadcast bus or network over which commits are broadcast and snooped by the other nodes.]


Conclusions

Unbounded, scalable, and efficient Transactional Memory systems can be built.

– Support large, frequent, and concurrent transactions
– Allow programmers to (finally!) use our parallel systems!

Three architectures:
– LTM: easy to realize, almost unbounded
– UTM: truly unbounded
– TCC: high performance


THE END…

Royi Maimon

Merav Havuv