Copyright © 2006, CS 612
Transactional Memory: Architectural Support for a Lock-Free Data Structure
Some material borrowed from:
Konrad Lai, Microprocessor Technology Labs, Intel
Intel Multicore University Research Conference
Dec 8, 2005
By: Major Bhadauria
Motivation
- Multiple cores face a serious programmability problem
  - Writing correct parallel programs is very difficult
- Transactional Memory addresses a key part of the problem
  - Makes parallel programming easier by simplifying coordination
  - Requires hardware support for performance
  - Implements lock-free data structures
Benefits of Going Lock-Free
Allows us to avoid:
- Priority inversion: a lower-priority process is preempted while holding a lock needed by a higher-priority process.
- Convoying: a process holding a lock is de-scheduled, possibly stopping other processes from progressing needlessly.
- Deadlock: the common problem of processes A and B both needing to lock C and D.
  - Process A has the lock for C.
  - Process B has the lock for D.
  - Neither can acquire its second lock: DEADLOCK.
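The deadlock case above can be reproduced in a few lines. The following is a minimal sketch, not part of the original slides: the function name `run_opposite_order`, the barriers, and the timeout are all assumptions added so that the demonstration terminates and reports what happened instead of hanging.

```python
import threading

def run_opposite_order(timeout=0.2):
    """Process A takes C then D; process B takes D then C. Report who got both."""
    lock_c, lock_d = threading.Lock(), threading.Lock()
    both_hold_first = threading.Barrier(2)  # both threads now hold one lock each
    both_tried = threading.Barrier(2)       # nobody releases until both attempts end
    got_second = {}

    def worker(name, first, second):
        with first:
            both_hold_first.wait()
            # A plain blocking acquire would wait forever here. The timeout
            # exists only so this demo can finish and report the outcome.
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            both_tried.wait()
            if acquired:
                second.release()

    a = threading.Thread(target=worker, args=("A", lock_c, lock_d))
    b = threading.Thread(target=worker, args=("B", lock_d, lock_c))
    a.start()
    b.start()
    a.join()
    b.join()
    return got_second
```

Calling `run_opposite_order()` reports that neither A nor B obtained its second lock, which is exactly the situation the slide labels DEADLOCK.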
Problem: Lock-Based Synchronization
Lock-based synchronization of shared data access is fundamentally problematic
- Software engineering problems
  - Lock-based programs do not compose
  - Performance and correctness tightly coupled
  - Timing-dependent errors are difficult to find and debug
- Performance problems
  - High performance requires finer-grain locking
  - More and more locks add more overhead
Need a better concurrency model for multi-core software
What is Transactional Memory (TM)?
Transactional Memory (TM) allows arbitrary multiple memory locations to be updated atomically and serially.
Basic mechanisms:
- Isolation: track reads and writes, detect when conflicts occur
- Version management: record new/old values
- Atomicity: commit new values or abort back to old values

Thread 1:
    begin_xaction
    A = A - 20
    B = B + 20
    A = A - B
    C = C + 20
    end_xaction

Thread 2:
    begin_xaction
    C = C - 30
    A = A + 30
    end_xaction

- Thread 1’s accesses and updates to A, B, C are atomic
- Thread 2 sees either “all” or “none” of Thread 1’s updates
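To make the three mechanisms concrete, here is a minimal software sketch that runs the two transactions above on real threads. It is an illustration under stated assumptions, not the hardware scheme discussed later: `TVar`, `Transaction`, `atomically`, and `run_slide_example` are hypothetical names, conflicts are detected with per-location version counters, and a single global lock protects commits.

```python
import threading

class Conflict(Exception):
    """Raised when a transaction saw data that another transaction changed."""

class TVar:
    """A shared location with a version counter (hypothetical helper)."""
    def __init__(self, value):
        self.value = value
        self.version = 0

_commit_lock = threading.Lock()  # assumption: one global lock guards commits

class Transaction:
    def __init__(self):
        self.read_versions = {}  # isolation: version of each TVar we read
        self.write_buffer = {}   # version management: new values stay private

    def read(self, tvar):
        if tvar in self.write_buffer:
            return self.write_buffer[tvar]
        with _commit_lock:
            value, version = tvar.value, tvar.version
        if self.read_versions.setdefault(tvar, version) != version:
            raise Conflict()
        return value

    def write(self, tvar, value):
        self.write_buffer[tvar] = value

    def commit(self):
        # Atomicity: either every buffered write is published, or none is.
        with _commit_lock:
            for tvar, version in self.read_versions.items():
                if tvar.version != version:
                    raise Conflict()
            for tvar, value in self.write_buffer.items():
                tvar.value = value
                tvar.version += 1

def atomically(body):
    """Run body(tx) until it commits; an aborted attempt leaves no trace."""
    while True:
        tx = Transaction()
        try:
            result = body(tx)
            tx.commit()
            return result
        except Conflict:
            continue

def run_slide_example(a0=100, b0=50, c0=200):
    A, B, C = TVar(a0), TVar(b0), TVar(c0)

    def thread1(tx):
        tx.write(A, tx.read(A) - 20)
        tx.write(B, tx.read(B) + 20)
        tx.write(A, tx.read(A) - tx.read(B))
        tx.write(C, tx.read(C) + 20)

    def thread2(tx):
        tx.write(C, tx.read(C) - 30)
        tx.write(A, tx.read(A) + 30)

    threads = [threading.Thread(target=atomically, args=(body,))
               for body in (thread1, thread2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return A.value, B.value, C.value
```

With the assumed starting values (100, 50, 200), both serial orders of the two transactions give the same final state, so any interleaving that commits must end at A = 40, B = 70, C = 190. A result such as A = 10 would mean Thread 2 had observed only part of Thread 1's updates.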
Transactional Memory benefits
- Focus: multithreaded programmability crisis
- Programmability & performance
  - Allows conservative synchronization
  - Programmer focuses on parallelism & correctness, HW extracts performance
- Software engineering and composability
  - Allows library re-use and composition (locks make this very difficult)
  - Critical for wide multi-core demand
- Makes high performance MT programming easier
  - Captures a fundamental, well-known, intuitive “atomic” construct
    - Been around for decades
  - Similar to a “critical section” but without its problems
    - No deadlocks, priority inversion, data races, unnecessary serialization
Software Transactional Memory
- Software Transactional Memory (1995 until now)
  - Significant work from Sun, Brown, Cambridge, Microsoft
- Serious performance limitations
  - Degrades the “common” case of no conflicts/contention
    - >90% of transactions have no conflicts
    - 90% of critical sections are uncontended: what if all of these slowed down by 5X?
- Serious deployability limitations
  - Relies on special runtime support
  - Invasive to applications and libraries
- Is STM too slow and too invasive to deploy?
  - Could there be a better implementation?
- But great for understanding complex usage models of the future
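The "slowed down by 5X" question can be worked through with simple arithmetic. This sketch assumes, purely for illustration, that all critical sections take equal time and that the contended ones are neither helped nor hurt; the function name is hypothetical.

```python
def critical_section_slowdown(uncontended_fraction, stm_penalty):
    """Relative time in critical sections if only the uncontended ones get slower."""
    contended_fraction = 1.0 - uncontended_fraction
    return uncontended_fraction * stm_penalty + contended_fraction * 1.0
```

Under those assumptions, `critical_section_slowdown(0.9, 5.0)` is about 4.6, so the common case dominates: time spent in critical sections grows more than fourfold even though no conflict ever occurred in 90% of them.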
Herlihy & Moss’ Implementation
- Minimum implementation builds on previous LL/SC structure:
  - Loading a shared value to read
  - Writing to a shared value
  - Commit/Abort functions to commit or flush new values
- Extensions for performance:
  - Store copy of old value in cache: reduces bus traffic, latency
  - Store committed data in cache: reduces bus traffic
  - Have a read value that you can later change: reduces bus traffic
  - Have Validate function to check current status: reduces number of orphan transactions
  - Busy bus signal for transactions: reduces number of aborts
Herlihy & Moss’ Implementation
- Transactional Memory primitives:
  - Load-transactional (LT): reads the value of a shared memory location into a private register.
  - Load-transactional-exclusive (LTX): similar to LT, but indicates that the location is likely to be updated.
  - Store-transactional (ST): tentatively writes a value to a shared memory location; the value is not visible until a successful commit.
- TM instructions: commit, abort, validate.
  - Commit makes tentative changes permanent if the transaction’s read set (locations referenced by LT) has not changed, and no other process has read any location within the write set (locations referenced by LTX and ST).
  - Abort discards updates to the write set.
  - Validate tests the current transaction status: true means continue, false means the transaction has aborted.
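The primitives above can be exercised in a small simulation. This is a sketch under loud assumptions, not the hardware design from the paper: `TMSystem`, `Processor`, and `increment` are hypothetical names, a Python lock stands in for atomic bus and coherence actions, and on a conflict the requesting processor always wins while the other transaction is marked aborted.

```python
import threading

class TMSystem:
    """Toy shared memory; the lock stands in for atomic bus/coherence actions."""
    def __init__(self, size):
        self.memory = [0] * size
        self.guard = threading.Lock()
        self.processors = []

    def processor(self):
        proc = Processor(self)
        with self.guard:
            self.processors.append(proc)
        return proc

class Processor:
    def __init__(self, system):
        self.system = system
        self._reset()

    def _reset(self):
        self.read_set = set()  # locations referenced by LT
        self.write_set = {}    # locations referenced by LTX/ST -> tentative value
        self.aborted = False

    def _conflict(self, addr, exclusive):
        # Assumed policy: the requester wins, the other transaction aborts.
        for other in self.system.processors:
            if other is self:
                continue
            if addr in other.write_set or (exclusive and addr in other.read_set):
                other.aborted = True

    def lt(self, addr):
        with self.system.guard:
            if self.aborted:
                return 0  # an aborted transaction may see an arbitrary value
            self._conflict(addr, exclusive=False)
            self.read_set.add(addr)
            return self.write_set.get(addr, self.system.memory[addr])

    def ltx(self, addr):
        with self.system.guard:
            if self.aborted:
                return 0
            self._conflict(addr, exclusive=True)
            return self.write_set.setdefault(addr, self.system.memory[addr])

    def st(self, addr, value):
        with self.system.guard:
            if self.aborted:
                return
            self._conflict(addr, exclusive=True)
            self.write_set[addr] = value  # tentative: memory is untouched

    def validate(self):
        return not self.aborted

    def abort(self):
        with self.system.guard:
            self._reset()

    def commit(self):
        with self.system.guard:
            ok = not self.aborted
            if ok:
                for addr, value in self.write_set.items():
                    self.system.memory[addr] = value
            self._reset()
            return ok

def increment(proc, addr):
    """Retry loop for a lock-free counter update built from LTX, ST, commit."""
    while True:
        value = proc.ltx(addr)
        proc.st(addr, value + 1)
        if proc.commit():
            return value + 1
```

A hand-scheduled interleaving shows the rules at work: if processor 1 reads location 0 with `lt`, and processor 2 then runs `increment` on the same location, processor 1's `validate()` returns false and its `commit()` fails, so it must retry.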
Example
[Figure: example slide; image not preserved in this transcript]
Herlihy & Moss’ Implementation
Hardware Modifications (somewhat)
[Figure: OLD vs. NEW hardware organization]
Hardware Support for Performance
Architectural memory state: A.sum = 100, B.sum = 200

Core 1:
    begin_xaction
    A.withdraw(20)
    B.deposit(20)
    end_xaction

Core 2:
    begin_xaction
    Sum = A.sum + B.sum
    end_xaction

1. Record recovery state
2. Buffer updates/track accesses
3. Commit if no external access (discard all updates if conflict)

- Coherence protocol for conflicts: abort & restart
- After Core 1 commits: A.sum = 80, B.sum = 220
- Core 2 sees $300, never $280 or $320
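The three numbered steps can be traced with a hand-scheduled simulation of the two cores. This is only a sketch: the `Core` class, its `snoop` method, and the choice that the core holding buffered updates is the one that aborts are assumptions made for illustration, and the interleaving is scripted rather than produced by real hardware.

```python
class Core:
    def __init__(self, memory):
        self.memory = memory
        self.registers = {}
        self.in_tx = False

    def begin(self):
        self.checkpoint = dict(self.registers)  # 1. record recovery state
        self.buffer = {}                        # 2. buffer updates
        self.accessed = set()                   #    and track accesses
        self.conflict = False
        self.in_tx = True

    def load(self, key):
        self.accessed.add(key)
        return self.buffer.get(key, self.memory[key])

    def store(self, key, value):
        self.accessed.add(key)
        self.buffer[key] = value

    def snoop(self, key, is_write):
        """A coherence request from another core for `key` (assumed policy)."""
        if self.in_tx and (key in self.buffer or (is_write and key in self.accessed)):
            self.conflict = True

    def end(self):
        # 3. commit if no external access, otherwise discard all updates
        self.in_tx = False
        if self.conflict:
            self.registers = self.checkpoint
            return False
        self.memory.update(self.buffer)
        return True

def run_bank_example():
    memory = {"A.sum": 100, "B.sum": 200}
    core1, core2 = Core(memory), Core(memory)
    observed = []

    def core2_reads_sum():
        core2.begin()
        total = 0
        for key in ("A.sum", "B.sum"):
            core1.snoop(key, is_write=False)  # coherence traffic reaches Core 1
            total += core2.load(key)
        core2.end()
        observed.append(total)

    attempts = 0
    while True:
        attempts += 1
        core1.begin()
        core1.store("A.sum", core1.load("A.sum") - 20)  # A.withdraw(20)
        if attempts == 1:
            core2_reads_sum()  # interleave on the first try only
        core1.store("B.sum", core1.load("B.sum") + 20)  # B.deposit(20)
        if core1.end():
            break  # committed
    core2_reads_sum()
    return observed, attempts, memory
```

In this scripted run Core 2 reads in the middle of Core 1's first attempt and still sees 300, because the withdrawal is only buffered. Core 1 detects the external access, discards its updates, and commits on the second attempt, after which Core 2 again sees 300.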
Herlihy & Moss’ Benchmarks
Compared against:
- Test-and-test-and-set (TTS) w/ exponential backoff
- Software queuing (QOSB mechanism)
- LOAD_LINKED/STORE_COND (LL/SC)
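For readers unfamiliar with the first baseline, here is a minimal sketch of a test-and-test-and-set lock with exponential backoff. It is a simulation, not the benchmark code: Python has no raw test-and-set instruction, so a non-blocking `threading.Lock.acquire` plays that role, and the class name and delay parameters are assumptions.

```python
import random
import threading
import time

class TTSLock:
    """Test-and-test-and-set spin lock with exponential backoff (simulation)."""
    def __init__(self, min_delay=1e-6, max_delay=1e-3):
        # acquire(blocking=False) stands in for an atomic test-and-set.
        self._flag = threading.Lock()
        self.min_delay = min_delay
        self.max_delay = max_delay

    def acquire(self):
        delay = self.min_delay
        while True:
            while self._flag.locked():  # "test": spin on a plain read
                time.sleep(0)           # yield so the holder can run
            if self._flag.acquire(blocking=False):  # "test-and-set"
                return
            # Lost the race to another thread: wait a random time, then
            # double the bound so contending threads spread out.
            time.sleep(random.uniform(0, delay))
            delay = min(2 * delay, self.max_delay)

    def release(self):
        self._flag.release()

def count_with_lock(n_threads=4, increments=500):
    lock, counter = TTSLock(), [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            counter[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

The inner read-only spin is what separates TTS from plain test-and-set: waiters only attempt the expensive atomic operation once the lock looks free.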
Herlihy & Moss’ Test 1 Results
[Figure: Test 1 results chart; image not preserved in this transcript]
Herlihy & Moss’ Implementation
Limitations
- Transaction size is implementation-dependent
  - Must finish in a single scheduling quantum
  - Locations accessed cannot exceed an architecturally specified limit
  - Short duration, small data sets
- Requires a cache coherence model that’s sequentially consistent (otherwise may need fences)
- Nested transactions are problematic
Hence the programmer must be aware of the hardware
What if HW is not sufficient?
- This is the deployability challenge
  - The missing piece of the puzzle for all prior work…
- Resource limitations are fundamental
  - Space: caches
    - More HW delays the inevitable: there will always be an n+1 case
  - Time: scheduling quanta
    - Programmers have no control over time
- Affects functionality, not just performance
  - Some transactions may never complete
- Making the HW limit explicit is difficult
  - Limited usage only
  - Unreasonable for high level languages
  - How do you architect it in an evolvable manner?
Virtualizing Transactional Memory
- Hides hardware limitations like virtual memory hides physical memory limitations
- 2 modes
  - 1st for the common case, which is built into HW
  - 2nd for buffer overflow, page faults, context switches, or thread migration, which is built using SW and HW data structures
- Virtualization allows transactions to be suspended, to migrate, or to overflow state from local buffers, and allows nested transactions
- Extra challenge: unlike normal TM, it can’t use only cache coherence mechanisms, since conflicts must be detected between active transactions and transactions whose state has partially or completely overflowed to virtual memory
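The two-mode idea and the extra conflict-detection challenge can be illustrated with a toy model. This sketch does not reproduce VTM's actual data structures: `VirtualizedTransaction`, `hw_capacity`, and the two dictionaries are hypothetical stand-ins for a bounded hardware buffer and an overflow area in virtual memory.

```python
class VirtualizedTransaction:
    def __init__(self, hw_capacity):
        self.hw_capacity = hw_capacity
        self.hw_buffer = {}  # fast mode: stands in for cache-resident state
        self.overflow = {}   # slow mode: stands in for a table in virtual memory

    def write(self, addr, value):
        if addr in self.overflow:
            self.overflow[addr] = value
        elif addr in self.hw_buffer or len(self.hw_buffer) < self.hw_capacity:
            self.hw_buffer[addr] = value
        else:
            self.overflow[addr] = value  # buffer full: spill instead of aborting

    def read(self, addr, memory):
        for store in (self.hw_buffer, self.overflow):
            if addr in store:
                return store[addr]
        return memory[addr]

    def touches(self, addr):
        """Conflict check: looking only at the 'cache' would miss spilled state."""
        return addr in self.hw_buffer or addr in self.overflow

    def commit(self, memory):
        memory.update(self.overflow)
        memory.update(self.hw_buffer)
        self.hw_buffer.clear()
        self.overflow.clear()
```

With `hw_capacity=2`, a transaction that writes four locations keeps two in the fast buffer and spills two. Another transaction asking `touches(addr)` for a spilled address must still get a positive answer, which is why cache coherence alone is not enough.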
Virtualize for Completeness
[Diagram: Core 1 with limited buffers, subject to timer interrupts, context switches, exceptions, …; a Virtual TM layer backed by log/buffer space in the virtual address space]
- Overflow management using virtual memory
  - Software libs. and microcode
  - Out-of-band concurrency control
- Programmer transparent
- Performance isolation
- Suspendable/swappable
Recent TM Research
- Recently, focus on solving the harder problem of TM
  - Making the model immune to cache buffer size limitations, scheduling limitations, etc.
- TCC (Stanford) (2004)
  - Same limitation as Herlihy/Moss for TM (size limited to local caches)
- LTM (MIT), VTM (Intel), LogTM (Wisconsin) (2005)
  - Assume hardware TM support
  - Add support to allow transactions to be immune to resource limitations
  - Goals of each similar, approaches very different
    - LTM: only resource overflow
    - VTM: complete virtualization
    - LogTM: only resource overflow, AND optimize for COMMITs
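One way to see what "optimize for COMMITs" can mean is an undo log: new values are written in place and old values are saved in a log, so committing is cheap and only aborting has to do work. The sketch below illustrates that idea in the spirit of LogTM's log-based versioning; the class name is hypothetical and conflict detection is left out entirely.

```python
class UndoLogTransaction:
    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []  # (address, old value) pairs in write order

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory[addr]))  # save the old value
        self.memory[addr] = value                        # new value goes in place

    def commit(self):
        # Fast path: memory already holds the new values, so just drop the log.
        self.undo_log.clear()

    def abort(self):
        # Slow path: restore old values in reverse order of the writes.
        while self.undo_log:
            addr, old = self.undo_log.pop()
            self.memory[addr] = old
```

Contrast this with the buffering approach earlier in the deck, where commit has to publish buffered values and abort simply discards them. Which case is made cheap is the design choice.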
Some Research Challenges
- Large transactions
- Language extensions
- IO, loopholes, escape hatches, …
- Interaction and co-existence with
  - Other synchronization schemes: locks, flags, …
  - Other transactions
    - Database transaction
    - System transaction (Microsoft)
  - Other libraries, system software, operating system, …
- Performance monitor, tuning, debugging, …
- Open vs closed nesting
- Interaction between transaction & non-transaction
- Usage & workload
  - PLDI workshop
TM – First Decade
- IBM 801 Database Storage (1980s)
  - Lock attribute bits on virtual memory (via TLBs, PTEs) at 128 byte granularity
- Load-Linked/Store Conditional (Jensen et al. 1987)
  - Optimistic update of single cache line (Alpha, MIPS, PowerPC)
- Transactional Memory (Herlihy & Moss 1993)
  - Coined term; TM generalization of LL/SC
  - Instructions explicitly identify transactional loads and stores
  - Used dedicated transaction cache
    - Size of transactions limited to transaction cache
- Oklahoma Update (Stone et al./IBM 1993)
  - Similar to TM, concurrently proposed
  - Didn’t use cache but dedicated monitored registers to operate upon
References
1. “Transactional Memory: Architectural Support for Lock-Free Data Structures”, M. Herlihy and J. E. B. Moss, ISCA 1993
2. “Virtualizing Transactional Memory”, K. Lai et al., ISCA 2005
3. “LogTM: Log-based Transactional Memory”, Wood et al., HPCA-12