Transactional Locking

Transcript

  • Transactional Locking. Nir Shavit, Tel Aviv University. Joint work with Dave Dice and Ori Shalev.

  • Concurrent Programming: How do we make the programmer's life simple without slowing computation down to a halt?!

  • A FIFO Queue. Enqueue(d); Dequeue() => a. (Figure: a list of nodes with Head and Tail pointers.)

  • A Concurrent FIFO Queue: wrap each operation in synchronized{} on a single object lock.
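
A minimal sketch of the coarse-grained approach this slide describes (class and method names are mine, for illustration): every operation is guarded by the same object lock, so the code stays simple but all threads serialize.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Every operation takes the same monitor (the queue object itself), so the
// code stays simple but all threads serialize on one lock.
class CoarseGrainedQueue<T> {
    private final Deque<T> items = new ArrayDeque<>();

    public synchronized void enqueue(T x) {   // one global object lock
        items.addLast(x);
    }

    public synchronized T dequeue() {         // same lock: simple, but no parallelism
        return items.pollFirst();             // returns null if the queue is empty
    }
}
```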

  • Fine-Grained Locks. P: Dequeue() => a; Q: Enqueue(d). Better performance, more complex code. Worry about deadlock, livelock.

  • Lock-Free (JSR-166). P: Dequeue() => a; Q: Enqueue(d). Even better performance, even more complex code. Worry about deadlock, livelock, subtle bugs; hard to modify.
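
For reference, a short usage example of the lock-free queue that ships with JSR-166 (java.util.concurrent.ConcurrentLinkedQueue, based on the Michael-Scott algorithm); the slide's point is that implementing such a structure, not using it, is where the complexity lives.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class LockFreeQueueDemo {
    public static void main(String[] args) {
        // JSR-166's non-blocking FIFO queue; operations use CAS, no locks are held.
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        queue.offer("a");                 // Enqueue
        queue.offer("d");                 // Enqueue(d)
        System.out.println(queue.poll()); // Dequeue() => a
    }
}
```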

  • Transactional Memory [Herlihy-Moss]. P: Dequeue() => a; Q: Enqueue(d). Don't worry about deadlock, livelock, subtle bugs, etc. Great performance, simple code.

  • TM: How Does It Work? Execute all synchronized{} blocks as atomic transactions: the simplicity of a global lock with the granularity of a fine-grained implementation.
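
A sketch of the control flow this implies, with invented names (AtomicBlock, Transaction, TxBody) that are not TL2's published API: the body of what was a synchronized{} block is run speculatively and retried until it commits as one atomic step.

```java
import java.util.function.Supplier;

// Invented skeleton, not TL2's actual API: shows the retry loop a
// transactifying compiler or library could emit in place of synchronized{}.
final class AtomicBlock {
    interface Transaction {
        boolean commit();   // validate what was read, publish what was written
        void abort();       // discard speculative state
    }

    interface TxBody { void run(Transaction tx); }

    static void atomic(Supplier<Transaction> begin, TxBody body) {
        while (true) {                      // retry until a clean commit
            Transaction tx = begin.get();   // e.g. sample the version clock here
            body.run(tx);                   // speculative execution of the block
            if (tx.commit()) {
                return;                     // effects became visible atomically
            }
            tx.abort();                     // conflict detected: roll back, retry
        }
    }
}
```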

  • Hardware TM [Herlihy-Moss]. Limitations of atomic{}:

    Machines will differ in their support.

    When we build 1000-instruction transactions, it will not be for free.

  • Software Transactional Memory: implement transactions in software. All the flexibility of hardware, today; the ability to extend hardware when it becomes available (Hybrid TM). But there are problems: Performance? Ease of programming (software engineering)? Mechanical code transformation?

  • The Brief History of STM: lock-free, obstruction-free, lock-based.

  • As Good As Fine-Grained. Postulate (i.e. take it or leave it):

    If we could implement fine-grained locking with the same simplicity as coarse-grained, we would never think of building a transactional memory.

    Implication:

    Let's try to provide TMs that get as close as possible to hand-crafted fine-grained locking.

  • Premise of Lock-Based STMs:

    1. Memory lifecycle: work with GC or any malloc/free.
    2. Transactification: allow mechanical transformation of sequential code.
    3. Performance: match fine-grained locking.
    4. Safety: work on a coherent state.

    Unfortunately, Hybrid, Ennals, Saha, and AtomJava deliver only 2 and 3 (in some cases).

  • Transactional Locking: TL2 delivers all four properties. How? Unlike all prior algorithms: use commit-time locking instead of encounter-order locking, and introduce a version-clock mechanism for validation.

  • TL Design Choices: map application memory to an array of versioned write-locks. PS = lock per stripe (a separate array of locks); PO = lock per object (embedded in the object).
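
A sketch of the PS ("lock per stripe") choice under stated assumptions: the word layout (low bit = lock flag, upper bits = version number) and the stripe count are illustrative, not TL2's actual constants.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// A separate array of versioned write-locks; application memory is hashed
// onto stripes. Layout and sizes here are illustrative only.
final class StripedVersionedLocks {
    private static final int STRIPES = 1 << 20;               // illustrative stripe count
    private final AtomicLongArray locks = new AtomicLongArray(STRIPES);

    int stripeOf(Object location) {                           // map memory -> stripe
        return System.identityHashCode(location) & (STRIPES - 1);
    }

    static boolean isLocked(long word) { return (word & 1L) != 0; }
    static long versionOf(long word)   { return word >>> 1; }

    long lockWord(int stripe) { return locks.get(stripe); }

    // Commit-time acquisition: CAS the lock bit in, failing if already held.
    boolean tryLock(int stripe, long observedWord) {
        return !isLocked(observedWord)
            && locks.compareAndSet(stripe, observedWord, observedWord | 1L);
    }

    // Release, installing the new version number (v#+1 in TL, the commit
    // version in TL2).
    void unlock(int stripe, long newVersion) {
        locks.set(stripe, newVersion << 1);                   // lock bit cleared
    }
}
```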

  • Encounter-Order Locking (Undo Log). To read: load lock + location; check unlocked; add to read-set. To write: lock the location, store the value; add the old value to the undo-set. At commit: validate that read-set version numbers are unchanged; release each lock with v#+1. Allows a quick read of values freshly written by the reading transaction. [Ennals, Hybrid, Saha, Harris, ...]
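
A small sketch of the undo-log idea (illustrative names; lock handling and read-set validation are omitted): writes go straight to shared memory, with the old value saved so an abort can roll the location back. Contrast it with the write buffer shown after the next slide.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Encounter-order writes update memory in place; the old value is logged so
// an abort can undo it. Locking and validation are omitted for brevity.
final class UndoLog {
    private record Entry(Object[] location, int index, Object oldValue) {}
    private final Deque<Entry> undo = new ArrayDeque<>();

    void write(Object[] location, int index, Object newValue) {
        undo.push(new Entry(location, index, location[index])); // remember old value
        location[index] = newValue;                             // shared memory is dirty until commit
    }

    void rollback() {                                           // on abort: restore in reverse order
        while (!undo.isEmpty()) {
            Entry e = undo.pop();
            e.location()[e.index()] = e.oldValue();
        }
    }
}
```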

  • Commit-Time Locking (Write Buffer). To read: load lock + location; is the location in the write-set? (Bloom filter); check unlocked; add to read-set. To write: add the value to the write-set. At commit: acquire locks; validate that read/write version numbers are unchanged; release each lock with v#+1. Locks are held for a very short duration. [TL, TL2]
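
A sketch of the write-buffer side (illustrative names; a single 64-bit Bloom filter stands in for whatever filter the real implementation uses): writes are deferred to a redo log, and reads consult the filter so a transaction still sees its own pending writes.

```java
import java.util.HashMap;
import java.util.Map;

// Commit-time (redo log) writes: shared memory is untouched until commit, and
// a tiny Bloom filter makes the "is this location in my write-set?" check on
// every read cheap in the common miss case.
final class WriteBuffer {
    private final Map<Object, Object> redoLog = new HashMap<>(); // location -> new value
    private long bloom;                                          // 64-bit Bloom filter

    private static long bit(Object location) {
        return 1L << (System.identityHashCode(location) & 63);
    }

    void bufferWrite(Object location, Object value) {
        redoLog.put(location, value);          // deferred until the locks are held at commit
        bloom |= bit(location);
    }

    Object read(Object location, Object sharedValue) {
        if ((bloom & bit(location)) != 0) {            // possibly in the write-set
            Object pending = redoLog.get(location);    // exact lookup only on a filter hit
            if (pending != null) return pending;       // see our own fresh write
        }
        return sharedValue;                            // otherwise read shared memory
    }
}
```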

  • Why COM and not ENC? Under low load they perform pretty much the same. COM withstands high loads (small structures or high write %); ENC does not. COM works seamlessly with malloc/free; ENC does not.

  • COM vs. ENC, High Load: red-black tree, 20% delete, 20% update, 60% lookup.

  • COM vs. ENC, Low Load: red-black tree, 5% delete, 5% update, 90% lookup.

  • Obnoxious Statement About Benchmarking: pick sequential algorithms and show how they do when parallelized. Please, no more SPECjbb or SPLASH. Compare to other STMs and to hand-crafted fine-grained implementations.

  • COM: Works with Malloc/Free. To free B from transactional space: wait until its lock is free, then Free(B). (Figure: PS lock array over objects A and B.)

    B is never written inconsistently, because any write is preceded by a validation while holding the lock, and that validation fails if the state is inconsistent.

  • ENC: Fails with Malloc/Free. B cannot be freed from transactional space, because the undo-log means locations are written after every lock acquisition and before validation. (Figure: PS lock array over objects A and B.)

    Possible solution: validate after every lock acquisition (yuck).

  • Problem: Application Safety. All current lock-based STMs work on inconsistent states. They must introduce validation into user code at fixed intervals or loops, use traps, OS support, and still there are cases, however rare, where an error could occur in user code.

  • Solution: TL2's Version Clock. Have one shared global version clock: it is incremented by (a small subset of) writing transactions, read by all transactions, and used to validate that the state being worked on is always consistent.

    Later: how we learned not to worry about contention and love the clock.

  • Version Clock: Read-Only COM Transaction. RV := read VClock at start. On read: read the lock, read memory, re-read the lock; check unlocked, unchanged, and v# <= RV.
  • Version Clock: Writing COM Transaction. RV := read VClock at start. On read/write: check unlocked and v# <= RV.
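
A compact sketch of how the global version clock drives validation in the two cases above (helper names are mine; the lock-word layout matches the striped-lock sketch earlier): a read-only transaction needs no read-set at all, and a writing transaction fetches a new write version at commit.

```java
import java.util.concurrent.atomic.AtomicLong;

// Global version clock, as in the slides: read by everyone, bumped by
// writing transactions. Helper names and layout are illustrative.
final class VersionClockSketch {
    static final AtomicLong vclock = new AtomicLong();

    // Sample RV at transaction start (read-only and writing transactions alike).
    static long begin() {
        return vclock.get();
    }

    // Per-read check: location unlocked, lock word unchanged across the read,
    // and its version no newer than RV; otherwise abort and retry.
    static boolean readIsConsistent(long wordBefore, long wordAfter, long rv) {
        boolean unlocked  = (wordAfter & 1L) == 0;
        boolean unchanged = wordBefore == wordAfter;
        boolean notNewer  = (wordAfter >>> 1) <= rv;    // v# <= RV
        return unlocked && unchanged && notNewer;
    }

    // Writing transaction, at commit (after its write locks are acquired):
    // take a new write version, re-validate the read-set against RV, write
    // back the buffered values, then release each lock stamped with the new version.
    static long acquireWriteVersion() {
        return vclock.incrementAndGet();
    }
}
```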
  • Version Clock Implementation. On a system-on-chip like the Sun T200 Niagara: virtually no contention, just CAS and be happy. On others: add the TID to the VClock; if the VClock has changed since the thread's last write, it can use the new value + TID. This reduces contention by a factor of N. Future: a coherent hardware VClock that guarantees a unique tick per access.

  • Performance Benchmarks. A mechanically transformed sequential red-black tree using TL2, compared to other STMs and to a hand-crafted fine-grained red-black implementation, on a 16-way Sun Fire running Solaris 10.

  • Uncontended Large Red-Black Tree: 5% delete, 5% update, 90% lookup. (Chart series: hand-crafted, TL/PS, TL2/PS, TL/PO, TL2/PO, Ennals, Fraser, Harris, lock-free.)

  • Uncontended Small RB-Tree: 5% delete, 5% update, 90% lookup. (Chart series: TL/PO, TL2/PO.)

  • Contended Small RB-Tree: 30% delete, 30% update, 40% lookup. (Chart series: Ennals, TL/PO, TL2/PO.)

  • Speedup: Normalized Throughput. Large RB-tree: 5% delete, 5% update, 90% lookup. (Chart series: hand-crafted, TL/PO.)

  • Overhead, Overhead, Overhead. STM scalability is as good as, if not better than, hand-crafted locking, but the overheads are much higher. Overhead is the dominant performance factor, which bodes well for HTM. Read-set and validation cost (not locking cost) dominates performance.

  • On the Sun T200 (Niagara): maybe a long way to go. RB-tree: 5% delete, 5% update, 90% lookup. (Chart series: hand-crafted, STMs.)

  • Detail of the RB-Tree, STMs Only. RB-tree: 5% delete, 5% update, 90% lookup.

  • Conclusions. Commit-time (COM) locking, implemented efficiently, has clear advantages over encounter-order (ENC) locking: no meltdown under contention, and seamless operation with malloc/free. The version counter can guarantee safety, so we don't need to embed repeated validation in user code.

  • What Next? Further improve performance. Make the TL1 and TL2 library available. A mechanical code-transformation tool. Cut read-set and validation overhead, maybe with hardware support? Add a hardware VClock to the system-on-chip.

  • Thank You

    One can build longer transactions in hardware, but all this has a cost. We have a long way to go, but in some cases we are starting to do well. Holding locks for long exposes COM to contention. An order of magnitude faster than an MCS lock. Hanke is 10-12 times faster than a lock. Niagara delivers 3 times the throughput of the Sun Fire with half the processors and costs a factor of 10 less. This is only a 1st-generation machine, so the number of instructions on the computation path is dominant and