Implementing, Simulating and Extending Hardware Transactional Memory

PhD Status Talk (Statusvortrag)

stephan.diestelhorst@gmail.com

Supervisor (Betreuer): Prof. Christof Fetzer
Subject referee (Fachreferent): Prof. Hermann Härtig

Executive Summary

• Transactional Memory is a great tool for fast parallel code
• Expensive HW implementation -> restrict features, reuse components
• Keep OS-invisible to aid adoption
• Tight relationship between implementation details and provided features

=> One can still innovate in this constrained space, and I do.

2013-03-27 Stephan Diestelhorst - HTM

Processing Trends

Figure: Jeff Preshing, "A Look Back at Single-Threaded CPU Performance", Feb 2012. http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance

Processing Trends

Figure: Chuck Moore, "Data Processing in Exascale-Class Computer Systems", The Salishan Conference on High Speed Computing, Apr 2011. http://www.lanl.gov/orgs/hpc/salishan/salishan2011/3moore.pdf

Processing Trends

• Single-thread performance has stalled since 2004
• Multi-core CPUs are penetrating the market
  – Top-notch smartphones with four CPU cores
  – Enthusiast desktop machines with four to eight cores
• Not all problems are coarse-grained data parallel (so your cluster doesn't help) -> SMPs, synchronise

Synchronising SMPs

• Coarse locks
  – Size of system limits performance
  – Low contention data vs. high contention lock
• Fine-grained locking
  – Instruction overhead
  – Lock order
  – Composability
• Lock-free data structures
  – Overheads
  – Complexity
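The "lock order" cost of fine-grained locking can be illustrated with a minimal sketch. It assumes one lock per object and a global acquisition order by a numeric identifier; the `Account` class and `transfer` function are hypothetical names chosen for this illustration and are not part of the talk.

```python
import threading


class Account:
    """One fine-grained lock per object (illustrative only)."""

    def __init__(self, ident, balance):
        self.ident = ident
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move `amount` from src to dst without risking deadlock."""
    if src is dst:
        return  # nothing to do, and taking the same lock twice would hang
    # Always acquire locks in ascending ident order, regardless of the
    # transfer direction, so two opposite transfers cannot deadlock.
    first, second = sorted((src, dst), key=lambda acc: acc.ident)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount
```

Every caller has to know and respect the same ordering rule, which is one reason lock-based code composes poorly.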

Transactional Memory (1993)

• Speculative execution of transactions
• Tentative stores (data versioning)
• Monitor working set for concurrent conflicting accesses (conflict detection)
• Make tentative updates visible at once

Local to the core (point of coherence), fine-grained benefit w/o cost

Maurice Herlihy and J. Eliot B. Moss. "Transactional memory: Architectural support for lock-free data structures." Proceedings of the 20th Annual International Symposium on Computer Architecture (ISCA '93). ACM, 1993.

Example (figure): two concurrent transactions

Transaction 1: Begin; Txload [foo]; Txstore [bar]; Commit
Transaction 2: Begin; Txstore [foo]; Commit
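The four ingredients listed for transactional memory (speculation, tentative stores, working-set monitoring, atomic publication) can be modelled in a few lines. This is a software sketch under simplifying assumptions, not the hardware design from the 1993 paper: conflicts are checked only when a transaction commits, the committer always wins, and the class and method names are invented for the illustration.

```python
class Conflict(Exception):
    """Raised when a transaction notices that it has been aborted."""


class Memory:
    def __init__(self):
        self.data = {}     # committed values, visible to every core
        self.active = []   # transactions that are currently speculating


class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.read_set = set()   # addresses loaded so far (monitored)
        self.write_buf = {}     # tentative stores (data versioning)
        self.aborted = False
        mem.active.append(self)

    def _check(self):
        if self.aborted:
            raise Conflict("working set was modified by another transaction")

    def txload(self, addr):
        self._check()
        self.read_set.add(addr)
        if addr in self.write_buf:          # read our own tentative store
            return self.write_buf[addr]
        return self.mem.data.get(addr, 0)

    def txstore(self, addr, value):
        self._check()
        self.write_buf[addr] = value        # not yet visible to others

    def commit(self):
        self._check()
        written = set(self.write_buf)
        self.mem.active.remove(self)
        # Conflict detection: abort every other transaction whose working
        # set overlaps the addresses we are about to publish.
        for other in list(self.mem.active):
            if written & (other.read_set | set(other.write_buf)):
                other.aborted = True
                self.mem.active.remove(other)
        # Make all tentative updates visible at once.
        self.mem.data.update(self.write_buf)
```

Replaying the two transactions from the slide: if the second transaction stores to `foo` and commits after the first has loaded `foo`, the first transaction raises `Conflict` on its next access and none of its tentative stores become visible.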

Roadmap

• Introduction
• Related Work
  – HTM microarchitecture
    • Academia
    • Industry
  – HTM ISA
  – HTM++ / HTM--
  – Simulation
• Contributions

• IS:
  – Focussed on hardware / low-level software interaction
  – TM in coherent system
• IS NOT:
  – A full history of TM
  – Computer architecture course
  – Language integration of TM
  – Distributed TM

Roadmap (figure): system layers shown alongside the outline: Microarchitecture, RTL, Instruction Set, Libraries, Operating System, Applications

HTM Microarchitecture - Basics

• Conflict detection
  – Location, capacity
  – Eager / lazy
• Data versioning
  – Location, capacity
  – Make visible
• Integration with baseline microarchitecture

Example (figure):

Transaction 1: Begin; Txload [foo]; Txstore [bar]=5; Commit
Transaction 2: Begin; Txstore [foo]=7; Commit

Conflict detection state: Transaction 1: foo | txr, bar | txw; Transaction 2: foo | txw
Data versioning state: bar | 0 -> 5, foo | 0 -> 7
Timeline labels: Eager, Lazy
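The difference between eager and lazy conflict detection can be made concrete by replaying the example trace and reporting the event at which the conflict is first noticed. The sketch below is a simplified model: the trace format and the function name `first_conflict` are assumptions made for this illustration, eager detection checks every access against other live transactions, and lazy detection checks a transaction's write set only when it commits.

```python
def first_conflict(trace, eager):
    """Return the index of the event where a conflict is detected, or None.

    trace is a list of (transaction, operation, address) tuples with
    operation in {"load", "store", "commit"}.
    """
    reads, writes = {}, {}   # per-transaction read and write sets
    for i, (tx, op, addr) in enumerate(trace):
        r = reads.setdefault(tx, set())
        w = writes.setdefault(tx, set())
        others = [t for t in reads if t != tx]
        if op == "load":
            r.add(addr)
            # Eager: a load conflicts with another transaction's store.
            if eager and any(addr in writes[t] for t in others):
                return i
        elif op == "store":
            w.add(addr)
            # Eager: a store conflicts with another load or store.
            if eager and any(addr in (reads[t] | writes[t]) for t in others):
                return i
        elif op == "commit":
            # Lazy: compare the whole write set at commit time.
            if not eager and any(w & (reads[t] | writes[t]) for t in others):
                return i
            del reads[tx], writes[tx]
    return None


TRACE = [
    ("T1", "load", "foo"),
    ("T1", "store", "bar"),
    ("T2", "store", "foo"),
    ("T2", "commit", None),
    ("T1", "commit", None),
]
```

With this trace, eager detection fires at the store to `foo` (event 2), while lazy detection fires one event later, when the second transaction commits (event 3).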

HTM Microarchitecture - Academia

• 2004 – Hammond, et al. – Transactional Memory Coherence and Consistency (TCC)
• 2005 – Ananian, et al. – Unbounded Transactional Memory (UTM)
• 2006 – Shriraman, et al. – An Integrated Hardware-Software Approach to Flexible Transactional Memory (RTM)
• 2007 – Yen, et al. – LogTM-SE: Decoupling Hardware Transactional Memory From Caches

| | 1993 TM | 2004 TCC | 2005 UTM | 2006 RTM | 2007 LogTM-SE |
| CPU Core | Emulated | Emulated | ? | In-order, Simics / GEMS | OoO, Simics / GEMS |
| Coherence Pt | L1d | Local cache | ? | L1d | L1d |
| Conf. Det. | TX$, eager | Local cache, lazy | TX bits for all mem, eager | L1d, eager / lazy | Signatures, eager |
| #TX Reads | TX$ size | Local cache size | Infinite | L1d | Infinite |
| Buffering | TX$ | Write buffer | Mem | Inner $ | Mem |
| Undo | TX | Mem | Mem-log | Outer $ | Mem-log |
| #TX Writes | TX$ size | Write buffer size | Infinite | L1d | Infinite |
| Abort | Holder wins, async | First commit wins, ordered | Older wins | SW policy | Older wins, SW handler |
| Changes to baseline system | (L1d), TX$ | Replace coherency / consistency | Reg Rename, Visible Regs, LS, $, OS | Coherence states & protocol, L1d, visible regs, OS | Coherence protocol, directory, OS |
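LogTM-SE is listed as detecting conflicts with signatures rather than per-line cache bits. A signature is a fixed-size, hashed summary of a read or write set: membership tests never miss an inserted address but may report false positives, which show up as unnecessary conflicts. The sketch below is a generic Bloom-filter-style signature; the bit width, the number of hash functions and the use of SHA-256 are assumptions for illustration, not the hash functions evaluated in the LogTM-SE paper.

```python
import hashlib


class Signature:
    """Fixed-size summary of a set of addresses (Bloom-filter style)."""

    def __init__(self, bits=64, hashes=2):
        self.bits = bits
        self.hashes = hashes
        self.word = 0   # the signature register

    def _positions(self, addr):
        # Derive `hashes` bit positions from one digest (illustrative hash).
        digest = hashlib.sha256(str(addr).encode()).digest()
        return [
            int.from_bytes(digest[4 * k:4 * k + 4], "little") % self.bits
            for k in range(self.hashes)
        ]

    def insert(self, addr):
        for pos in self._positions(addr):
            self.word |= 1 << pos

    def may_contain(self, addr):
        # True for every inserted address; may also be True for others.
        return all((self.word >> pos) & 1 for pos in self._positions(addr))

    def clear(self):
        self.word = 0   # done on commit or abort
```

Because the signature has a fixed size, the read and write sets it summarises are unbounded ("Infinite" in the table), at the price of a false-conflict rate that grows as more addresses are inserted.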

HTM Academia Conclusions

• Many proposals with new "widgets"
• "Easy to implement"
• Evaluation: simple in-order core & more realistic memory models
• Preserve protocols? Preserve components?
• Verification cost! (absence of)

HTM Microarchitecture - Industry

| | 2009 Azul | 2009 Rock | 2011 BG/Q | 2012 zEC12 |
| CPU Core | In-order, 54 cores x 16 modules | Semi out-of-order, check-pointed, 2T, 16 cores | In-order, PowerPC A2, 4T, 16 cores | Out-of-order, 6 cores x 6 x 4 modules |
| Coherence Pt | L1d | L2 | Shared L2 | L3 |
| Conf. Det. | L1d, eager? | L1d R, L2 W, eager | L2, eager / lazy | L1d, eager |
| #TX Reads | 16 kB (L1d) | 32 kB (L1d) | 20 MB (L2) | 96 kB (L1d) |
| Buffering | L1d | Store buffer | L2 | Store buffer (WCC) |
| Undo | Outer cache | L2 | L2 (multi-vers) | L2 |
| #TX Writes | 16 kB (L1d) | 32 (StBuf) | 20 MB (L2) | 64 (x128 B, StB) |
| Abort | ? | Requester wins | SW handler | Req wins + NACK |
| Changes to baseline | L1d, core | L2, (new core) | L2, (bypass L1d), OS | L1d, StB, (OS) |

HTM Industry Conclusion

• Baseline core and cache microarchitectures differ vastly
  – Cache sizes, organisation
  – Coherency point
• TM implementation and performance depend heavily on the baseline
• Component reuse is key: adapt TM to baseline, not vice versa


HTM ISA

• General idea simple: TX.begin, TX.commit
• Decisions for aborts: visible [1] / invisible [2], sync / async [1]
• Register snapshotting: none (partial) [1], full [3], selective [2]
• Poke through HTM
  – No [4]
  – Default [1]
  – Special case [5]

[1] Maurice Herlihy and J. Eliot B. Moss. "Transactional memory: Architectural support for lock-free data structures." Proceedings of the 20th Annual International Symposium on Computer Architecture (ISCA '93). ACM, 1993.
[2] Jacobi, Christian, Timothy Slegel, and Dan Greiner. "Transactional Memory Architecture and Implementation for IBM System z." Proceedings of the 45th International Symposium on Microarchitecture (MICRO). 2012.
[3] Ananian, C. Scott, et al. "Unbounded transactional memory." High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on. IEEE, 2005.
[4] Intel Corp. "Intel® Architecture Instruction Set Extensions Programming Reference", chapter 8, 2012.
[5] Chaudhry, Shailender, et al. "Rock: A high-performance SPARC CMT processor." Micro, IEEE 29.2 (2009): 6-16.

HTM ISA

• OS interaction
  – Survive faults, interrupts [1]
  – Support syscalls [3,4]
  – OS-support required
• Strong / weak isolation
• Capacity [1] / progress guarantees [2]

[1] Ananian, C. Scott, et al. "Unbounded transactional memory." High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on. IEEE, 2005.
[2] Jacobi, Christian, Timothy Slegel, and Dan Greiner. "Transactional Memory Architecture and Implementation for IBM System z." Proceedings of the 45th International Symposium on Microarchitecture (MICRO). 2012.
[3] Moravan, Michelle J., et al. "Supporting nested transactional memory in LogTM." ACM Sigplan Notices. Vol. 41. No. 11. ACM, 2006.
[4] Ramadan, Hany E., et al. "MetaTM/TxLinux: transactional memory for an operating system." ACM SIGARCH Computer Architecture News 35.2 (2007): 92-103.


HTM-- Simpler Hardware

• Reduce HW cost by offering less than full transactions
• AOU [1]
• HASTM [2]
• Similar: register checkpoints

[1] Spear, Michael F., et al. "Alert-on-update: a communication aid for shared memory multiprocessors." Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming. ACM, 2007.
[2] Saha, Bratin, A-R. Adl-Tabatabai, and Quinn Jacobson. "Architectural support for software transactional memory." Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on. IEEE, 2006.
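Alert-on-update (AOU) offers less than a transaction: a thread marks a location and is notified when another thread writes to it. The sketch below models that idea in software under simplifying assumptions. The class `AouMemory`, its method names and the one-shot alert semantics are choices made for this illustration and do not reproduce the interface defined in the AOU paper.

```python
class AouMemory:
    """Toy memory in which loads can request an alert on the next update."""

    def __init__(self):
        self.data = {}
        self.watchers = {}   # address -> handlers waiting for an update

    def alert_load(self, addr, handler):
        """Load addr and register handler to run on the next store to it."""
        self.watchers.setdefault(addr, []).append(handler)
        return self.data.get(addr, 0)

    def store(self, addr, value):
        self.data[addr] = value
        # One-shot alert: marks are dropped once they have fired.
        for handler in self.watchers.pop(addr, []):
            handler(addr)


# Usage: a consumer learns about a producer's write without polling.
mem = AouMemory()
alerts = []
mem.alert_load("mailbox", alerts.append)
mem.store("mailbox", 42)   # triggers the handler
mem.store("mailbox", 43)   # no mark left, no alert
```

No data versioning or commit step is involved, which is why such a mechanism is cheaper in hardware than full transactions.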

HTM++ Wider Applicability

• Amortise HW cost by offering more features
• Escape actions [1], Suspend / resume [2], Open nesting [3]
• Hardware lock-elision [4]
• InvisiFence [5]

[1] Moravan, Michelle J., et al. "Supporting nested transactional memory in LogTM." ACM Sigplan Notices. Vol. 41. No. 11. ACM, 2006.
[2] Zilles, Craig, and Lee Baugh. "Extending hardware transactional memory to support non-busy waiting and non-transactional actions." TRANSACT: First ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing. 2006.
[3] Moss, J. Eliot B., and Antony L. Hosking. "Nested transactional memory: model and architecture sketches." Science of Computer Programming 63.2 (2006): 186-201.
[4] Rajwar, Ravi, and James R. Goodman. "Speculative lock elision: Enabling highly concurrent multithreaded execution." Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture. IEEE Computer Society, 2001.
[5] Blundell, Colin, Milo MK Martin, and Thomas F. Wenisch. "InvisiFence: performance-transparent memory ordering in conventional multiprocessors." ACM SIGARCH Computer Architecture News. Vol. 37. No. 3. ACM, 2009.
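The control flow of lock elision can be sketched as follows: run the critical section speculatively while the lock is observed to be free, and acquire the lock for real only after repeated aborts. This is a software model under stated assumptions. The `speculate` callable is a hypothetical stand-in for a hardware transaction that returns `(True, result)` on commit and `(False, None)` on abort, and the retry limit is arbitrary.

```python
import threading


class ElidedLock:
    """Lock whose acquisition is skipped when speculation succeeds."""

    def __init__(self, max_attempts=3):
        self.lock = threading.Lock()
        self.max_attempts = max_attempts
        self.elided = 0     # critical sections that ran without the lock
        self.acquired = 0   # critical sections that fell back to the lock

    def run(self, critical_section, speculate):
        for _ in range(self.max_attempts):
            if self.lock.locked():
                break   # someone holds the lock: speculation is not safe
            committed, result = speculate(critical_section)
            if committed:
                self.elided += 1
                return result
        # Fallback path: behave like an ordinary lock.
        with self.lock:
            self.acquired += 1
            return critical_section()
```

In real hardware the lock word would be part of the transaction's read set, so a thread taking the fallback path aborts all concurrent speculative executions; the `locked()` check above only approximates that.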


Simulation Solutions

• Execution-driven, emulation [1]
• In-order / out-of-order cores, extended memory simulator (Simics + GEMS), SPARC [2]
• UVSIM, MIPS64, user-space [3]
• x86? Full-system? Detailed core model?

[1] Maurice Herlihy and J. Eliot B. Moss. "Transactional memory: Architectural support for lock-free data structures." Proceedings of the 20th Annual International Symposium on Computer Architecture (ISCA '93). ACM, 1993.
[2] Martin, Milo MK, et al. "Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset." ACM SIGARCH Computer Architecture News 33.4 (2005): 92-99.
[3] Zhang, Lixin. "UVSIM User Manual." (2003).


Related Work Summary

• Understand realities of baseline microarchitecture
• Constrain TM implementation
• Can we still improve with constraints and all the previous work?

YES!

Contributions

• Simulator: PTLsim-ASF / Marss86-ASF
• HTM ISA: ASF 2.0, ASF++
• Solve new problems with HTM: Thread to thread communication

PTLsim-ASF / Marss86-ASF

PTLsim-ASF [1]
• Out-of-order, superscalar core
• AMD64 ISA + ASF
• Detailed pipeline implementation of transactional memory primitives
• First order cache model
• Multiple options for tracking working set
• Full system, switch paravirtualisation <-> simulation (Xen-based)

Marss86-ASF [2]
• Core derived from PTLsim-ASF
• Detailed cache model
  – Controllers, directories, coherence states
  – Bandwidth and latency limitations
• QEMU-based, switch emulation <-> simulation

[1] Yourst, Matt T. "PTLsim: A cycle accurate full system x86-64 microarchitectural simulator." Performance Analysis of Systems & Software, 2007. ISPASS 2007. IEEE International Symposium on. IEEE, 2007.
[2] Patel, Avadh, et al. "MARSSx86: A full system simulator for x86 CPUs." Proceedings of the 2011 Design Automation Conference. 2011.

ASF - Advanced Synchronization Facility

• Extension to AMD64 ISA for HTM
• Key facts
  – Non-transactional loads and stores
  – Minimal capacity guarantee (four cache lines)
  – Requestor-wins conflict resolution, synchronous aborts
  – Partial register snapshot (only rIP, rSP)
  – OS-invisible / Hypervisor-invisible: syscalls, interrupts, exceptions abort

AMD "Advanced Synchronization Facility" Proposal, 2009. http://amddevcentral.com/Resources/archive/ASF/Pages/default.aspx
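Since the capacity guarantee is expressed in cache lines rather than bytes, whether a transaction is covered depends on how its accesses map to lines. The sketch below counts the distinct lines touched by a set of accesses and compares the count with the four-line guarantee. The 64-byte line size and the function names are assumptions for this illustration, and the model ignores the distinction between transactional and non-transactional accesses.

```python
LINE_BYTES = 64        # assumed cache line size
GUARANTEED_LINES = 4   # minimal capacity guarantee quoted for ASF


def lines_touched(accesses):
    """Return the set of cache line indices covered by (address, size) pairs."""
    lines = set()
    for addr, size in accesses:
        first = addr // LINE_BYTES
        last = (addr + size - 1) // LINE_BYTES   # an access may straddle lines
        lines.update(range(first, last + 1))
    return lines


def within_guarantee(accesses):
    """True if the working set fits the guaranteed transactional capacity."""
    return len(lines_touched(accesses)) <= GUARANTEED_LINES
```

For example, two 8-byte accesses at addresses 0 and 60 already cover two lines because the second one straddles a line boundary, and five accesses spaced 64 bytes apart exceed the guarantee.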

Using ASF's Unique Features

• Accelerate / cooperate with tinySTM

Christie, Dave, et al. "Evaluation of AMD's advanced synchronization facility within a complete transactional memory stack." Proceedings of the 5th European conference on Computer systems. ACM, 2010.

Using ASF's Unique Features

• Non-TX memory accesses for well-formed communication channels
• Implement RPC from hardware transactions
• And parallel transactional nesting

Liu, Yujie, Stephan Diestelhorst, and Michael Spear. "Delegation and nesting in best-effort hardware transactional memory." Proceedings of the 24th ACM symposium on Parallelism in algorithms and architectures. ACM, 2012.

ASF++ - Extensions to ASF

• Nested abort handling
• Supporting time-stamp accesses from transactions
• Non-atomic aborts
  – Immediate aborts -> asynchronous
  – Resurrect transactions: tolerate interrupts, syscalls
  – Map alert-on-update: low latency thread to thread communication, acceleration of STM
• All without expensive hardware changes or additional OS-visible state

Open Issues

• Microarchitectural enhancements
  – Tolerate read-after-write conflicts
  – Enable local transactional stores that never escape the transaction
• Architectural enhancements: RISC-ified lock elision
• Simulation stability

Conclusion

• Transactional memory has a high implementation cost and an open design space
• I create & use detailed simulation to guide exploration of microarchitectural and ISA enhancements
• I provide features with lower implementation cost, better performance and wider applicability
• Acknowledgements: AMD (M. Hohmuth, D. Christie, M. Pohlack), TUD (T. Riegel, M. Nowack, JT. Wamhoff), VELOX (UniNE, BSC)

Backup


Between All and Nothing Aborts

• Basic ASF instructions

• All arch state available at abort

• Coexisting locks / transactions

• No extension of OS-visible state



Transactional Memory 1993

• RISC-ify Compare-And-Swap
• Enclose instructions inside transactions
• Execute in parallel, watch for conflicts
• New instructions
  – Load-transactional
  – Store-transactional
  – Commit / Abort
  – Validate
• New hardware
  – Separate TX cache, neighbour to L1D
  – Keeps undo / redo copy
  – New bus transaction types
  – Line holder wins
  – Size guarantee: 10 – 100 locations

Maurice Herlihy and J. Eliot B. Moss. "Transactional memory: Architectural support for lock-free data structures." Proceedings of the 20th Annual International Symposium on Computer Architecture (ISCA '93). ACM, 1993.

Azul's HTM

Baseline microarchitecture
• 54 x 16 = 864 coherent cores, in-order, 2 misses
• Private L1i / L1d @ 16 kB per core
• L2 shared by 9 cores @ 2 MB
  – Task-level parallelism
• Need to tweak applications to scale
  – Usually < 10 cores
  – Tuning -> 50 cores
• Data contention low, but lock contention high (synchronized)

HTM microarchitecture
• All modifications to L1d + core, no tweaks of L2
• TXR & TXW bits in L1d
• No reg snapshot
• Size limit: L1d size & associativity

Click, Cliff. "Azul's experiences with hardware transactional memory." HP Labs-Bay Area Workshop on Transactional Memory. 2009.

Azul's HTM - Results

• 2x for some (Trade6)
• Most < 10% upside
• Heuristic (HTM vs lock) is tricky
• Small SW rewrites help massively
• Most TX fail for conflict, not capacity
• Shared counters etc. cause excessive conflicts

=> Need breadcrumbs, need failure reporting

Click, Cliff. "Azul's experiences with hardware transactional memory." HP Labs-Bay Area Workshop on Transactional Memory. 2009.
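The observation that shared counters cause excessive conflicts, and that small software rewrites help, can be illustrated by counting overlapping write sets. The sketch assumes four concurrent transactions that each update a private item plus a statistics counter; the function name and the address labels are hypothetical, and the numbers are not measurements from the Azul talk.

```python
def conflicting_pairs(write_sets):
    """Count pairs of concurrent transactions whose write sets overlap."""
    count = 0
    for i in range(len(write_sets)):
        for j in range(i + 1, len(write_sets)):
            if write_sets[i] & write_sets[j]:
                count += 1
    return count


# Every transaction bumps the same counter: all pairs conflict.
shared_counter = [{"counter", f"item{i}"} for i in range(4)]

# Rewrite: one counter per thread, summed up only when it is read.
per_thread_counter = [{f"counter{i}", f"item{i}"} for i in range(4)]
```

With the shared counter all six pairs of transactions conflict although their actual data is disjoint; with per-thread counters none do.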

Oracle / SUN's Rock

Baseline
• 16 cores x 2 threads, semi out-of-order
• 4-way 32 kB L1d shared by 2 cores, S-bit for load ordering
• 8-way 2 MB L2, shared by 16 cores
• 8 MB L3 per memory controller
• Checkpointed core / register file
• Cache miss causes Execute Ahead phase

HTM
• 32 entry store buffer for TX stores, L2 tracks conflicts
• Existing L1d S-bits track TX read set
• Commit locks write-set, drains store buffer into L2

Chaudhry, Shailender, et al. "Rock: A high-performance SPARC CMT processor." Micro, IEEE 29.2 (2009): 6-16.
Dice, Dave, et al. "Early experience with a commercial hardware transactional memory implementation." ASPLOS (2009).

Oracle / SUN's Rock

• Problems with data dependent stores and branch prediction in RB-tree

IBM Blue Gene/Q

Baseline
• 16 PowerPC A2 cores x 4 threads, in-order
• 8-way 16 kB L1d per core
• 16 cores share L2 @ 32 MB
  – 16 slices, 16 ways each
  – Point of coherence
  – Multi-version

HTM
• L2 buffers TX stores, both old / new versions in different ways
• L1 and core mostly unmodified
• L2 directory tracks ownership
• Short-TX: Push through L1, notify L2 with store for every load
• Long-TX:
  – TLB maps versions to different physical addresses
  – Flush L1 on TX start
• Small HW register checkpoint: SP, IP (and Global Offset Table)
• MMIO communication

Wang, Amy, et al. "Evaluation of Blue Gene/Q Hardware Support for Transactional Memories." PACT (2012).

IBM Blue Gene/Q - Results

• L1 causes significant overhead
• OS involvement

Wang, Amy, et al. "Evaluation of Blue Gene/Q Hardware Support for Transactional Memories." PACT (2012).

IBM Blue Gene/Q - Results

• Good scalability
• 20 MB working set too small in labyrinth
• Out of IDs in ssca2

Wang, Amy, et al. "Evaluation of Blue Gene/Q Hardware Support for Transactional Memories." PACT (2012).

IBM zEC12

Baseline
• Out-of-order core, three / seven wide, 6 cores x 20
• 6-way 96 kB L1d WT, 3 cyc
• 8-way 1 MB WT, 7 cyc
• 6 cores share L3 @ 48 MB
• 6 x 6 cores share 384 MB L4
• 6 x 6 x 4 = 144 coherent system
• Can reject probes

HTM
• Keep SMP protocol
• Extend fetch unit to track transaction begin / end
• TXR & TXW bits in L1
• Gate TX stores before L2 / L3
• Flexible
  – restore / ignore registers
• Constrained mode
  – Guaranteed progress
  – Size limit
  – Functionality limit

Jacobi, Christian, Timothy Slegel, and Dan Greiner. "Transactional Memory Architecture and Implementation for IBM System z." Proceedings of the 45th International Symposium on Microarchitecture (MICRO). 2012.

IBM zEC12 - Results

• Scale beyond MCM size
• Constrained mode better @ high contention
• TX can cause more traffic than locks due to aborts, everything in cache at one time

Jacobi, Christian, Timothy Slegel, and Dan Greiner. "Transactional Memory Architecture and Implementation for IBM System z." Proceedings of the 45th International Symposium on Microarchitecture (MICRO). 2012.

Progress Guarantees

• None, holder wins (TM)
• None, requester wins (Rock)
• Hardware timestamps (UTM)
• Dependency tracking [1]
• Hardware progress for restricted TX [2]
• Software abort control (RTM)

[1] Ramadan, Hany E., Christopher J. Rossbach, and Emmett Witchel. "Dependence-aware transactional memory for increased concurrency." Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2008.
[2] Jacobi, Christian, Timothy Slegel, and Dan Greiner. "Transactional Memory Architecture and Implementation for IBM System z." Proceedings of the 45th International Symposium on Microarchitecture (MICRO). 2012.
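The conflict resolution policies named in this talk differ only in which of the two conflicting transactions is aborted. The sketch below encodes three of them as a single function; the dictionary layout, the policy strings and the function name `victim` are assumptions for this illustration. "Older wins" is modelled with begin timestamps, in the spirit of the hardware-timestamp approach.

```python
def victim(policy, holder, requester):
    """Return the name of the transaction that is aborted.

    holder currently owns the contended line; requester asks for it.
    Both are dicts with a "name" and a "start" (begin timestamp).
    """
    if policy == "holder_wins":
        return requester["name"]
    if policy == "requester_wins":
        return holder["name"]
    if policy == "older_wins":
        # The transaction that began first survives.
        if holder["start"] <= requester["start"]:
            return requester["name"]
        return holder["name"]
    raise ValueError(f"unknown policy: {policy}")


old_tx = {"name": "A", "start": 10}
new_tx = {"name": "B", "start": 20}
```

Only the timestamp-based policy gives some form of progress guarantee here: the oldest transaction in the system is never chosen as the victim, whereas under holder-wins or requester-wins two transactions can keep aborting each other.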