Parallelizing Sequential Applications on Commodity Hardware Using a Low-Cost Software Transactional Memory

Mojtaba Mehrara, Jeff Hao, Po-Chun Hsu, Scott Mahlke
Advanced Computer Architecture Lab
Electrical Engineering and Computer Science, University of Michigan

Multicore Architectures

• Industry-wide move to multicore
  – Higher throughput
  – More power efficient
• Great for parallel programs
• Sequential programs see little benefit

[Figure: example multicore processors: Intel 4-core Nehalem, AMD 4-core Shanghai, Sun Niagara 2, IBM Cell]

Loop Parallelization [Zhong ‘08]

• Parallelizable loop: no cross-iteration register or memory dependences
• Bad news: limited number of parallel loops in general-purpose applications

[Figure: a loop over i = 0-39 split into two chunks, i = 0-19 and i = 20-39, running on Core 0 and Core 1]
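
The chunking shown on this slide can be sketched in a few lines of Python. This is only an illustration of the partitioning, not the framework from [Zhong ‘08]: the names `split_into_chunks`, `run_chunk` and `parallel_loop` are hypothetical, and the loop body (doubling each element) is an assumed example of a loop with no cross-iteration dependences. In standard CPython builds the global interpreter lock keeps threads from running bytecode in parallel, so the sketch shows the structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(n_iters, n_chunks):
    """Return contiguous (start, end) ranges that cover 0..n_iters-1."""
    size = max(1, (n_iters + n_chunks - 1) // n_chunks)
    return [(s, min(s + size, n_iters)) for s in range(0, n_iters, size)]


def run_chunk(src, dst, start, end):
    # Iteration i reads src[i] and writes dst[i] only, so chunks
    # never touch the same element and need no synchronization.
    for i in range(start, end):
        dst[i] = src[i] * 2


def parallel_loop(src, n_workers=2):
    dst = [0] * len(src)
    chunks = split_into_chunks(len(src), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(run_chunk, src, dst, s, e) for s, e in chunks]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return dst
```

With 40 iterations and two workers, `split_into_chunks(40, 2)` yields the two ranges from the slide, 0-19 and 20-39 (expressed as half-open intervals).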

Loop Parallelization

[Figure: chart for the SPECfp benchmarks, from [Zhong ‘08]]

Speculative Loop Parallelization

• Speculatively parallelizable loop: memory address is unresolvable statically (pointer accesses)

[Figure: a loop over i = 0-39 divided into loop chunks i = 0-9, i = 10-19, i = 20-29 and i = 30-39, each containing a pointer access, distributed across Core 0 and Core 1]

Speculative Loop Parallelization

Supporting Thread Level Speculation

• Execution of speculative loops requires
  – Conflict detection
  – Rollback mechanism
• Speculation can be supported by transactional memory
  – Software is slow
  – Hardware needs complex structures
• Previous TLS works require hardware
  – Hydra [Hammond ‘98], Stampede [Steffan ‘98], POSH [Liu ‘06]

Objectives

• Challenge
  – Can we get speedup supporting speculative loop parallelization without additional hardware?
• Build a specialized software system
  – Provide functionality needed for speculation with software transactional memory
  – Leverage existing loop parallelization framework from [Zhong ‘08]
  – Tightly couple STM with compiler to ensure low overhead

Traditional STM Execution Flow

[Figure: Start TX → Execute TX (builds RdSet and WrSet) → End TX → TX Commit (consistency check) → Abort, or Commit and writeback of WrSet to memory]

• High overhead: validating RdSet
• High overhead: global locking
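
The flow on this slide can be sketched as a toy STM. This is a minimal sketch under stated assumptions, not the design of any particular STM: the classes `Memory`, `Transaction` and `Abort` are hypothetical, memory is modeled as a dictionary with a per-address version counter, and a single global lock protects commit. It shows where the two overheads named above arise: the loop over `rd_set` at commit, and the global `commit_lock`.

```python
import threading


class Abort(Exception):
    """Raised when a transaction fails its consistency check."""


class Memory:
    def __init__(self):
        self.values = {}    # address -> committed value
        self.versions = {}  # address -> number of committed writes
        self.commit_lock = threading.Lock()  # one global lock (assumed)


class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.rd_set = {}  # address -> version seen at first read
        self.wr_set = {}  # address -> buffered new value

    def read(self, addr):
        if addr in self.wr_set:  # read our own buffered write
            return self.wr_set[addr]
        self.rd_set.setdefault(addr, self.mem.versions.get(addr, 0))
        return self.mem.values.get(addr, 0)

    def write(self, addr, value):
        self.wr_set[addr] = value  # buffered until commit

    def commit(self):
        with self.mem.commit_lock:
            # Consistency check: every address we read must be unchanged.
            for addr, seen in self.rd_set.items():
                if self.mem.versions.get(addr, 0) != seen:
                    raise Abort(addr)
            # Writeback of the write set to memory.
            for addr, value in self.wr_set.items():
                self.mem.values[addr] = value
                self.mem.versions[addr] = self.mem.versions.get(addr, 0) + 1
```

A transaction that read an address before another transaction committed a write to it raises `Abort` at commit; its buffered writes are simply discarded, which acts as the rollback.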

Ordering Transaction Commit

• TMs typically have no way of controlling commit order
• Loop iterations must commit in original order
  – Ensures proper rollback
• Requires centralized control to enforce ordering

[Figure: four transactions, TX 1 to TX 4, covering chunks i = 0-9, i = 10-19, i = 20-29 and i = 30-39; TX 1 and TX 3 run on Core 0, TX 2 and TX 4 on Core 1]
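
One way to picture in-order commit is a shared sequence counter that each transaction must wait on before committing. The sketch below is an assumption made for illustration, not the mechanism described in the talk: `CommitOrder` and `commit_in_order` are hypothetical names, and the ordering is enforced with a condition variable rather than a dedicated control thread.

```python
import threading


class CommitOrder:
    """Lets transactions commit strictly in sequence-number order."""

    def __init__(self):
        self._next = 0  # sequence number allowed to commit next
        self._cond = threading.Condition()

    def commit_in_order(self, seq, do_commit):
        with self._cond:
            # Block until all earlier loop chunks have committed.
            self._cond.wait_for(lambda: self._next == seq)
            do_commit()
            self._next += 1
            self._cond.notify_all()  # wake the waiting successors


if __name__ == "__main__":
    order = CommitOrder()
    log = []
    # Start the chunks out of order; commits still follow loop order.
    threads = [
        threading.Thread(target=order.commit_in_order,
                         args=(seq, lambda seq=seq: log.append(seq)))
        for seq in (3, 1, 2, 0)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(log)
```

Because a later chunk can never commit before an earlier one, aborting a chunk never requires undoing work that a later chunk has already made visible.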

STMlite

• Dedicated thread to control commits
  – Called the Transaction Commit Manager (TCM)
  – Performs consistency checks for all transactions
  – Provides point to easily enforce in-order commit
• Bloom-filter based signatures
  – Hash read and write sets
  – Similar technique used by HTMs like Bulk [Ceze ‘06]
  – Low-cost consistency checks during commit

Bloom-Filter Based Signatures

• Constant-time insertion and find
• Linear-time intersection (bitwise AND)

[Figure: address bit fields (for example 101 010 and 101 100) are decoded to set bits in the signature, a bit array]
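
A signature of this kind can be sketched as a fixed-width bit array. The code below is an assumed illustration, not the encoding used by STMlite or Bulk: the class name `Signature`, the 64-bit width, the two hash functions and the use of SHA-256 are all choices made for the example (the slide's figure decodes address bit fields directly instead). Intersection is computed with a bitwise AND of the two bit arrays; a bitwise OR would give their union.

```python
import hashlib


class Signature:
    """Bloom-filter style summary of a set of addresses."""

    def __init__(self, n_bits=64, n_hashes=2):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # Python int used as the bit array

    def _positions(self, addr):
        # Derive n_hashes bit positions from the address (assumed scheme).
        for k in range(self.n_hashes):
            digest = hashlib.sha256(f"{k}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def insert(self, addr):
        for p in self._positions(addr):
            self.bits |= 1 << p

    def may_contain(self, addr):
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> p) & 1 for p in self._positions(addr))

    def may_intersect(self, other):
        # Any shared address sets the same bits in both signatures,
        # so an all-zero AND proves the two sets are disjoint.
        return (self.bits & other.bits) != 0
```

The check is conservative: it can report a conflict that does not exist, which costs an unnecessary abort, but it never misses a real one.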

STMlite Execution Flow

[Figure: STMlite execution flow]

• Transaction: Start TX → Execute TX (builds RdSet and WrSet, plus RdSig and WrSig) → End TX → Flag Ready → on Commit, writeback of WrSet to memory; on Abort, re-execute
• Transaction Commit Manager (TCM): Wait for Ready → Consistency Check → Abort or Commit

Experimental Setup

• Implemented framework in LLVM Compiler
• Benchmarks
  – Stanford STAMP transactional benchmarks
  – SPECfp benchmarks
• Run on Sunfire T2000
  – 8-core UltraSPARC T1 processor
• Baseline STM is Sun’s TL2 [Dice ‘06]

STAMP Benchmarks

SPECfp Benchmarks

Conclusion

• STMlite
  – Customized for speculative loop parallelization
  – Transaction commit ordering
  – Centralized consistency checks
  – Hashing read/write sets with signatures
• Parallelization of sequential applications is feasible on commodity hardware
  – Removes much of the slowdown traditionally associated with STM

Thank You!

Questions?

Transaction Execution and Commit

• Stale entries periodically removed from commit log

[Figure: transaction execution and commit timeline. Labels: Start, Executing, End, Waiting, Ready, Checking, Commit, Writeback. The Transaction Commit Manager (TCM) checks the transaction's RdSig against each WrSig in the commit log ("Consistent?")]
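
The commit-log check and the pruning of stale entries can be modeled in a few lines. This is a single-threaded sketch under stated assumptions, not the STMlite implementation: `CommitManager`, `start`, `try_commit` and `prune` are hypothetical names, signatures are plain integers used as bit masks, and a logical clock stands in for whatever timestamps the real system uses. A log entry is treated as stale once every active transaction started after it was committed.

```python
class CommitManager:
    """Toy model of a commit manager with a log of write signatures."""

    def __init__(self):
        self.clock = 0        # logical time, advanced on each commit
        self.commit_log = []  # list of (commit_time, wr_sig)

    def start(self):
        # A transaction records the logical time at which it started.
        return self.clock

    def try_commit(self, start_time, rd_sig, wr_sig):
        # Compare the read signature against the write signature of
        # every transaction that committed after this one started.
        for commit_time, logged_wr_sig in self.commit_log:
            if commit_time > start_time and (rd_sig & logged_wr_sig):
                return False  # possible conflict: abort
        self.clock += 1
        self.commit_log.append((self.clock, wr_sig))
        return True  # consistent: caller may write back

    def prune(self, oldest_active_start):
        # Entries committed at or before the oldest active start time
        # can no longer conflict with anyone, so they are dropped.
        self.commit_log = [
            (t, sig) for t, sig in self.commit_log if t > oldest_active_start
        ]
```

For example, if two transactions start together and the first commits a write to a location that the second has read, the second fails `try_commit` and must re-execute; a transaction that starts after that commit sees the new value and passes the check.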