Combining Thread Level Speculation, Helper Threads, and Runahead Execution
Polychronis Xekalakis, Nikolas Ioannou and Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA


ICS 2009 2

Introduction

Single out-of-order cores don't scale
  – Simpler solution: multi-core architectures
No speedup for single-threaded applications
  – Use Thread Level Speculation to extract TLP
  – Use Helper Threads or RunAhead to improve ILP
However, for different apps (or phases), some models work better than others
Our proposal:
  – Combine these execution models
  – Decide at runtime when to employ them


Contributions

Introduce mixed Speculative Multithreading (SM) Execution Models

Design one that combines TLS, HT and RA

Propose a performance model able to quantify ILP and TLP benefits

Unified approach outperforms state-of-the-art SM models:
  – TLS by 10.2% avg. (up to 41.2%)
  – RA by 18.3% avg. (up to 35.2%)


Outline

Introduction
Speculative Multithreading Models
Performance Model
Unified Scheme
Experimental Setup and Results
Conclusions


Helper Threads

Compiler deals with:
  – Memory ops that miss/hard-to-predict branches
  – Backward slices
HW deals with:
  – Spawn threads
  – Different context
  – Discard when finished
Benefit:
  – ILP (Prefetch/Warmup)


RunAhead Execution

Compiler deals with:
  – Nothing
HW deals with:
  – Different context
  – When to do RA
  – VP Memory
  – Commit/Discard
Benefit:
  – ILP (Prefetch/Warmup)


Thread Level Speculation

Compiler deals with:
  – Task selection
  – Code generation
HW deals with:
  – Different context
  – Spawn threads
  – Detecting violations
  – Replaying
  – Arbitrate commit
Benefit: TLP/ILP
  – TLP (Overlapped Execution) + ILP (Prefetching)



Understanding Performance Benefits

Complex TLS thread interactions obscure performance benefits
Even more true for mixed execution models
We need a way to quantify ILP and TLP contributions to bottom-line performance
Proposed model:
  – Able to break benefits down into ILP/TLP contributions

Performance Model

$S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$

1. Compute overall speedup: $S_{all} = T_{seq} / T_{mt}$
2. Compute sequential TLS speedup: $S_{seq} = T_{seq} / T_{1p}$
3. Compute speedup due to ILP: $S_{ilp} = (T_1 + T_2) / (T_1' + T_2')$
4. Use everything to compute TLP: $S_{ovl} = S_{all} / (S_{seq} \times S_{ilp})$
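The four steps of the performance model can be tried out numerically. The following is a minimal sketch, not the authors' tooling: the function and argument names and the sample timings are hypothetical. It assumes that T1 and T2 are per-thread times in the one-processor TLS run and that T1' and T2' are the same threads' times in the multithreaded run; the slides do not define these symbols.

```python
from typing import NamedTuple, Sequence


class Speedups(NamedTuple):
    s_all: float  # overall speedup
    s_seq: float  # sequential TLS speedup
    s_ilp: float  # speedup attributed to ILP
    s_ovl: float  # speedup attributed to overlapped execution (TLP)


def decompose_speedup(
    t_seq: float,
    t_mt: float,
    t_1p: float,
    thread_times_1p: Sequence[float],  # assumed meaning of T1, T2
    thread_times_mt: Sequence[float],  # assumed meaning of T1', T2'
) -> Speedups:
    """Split the overall speedup into S_seq, S_ilp and S_ovl factors."""
    s_all = t_seq / t_mt                                  # step 1
    s_seq = t_seq / t_1p                                  # step 2
    s_ilp = sum(thread_times_1p) / sum(thread_times_mt)   # step 3
    s_ovl = s_all / (s_seq * s_ilp)                       # step 4
    return Speedups(s_all, s_seq, s_ilp, s_ovl)


if __name__ == "__main__":
    # Made-up timings, for illustration only.
    result = decompose_speedup(100.0, 50.0, 110.0, [55.0, 55.0], [44.0, 44.0])
    print(result)
```

By construction the three factors multiply back to the overall speedup, so the TLP term is whatever remains after the sequential and ILP effects have been accounted for.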



Unified Execution Model

Can we improve TLS?
  1. Some of the threads do not help
  2. Slack in usage of cores
Improve TLP:
  – Requires a better compiler
Improve ILP:
  – Combine TLS with another SM model!
  – Most of the HW is common


Combining TLS, HT and RA

Start with TLS
Provide support to clone TLS threads and convert them to HT
Conversion to HT means:
  – Put them in RA mode
  – Suppress squashes and do not cause additional squashes
  – Discard them when they finish
No compiler slicing: purely HW approach
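The conversion rules listed on this slide can be written down as a small state model. This is only an illustrative sketch of the listed properties, not the hardware design: the `SpecThread` record, its field names, and `clone_as_helper` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    TLS = "tls"
    HELPER = "helper"


@dataclass
class SpecThread:
    tid: int
    mode: Mode = Mode.TLS
    runahead: bool = False          # executing in RA mode
    can_be_squashed: bool = True    # squashes apply to this thread
    can_squash_others: bool = True  # this thread may trigger squashes
    commits_state: bool = True      # results are committed on finish
    parent: Optional[int] = None    # tid of the thread it was cloned from


def clone_as_helper(parent: SpecThread, new_tid: int) -> SpecThread:
    """Clone a TLS thread and convert the clone into a helper thread."""
    return SpecThread(
        tid=new_tid,
        mode=Mode.HELPER,
        runahead=True,            # put it in RA mode
        can_be_squashed=False,    # suppress squashes
        can_squash_others=False,  # do not cause additional squashes
        commits_state=False,      # discard it when it finishes
        parent=parent.tid,
    )
```

The parent thread is left untouched, which matches the slide's point that the TLS thread is cloned rather than replaced.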


Intricacies to be Handled

HT may not prefetch effectively!
Dealing with contention
  – HT threads are much faster and saturate BW
Dealing with thread ordering
  – TLS imposes total thread order
  – Killing an HT squashes TLS threads


Creating and Terminating HT

Create an HT on an L2 miss that we can VP
  – Use a memory-address-based confidence estimator
  – VP only if confident
Create an HT only if we have a free processor
Only allow the most speculative thread to clone
  – Seamless integration of HT with TLS
  – BUT: if the parent is no longer the most speculative TLS thread, the HT has to be killed
Additionally kill the HT when:
  – Parent/HT thread finishes
  – HT causes an exception
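The creation and termination conditions for an HT can be expressed as two predicates. This is a minimal sketch under stated assumptions: the slides only say that the confidence estimator is indexed by memory address, so the saturating-counter scheme, the table size, and the threshold below are all hypothetical, as are the function names.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ConfidenceEstimator:
    """Address-indexed saturating counters (scheme and sizes are assumed)."""
    entries: int = 1024
    threshold: int = 2
    max_count: int = 3
    counters: Dict[int, int] = field(default_factory=dict)

    def _index(self, addr: int) -> int:
        return addr % self.entries

    def is_confident(self, addr: int) -> bool:
        return self.counters.get(self._index(addr), 0) >= self.threshold

    def update(self, addr: int, prediction_correct: bool) -> None:
        i = self._index(addr)
        count = self.counters.get(i, 0)
        # Count up on a correct value prediction, reset on a wrong one.
        self.counters[i] = min(count + 1, self.max_count) if prediction_correct else 0


def should_create_ht(l2_miss: bool, addr: int, estimator: ConfidenceEstimator,
                     free_processors: int, is_most_speculative: bool) -> bool:
    """Clone only on a value-predictable L2 miss, with a free core,
    and only from the most speculative TLS thread."""
    return (l2_miss and estimator.is_confident(addr)
            and free_processors > 0 and is_most_speculative)


def should_kill_ht(parent_is_most_speculative: bool, parent_finished: bool,
                   ht_finished: bool, ht_exception: bool) -> bool:
    """Kill the HT if its parent lost most-speculative status, if either
    thread finished, or if the HT raised an exception."""
    return (not parent_is_most_speculative or parent_finished
            or ht_finished or ht_exception)
```

Keeping the two decisions as separate predicates mirrors the slide: creation depends on the miss, the predictor, and core availability, while termination depends only on thread state.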



Experimental Setup

Simulator, Compiler and Benchmarks:
  – SESC (http://sesc.sourceforge.net/)
  – POSH (Liu et al., PPoPP '06)
  – SPEC 2000 Int.
Architecture:
  – Four-way CMP, 4-issue cores
  – 16KB L1 Data (multi-versioned) and Instruction Caches
  – 1MB unified L2 Caches
  – Inst. window/ROB: 80/104 entries
  – 16KB Last Value Predictor


Comparing TLS, RunAhead and Unified Scheme

Almost additive benefits
10.2% over TLS, 18.3% over RA


Understanding the extra ILP

Improvements in ILP come from:
  – Mainly memory
  – Branch prediction (improvement of 0.5%)
Focus on memory:
  – Miss rate on committed path
  – Clustering of misses (different cost)


Normalized Shared Cache Misses

All schemes better than sequential
Unified 41% better than sequential


Isolated vs. Clustered Misses

Both TLS and RA behave like large-window machines
Unified does even better



Also in the paper...

Dealing with the load of the system
Converting TLS threads to HT
Multiple HT
Effect of a better VP
Detailed comparison of the performance model against existing models (Renau et al., ICS '05)


Conclusions

CMPs are here to stay:
  – What about single-threaded apps and apps with significant sequential sections?
Different apps require different SM techniques
  – Even within apps, different phases
We propose the first mixed execution model
  – TLS is nicely complemented by HT and RA
Our unified scheme outperforms existing SM models
  – TLS by 10.2% avg. (up to 41.2%)
  – RA by 18.3% avg. (up to 35.2%)
