
Page 1: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

A Scalable Approach to Thread-Level Speculation

J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry

Computer Science Department
Carnegie Mellon University

Page 2: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Multithreaded Machines Are Everywhere

[Figure: multithreaded machines built from processors (P) and caches (C) over shared memory, running threads. Examples: SUN MAJC, IBM Power4, ALPHA 21464, Dual Pentium, SGI Origin.]

How can we use them? Parallelism!

Page 3: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Automatic Parallelization

Proving independence of threads is hard:

– complex control flow
– complex data structures
– pointers, pointers, pointers
– run-time inputs

How can we make the compiler's job feasible?

Thread-Level Speculation (TLS)

Page 4: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Example

while (...) {
    x = hash[index1];
    …
    hash[index2] = y;
    ...
}

[Figure: the iterations run one after another on a single processor, along a time axis]

= hash[3] … hash[10] = …
= hash[19] … hash[21] = …
= hash[33] … hash[30] = …
= hash[10] … hash[25] = …

Page 5: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Example of Thread-Level Speculation

[Figure: the four iterations run as epochs in parallel, one per processor, along a time axis]

Epoch 1: = hash[3] … hash[10] = …
Epoch 2: = hash[19] … hash[21] = …
Epoch 3: = hash[33] … hash[30] = …
Epoch 4: = hash[10] … hash[25] = …

Page 6: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Example of Thread-Level Speculation

[Figure: the four iterations run as epochs in parallel, one per processor, along a time axis]

Epoch 1: = hash[3] … hash[10] = …
Epoch 2: = hash[19] … hash[21] = …
Epoch 3: = hash[33] … hash[30] = …
Epoch 4: = hash[10] … hash[25] = …

Violation! (Epoch 4 loads hash[10], which Epoch 1 stores.)

Page 7: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Example of Thread-Level Speculation

[Figure: the four epochs run in parallel, one per processor, along a time axis]

Epoch 1: = hash[3] … hash[10] = … commit?
Epoch 2: = hash[19] … hash[21] = … commit?
Epoch 3: = hash[33] … hash[30] = … commit?
Epoch 4: = hash[10] … hash[25] = … commit?

Violation!

Page 8: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Example of Thread-Level Speculation

[Figure: the four epochs run in parallel, one per processor, along a time axis]

Epoch 1: = hash[3] … hash[10] = … commit?
Epoch 2: = hash[19] … hash[21] = … commit?
Epoch 3: = hash[33] … hash[30] = … commit?
Epoch 4: = hash[10] … hash[25] = … commit?

Violation!

Epoch 4 (retry): = hash[10] … hash[25] = … commit?
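The retry in this example can be reproduced with a small sketch. It is not the hardware mechanism from the talk: it simply scans the (load, store) index pairs shown on the slide and, as a stated assumption, treats every load that depends on an earlier epoch's store as having run too early. The names `EPOCHS`, `find_violations`, and `speculate` are hypothetical.

```python
# (load index, store index) for epochs 1..4, taken from the slide's example.
EPOCHS = [(3, 10), (19, 21), (33, 30), (10, 25)]


def find_violations(epochs):
    """Return (writer, reader, index) for every read-after-write dependence
    that runs from an earlier epoch's store to a later epoch's load."""
    violations = []
    for reader, (load_idx, _) in enumerate(epochs, start=1):
        for writer in range(1, reader):
            store_idx = epochs[writer - 1][1]
            if store_idx == load_idx:
                violations.append((writer, reader, load_idx))
    return violations


def speculate(epochs):
    """Assumption: every dependent load ran before the store it depends on.
    Such epochs are retried; all other epochs commit."""
    must_retry = {reader for _, reader, _ in find_violations(epochs)}
    return {n: "retry" if n in must_retry else "commit"
            for n in range(1, len(epochs) + 1)}


if __name__ == "__main__":
    print(find_violations(EPOCHS))
    print(speculate(EPOCHS))
```

Only the dependence on hash[10] between Epoch 1 and Epoch 4 is reported, so the other three epochs commit without being re-executed.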

Page 9: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Goals of Our Approach

1) Handle arbitrary memory accesses
   – i.e., not just array references

2) Preserve performance of non-speculative workloads
   – keep hardware support minimal and simple

3) Apply to any scale of multithreaded architecture
   – CMPs, SMT processors, more traditional MPs

effective, simple, and scalable TLS

Page 10: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Overview of Our Approach

System requirements:

1) Detect data dependence violations
   • extend invalidation-based cache coherence

2) Buffer speculative modifications
   • use the caches as speculative buffers

coherence already works at a variety of scales

hence our scheme is also scalable

Page 11: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Related Schemes

• Wisconsin (Multiscalar, Trace Processor)
• Stanford (Hydra)
• U.P. Catalunya (Speculative Multithreading)
• Intel/U. Portland (Dynamic Multithreading)
• Illinois at U.C. (I-ACOMA)

our approach seamlessly scales both up and down

Page 12: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Outline

• Details of our Approach (current section)
   – life cycle of an epoch
   – speculative coherence
   – what happens at commit time
   – forwarding data between epochs
• Performance
• Conclusions

Page 13: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Life Cycle of an Epoch

[Figure: timeline of an epoch along a time axis, with these labels: Spawned; Init; Becomes Speculative; Speculative Work; Wait to be Homefree?; Commit? (Slow Commit / Fast Commit); Complete, Pass Homefree.]

Page 14: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Life Cycle of an Epoch

[Figure: epoch timeline (Spawned; Becomes Speculative; Commit?; Complete, Pass Homefree) with two parts called out: Speculative Coherence, and Mechanisms to Squash or Commit.]

Page 15: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

MESI Coherence Example

Shared Memory (X=2)

Thread A cache: Tag -, State Invalid, Data -
Thread B cache: Tag -, State Invalid, Data -

Page 16: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

MESI Coherence Example

Shared Memory (X=2)

Thread A cache: Tag -, State Invalid, Data -
Thread B cache: Tag -, State Invalid, Data -

Thread B: Load X
Message: Read

Page 17: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

MESI Coherence Example

Shared Memory (X=2)

Thread A cache: Tag -, State Invalid, Data -
Thread B cache: Tag X, State Excl., Data 2

Thread B: Load X
Messages: Read, Fill

Page 18: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

MESI Coherence Example

Shared Memory (X=2)

Thread A cache: Tag -, State Invalid, Data -
Thread B cache: Tag X, State Excl., Data 2

Thread A: Store X=3
Thread B: Load X
Message: Read-Exclusive

read-exclusive invalidates all other copies

Page 19: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

MESI Coherence Example

Shared Memory (X=2)

Thread A cache: Tag -, State Invalid, Data -
Thread B cache: Tag -, State Invalid, Data -

Thread A: Store X=3
Thread B: Load X
Messages: Read-Exclusive, Invalidation

read-exclusive invalidates all other copies

Page 20: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

MESI Coherence Example

Shared Memory (X)

Thread A cache: Tag X, State Dirty, Data 3
Thread B cache: Tag -, State Invalid, Data -

Thread A: Store X=3
Thread B: Load X
Messages: Read-Exclusive, Invalidation, Fill

the state 'dirty' implies exclusiveness
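The sequence in this example can be replayed with a toy model of invalidation-based coherence. This is a minimal sketch under several assumptions: only one address (X) exists, each thread has one cache, and all bus messages complete instantly. The class name `SingleLineMESI` and its fields are hypothetical.

```python
class SingleLineMESI:
    """One cache line (address X) per cache, kept coherent by invalidation."""

    def __init__(self, names, memory_value):
        self.memory = memory_value
        self.state = {n: "Invalid" for n in names}
        self.data = {n: None for n in names}

    def _others_with_copy(self, name):
        return [n for n in self.state
                if n != name and self.state[n] != "Invalid"]

    def load(self, name):
        if self.state[name] == "Invalid":        # miss: send a Read
            sharers = self._others_with_copy(name)
            for n in sharers:
                if self.state[n] == "Dirty":     # the owner writes back first
                    self.memory = self.data[n]
                self.state[n] = "Shared"
            self.state[name] = "Shared" if sharers else "Exclusive"
            self.data[name] = self.memory        # Fill
        return self.data[name]

    def store(self, name, value):
        # Read-Exclusive: every other copy is invalidated.
        for n in self._others_with_copy(name):
            if self.state[n] == "Dirty":
                self.memory = self.data[n]
            self.state[n], self.data[n] = "Invalid", None
        # 'Dirty' implies exclusiveness: no other valid copy remains.
        self.state[name], self.data[name] = "Dirty", value


if __name__ == "__main__":
    bus = SingleLineMESI(["A", "B"], memory_value=2)
    bus.load("B")       # Thread B: Load X
    bus.store("A", 3)   # Thread A: Store X=3
    print(bus.state, bus.data, bus.memory)
```

After the two operations, Thread A holds X in the Dirty state with value 3, Thread B's copy is Invalid, and memory still holds the old value until a write-back.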

Page 21: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Speculative Coherence Example

Highlights of our scheme:

– detection of a data dependence violation
– speculatively modified and shared cache lines

Epoch 4: Load X
Epoch 5: Store X=3
Epoch 6: Load X

Page 22: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Speculative Coherence Example

Shared Memory (X=2)

Epoch 5 cache: Tag -, State Invalid, Data -
Epoch 6 cache: Tag -, State Invalid, Data -

Epoch 6: Load X
Message: Read

Page 23: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Speculative Coherence Example

Shared Memory (X=2)

Epoch 5 cache: Tag -, State Invalid, Data -
Epoch 6 cache: Tag X, State Excl., Data 2, Spec. Loaded

Epoch 6: Load X
Messages: Read, Fill

track which lines are speculatively loaded

Page 24: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Speculative Coherence Example

Shared Memory (X=2)

Epoch 5 cache: Tag -, State Invalid, Data -
Epoch 6 cache: Tag X, State Excl., Data 2, Spec. Loaded

Epoch 5: Store X=3
Epoch 6: Load X
Message: Sp Read-Ex (epoch5)

speculative messages piggyback the epoch number

Page 25: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Speculative Coherence Example

Shared Memory (X=2)

Epoch 5 cache: Tag -, State Invalid, Data -
Epoch 6 cache: Tag X, State Excl., Data 2, Spec. Loaded

Epoch 5: Store X=3
Epoch 6: Load X
Messages: Sp Read-Ex (epoch5), Sp Inv (epoch5)

epoch5 < epoch6, and speculatively loaded

Page 26: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Speculative Coherence Example

Shared Memory (X=2)

Epoch 5 cache: Tag -, State Invalid, Data -
Epoch 6 cache: Tag -, State Invalid, Data -

Epoch 5: Store X=3
Epoch 6: Load X (speculation failed!)
Messages: Sp Read-Ex (epoch5), Sp Inv (epoch5)

speculation fails for epoch 6

Page 27: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Speculative Coherence Example

Shared Memory (X=2)

Epoch 5 cache: Tag X, State Excl., Data 3, Spec. Modified
Epoch 6 cache: Tag -, State Invalid, Data -

Epoch 5: Store X=3
Epoch 6: Load X (speculation failed!)
Messages: Sp Read-Ex (epoch5), Sp Inv (epoch5), Fill

track which lines are speculatively modified
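The violation check in this example (an invalidation from a logically earlier epoch hits a speculatively loaded line) can be sketched in software. This is an illustrative model, not the hardware protocol: messages are plain method calls, each epoch has a private cache, and a line is marked speculatively loaded on every load. The names `SpecCache`, `load`, `store`, and `on_spec_invalidate` are hypothetical.

```python
class SpecCache:
    """Private cache of one epoch, with per-line SL and SM bits."""

    def __init__(self, epoch):
        self.epoch = epoch
        self.lines = {}          # address -> {"data", "SL", "SM"}
        self.violated = False

    def load(self, addr, memory):
        if addr not in self.lines:                       # Read + Fill
            self.lines[addr] = {"data": memory[addr], "SL": False, "SM": False}
        self.lines[addr]["SL"] = True                    # speculatively loaded
        return self.lines[addr]["data"]

    def store(self, addr, value, other_caches):
        # The speculative read-exclusive carries this epoch's number.
        for cache in other_caches:
            cache.on_spec_invalidate(addr, self.epoch)
        line = self.lines.setdefault(
            addr, {"data": None, "SL": False, "SM": False})
        line["data"], line["SM"] = value, True           # buffered, not in memory

    def on_spec_invalidate(self, addr, sender_epoch):
        line = self.lines.get(addr)
        if line is None or sender_epoch >= self.epoch:
            return               # assumption: later epochs do not disturb us
        if line["SL"]:
            self.violated = True  # we read X before an earlier epoch wrote it
        del self.lines[addr]


if __name__ == "__main__":
    memory = {"X": 2}
    epoch5, epoch6 = SpecCache(5), SpecCache(6)
    epoch6.load("X", memory)
    epoch5.store("X", 3, [epoch6])
    print(epoch6.violated, epoch5.lines, memory)
```

Epoch 6 is flagged because it loaded X before Epoch 5's store, while memory keeps the old value because the store stays buffered in Epoch 5's cache.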

Page 28: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Speculative Coherence Example

Highlights of our scheme:

– detection of a data dependence violation
– speculatively modified and shared cache lines

Epoch 4: Load X
Epoch 5: Store X=3
Epoch 6: Load X

Page 29: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Speculative Coherence Example

Shared Memory (X=2)

Epoch 4 cache: Tag -, State Invalid, Data -
Epoch 5 cache: Tag X, State Excl., Data 3, Spec. Modified

Epoch 5: Store X=3

Page 30: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Speculative Coherence Example

Shared Memory (X=2)

Epoch 4 cache: Tag -, State Invalid, Data -
Epoch 5 cache: Tag X, State Excl., Data 3, Spec. Modified

Epoch 4: Load X
Epoch 5: Store X=3
Message: Read

Page 31: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Speculative Coherence Example

Shared Memory (X=2)

Epoch 4 cache: Tag -, State Invalid, Data -
Epoch 5 cache: Tag X, State Shared, Data 3, Spec. Modified

Epoch 4: Load X
Epoch 5: Store X=3
Messages: Read, notify shared

both speculatively modified and shared!

Page 32: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Speculative Coherence Example

Shared Memory (X=2)

Epoch 4 cache: Tag X, State Shared, Data 2, Spec. Loaded
Epoch 5 cache: Tag X, State Shared, Data 3, Spec. Modified

Epoch 4: Load X
Epoch 5: Store X=3
Messages: Read, notify shared, Fill

multiple versions of the same cache line

Page 33: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Summary of New Speculative Line State

New cache line state:

– has it been speculatively loaded?
   • detect dependence violations
– has it been speculatively modified?
   • buffer speculative modifications
– is it in a speculative shared or exclusive state?
   • important performance optimizations

What if a speculative cache line is replaced?

– speculation fails for that epoch

Page 34: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Implementation of Speculative State

[Figure: a processor's cache with Tag, State, and Data columns; all four entries are empty.]

Page 35: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Implementation of Speculative State

[Figure: the same cache (Tag, State, Data) extended with two extra columns per line: SL (Speculatively Loaded) and SM (Speculatively Modified); all entries are empty.]

modest amount of extra space

Page 36: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Life Cycle of an Epoch

[Figure: epoch timeline (Spawned; Becomes Speculative; Commit?; Complete, Pass Homefree) with Speculative Coherence and Mechanisms to Squash or Commit called out; this part covers Squash.]

Page 37: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

When Speculation Fails

Cache contents:

Tag   State   Data   SL   SM
*     Sp Ex   *      1    0
*     Sp Sh   *      1    0
*     Sp Ex   *      0    1
*     Sp Sh   *      1    1

Annotation: Flash Reset

Page 38: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

When Speculation Fails

Cache contents:

Tag   State    Data   SL   SM
*     Excl     *      0    0
*     Shared   *      0    0
*     Sp Ex    *      0    1
*     Sp Sh    *      0    1

Annotation (SM): If Set then Invalidate; Flash Reset

Page 39: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

When Speculation Fails

Cache contents:

Tag   State     Data   SL   SM
*     Excl      *      0    0
*     Shared    *      0    0
*     Invalid   *      0    0
*     Invalid   *      0    0

quick bit operation
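The squash sequence above amounts to a pass over the SL and SM bits. Here is a minimal software sketch of that pass, assuming a cache is a list of dictionaries and using the slide's state names; in hardware the bit clears are a flash reset rather than a loop. The function names `squash` and `example_lines` are hypothetical.

```python
def squash(lines):
    """Discard an epoch's speculative state after a violation."""
    for line in lines:
        if line["SM"]:
            # Speculatively modified data must never become visible.
            line["state"] = "Invalid"
        elif line["state"] == "Sp Ex":
            line["state"] = "Excl"      # only loaded: the copy is still valid
        elif line["state"] == "Sp Sh":
            line["state"] = "Shared"
        line["SL"] = line["SM"] = 0     # flash reset of both bits
    return lines


def example_lines():
    """The four cache lines shown on the slide, before the squash."""
    return [
        {"state": "Sp Ex", "SL": 1, "SM": 0},
        {"state": "Sp Sh", "SL": 1, "SM": 0},
        {"state": "Sp Ex", "SL": 0, "SM": 1},
        {"state": "Sp Sh", "SL": 1, "SM": 1},
    ]


if __name__ == "__main__":
    for line in squash(example_lines()):
        print(line)
```

Lines that were only speculatively loaded survive as ordinary Excl or Shared lines, so a retried epoch does not start with a cold cache.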

Page 40: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Life Cycle of an Epoch

[Figure: epoch timeline (Spawned; Becomes Speculative; Commit?; Complete, Pass Homefree) with Speculative Coherence and Mechanisms to Squash or Commit called out; this part covers Commit.]

Page 41: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

When Speculation Succeeds

Cache contents:

Tag   State   Data   SL   SM
*     Sp Ex   *      1    0
*     Sp Sh   *      1    0
*     Sp Ex   *      0    1
*     Sp Sh   *      1    1

Annotation: Flash Reset

Page 42: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

When Speculation Succeeds

Cache contents:

Tag   State    Data   SL   SM
*     Excl     *      0    0
*     Shared   *      0    0
*     Sp Ex    *      0    1
*     Sp Sh    *      0    1

SM & Exclusive: Become Dirty

Page 43: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

When Speculation Succeeds

Cache contents:

Tag   State    Data   SL   SM
*     Excl     *      0    0
*     Shared   *      0    0
*     Sp Ex    *      0    1
*     Sp Sh    *      0    1

SM & Shared: Need Exclusive Access

want to avoid searching entire cache

Page 44: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

When Speculation Succeeds

Cache contents:

Tag   State    Data   SL   SM
*     Excl     *      0    0
*     Shared   *      0    0
*     Sp Ex    *      0    1
X     Sp Sh    *      0    1

ORB entries: -, -, X

ownership required buffer (ORB)

Page 45: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

When Speculation Succeeds

Cache contents:

Tag   State    Data   SL   SM
*     Excl     *      0    0
*     Shared   *      0    0
*     Sp Ex    *      0    1
X     Sp Sh    *      0    1

ORB entries: -, -, X

Message: Upgrade-Request (X)

Page 46: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

When Speculation Succeeds

Cache contents:

Tag   State    Data   SL   SM
*     Excl     *      0    0
*     Shared   *      0    0
*     Sp Ex    *      0    1
X     Sp Sh    *      0    1

ORB entries: -, -, -

Messages: Upgrade-Request (X), Ack (X)

Annotation (SM): If SM, Become Dirty; Flash Reset

Page 47: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

When Speculation Succeeds

Cache contents:

Tag   State    Data   SL   SM
*     Excl     *      0    0
*     Shared   *      0    0
*     Dirty    *      0    0
X     Dirty    *      0    0

ORB entries: -, -, -

flush the ORB, then quick bit operations
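The commit sequence can be sketched the same way: the ORB lists the lines that are speculatively modified but only shared, so ownership can be requested for exactly those lines without searching the cache. This is a simplified model, assuming the upgrade callback returns only once the acknowledgement has arrived; the tags "A", "B", "C" stand in for the unnamed lines on the slide, and `commit` and `send_upgrade_request` are hypothetical names.

```python
def commit(lines, orb, send_upgrade_request):
    """Make an epoch's speculative modifications visible."""
    # 1) Flush the ORB: gain exclusive access to each SM-and-shared line.
    for tag in list(orb):
        send_upgrade_request(tag)        # assumed to block until the Ack
        lines[tag]["state"] = "Sp Ex"
    orb.clear()
    # 2) Quick bit operations over the whole cache.
    for line in lines.values():
        if line["SM"]:
            line["state"] = "Dirty"      # now an ordinary modified line
        elif line["state"] == "Sp Ex":
            line["state"] = "Excl"
        elif line["state"] == "Sp Sh":
            line["state"] = "Shared"
        line["SL"] = line["SM"] = 0
    return lines


if __name__ == "__main__":
    cache = {
        "A": {"state": "Sp Ex", "SL": 1, "SM": 0},
        "B": {"state": "Sp Sh", "SL": 1, "SM": 0},
        "C": {"state": "Sp Ex", "SL": 0, "SM": 1},
        "X": {"state": "Sp Sh", "SL": 1, "SM": 1},
    }
    sent = []
    commit(cache, ["X"], sent.append)
    print(sent, {tag: line["state"] for tag, line in cache.items()})
```

In this run only one upgrade request is sent, for X; the remaining work is the bit-clearing pass.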

Page 48: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Forwarding Data Between Epochs

• predictable dependences cause frequent violations
• compiler inserts wait-signal synchronization

[Figure: two cases side by side. Without forwarding, one epoch's Store X and a later epoch's Load X are unsynchronized. With forwarding, the earlier epoch performs Store X then Signal, and the later epoch performs Wait then Load X.]

synchronize to avoid violations
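The wait-signal pattern can be illustrated with ordinary threads. This sketch uses Python's `threading.Event` as a stand-in for the compiler-inserted primitives, which the slides do not name; the function `run_with_forwarding` and the dictionaries it uses are hypothetical.

```python
import threading


def run_with_forwarding():
    """Later epoch waits for the forwarded value instead of loading early."""
    forwarded = {}
    ready = threading.Event()
    result = {}

    def earlier_epoch():
        forwarded["X"] = 3            # Store X
        ready.set()                   # Signal

    def later_epoch():
        ready.wait()                  # Wait: blocks until the signal arrives
        result["X"] = forwarded["X"]  # Load X now sees the forwarded value

    # Start the consumer first to show that ordering comes from the
    # synchronization, not from the order in which threads are launched.
    consumer = threading.Thread(target=later_epoch)
    producer = threading.Thread(target=earlier_epoch)
    consumer.start()
    producer.start()
    consumer.join()
    producer.join()
    return result["X"]


if __name__ == "__main__":
    print(run_with_forwarding())
```

Because the load cannot run before the signal, the later epoch never reads a stale X, and no violation (and hence no retry) is triggered for this dependence.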

Page 49: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Outline

• Details of our Approach
• Performance (current section)
   – simulation infrastructure
   – single-chip multiprocessor performance
   – scaling beyond chip boundaries
• Conclusions

Page 50: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Simulation Infrastructure

Compiler system and tools based on SUIF
– help analyze dependences, insert synchronization
– produce MIPS binaries containing TLS primitives

Benchmarks (all run to completion)
– buk, compress95, ijpeg, equake

Simulator
– superscalar, similar to MIPS R10K
– models all bandwidth and contention

[Figure: simulated machine with processors (P) and caches (C) connected by a crossbar.]

detailed simulation!

Page 51: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Performance on a 4-Processor CMP

[Bar chart: speedup (axis 0 to 2.5) per benchmark]

Benchmark     Region speedup   Parallel coverage
buk           2.26             56.6%
compress95    1.27             47.3%
equake        1.77             39.3%
ijpeg         1.94             22.1%

Page 52: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Performance on a 4-Processor CMP

[Bar chart: speedup (axis 0 to 2.5) per benchmark, region and program]

Benchmark     Region speedup   Program speedup   Parallel coverage
buk           2.26             1.46              56.6%
compress95    1.27             1.12              47.3%
equake        1.77             1.21              39.3%
ijpeg         1.94             1.08              22.1%

program speedups are limited by coverage
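One way to see how coverage limits program speedup is an Amdahl's-law estimate: only the covered fraction of execution time is accelerated by the region speedup. The sketch below applies that formula to the numbers on this slide, assuming the values appear in the order buk, compress95, equake, ijpeg. It is an approximation offered for illustration, not the authors' methodology, and the names are hypothetical.

```python
def amdahl_program_speedup(coverage, region_speedup):
    """Whole-program speedup when only `coverage` of the time is sped up."""
    return 1.0 / ((1.0 - coverage) + coverage / region_speedup)


# (parallel coverage, region speedup, reported program speedup)
# Assumption: slide values are listed in this benchmark order.
SLIDE_DATA = {
    "buk":        (0.566, 2.26, 1.46),
    "compress95": (0.473, 1.27, 1.12),
    "equake":     (0.393, 1.77, 1.21),
    "ijpeg":      (0.221, 1.94, 1.08),
}

if __name__ == "__main__":
    for name, (coverage, region, reported) in SLIDE_DATA.items():
        estimate = amdahl_program_speedup(coverage, region)
        print(f"{name:11s} estimate {estimate:.2f}  reported {reported:.2f}")
```

Under that ordering assumption, the estimate stays within 0.05 of the reported program speedup for every benchmark, with ijpeg showing the largest gap.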

Page 53: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Varying the Number of Processors

[Figure: normalized region execution time as the number of processors varies.]

buk and equake are memory-bound

compress95 and ijpeg are computation-intensive

Page 54: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Varying the Number of Processors

[Figure: normalized region execution time as the number of processors varies.]

buk and equake scale well

passing the homefree token is not a bottleneck

Page 55: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Performance of the ORB (on a 4-CMP)

Application   Average Flush Latency (cycles)   ORB Size, Average (entries)   ORB Size, Maximum (entries)
buk           13.95                            2.38                          9
compress95    0.04                             0.01                          8
equake        0.13                             0.04                          12
ijpeg         1.06                             0.17                          5

a small ORB is sufficient

Page 56: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Tracking Dependences Per Cache Line

Problem:
– analogous to false sharing: false violations
– write-after-write dependences also cause violations
   • but not a true dependence!

Solution:
– track dependences at a word granularity

is per-word state worth the extra overhead?

Page 57: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Tracking Dependences Per Cache Line

Does it do any good?
– not for our 4 benchmarks
– adding this support showed no improvement

Why not?
– buk and equake have random access patterns
– compress95 is heavily synchronized
– ijpeg is unrolled to avoid false sharing

existing techniques for avoiding false sharing can address this problem
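The difference between per-line and per-word tracking can be shown with a small check. In this sketch an earlier epoch stores to one word and a later epoch has already loaded another word; the line size of 8 words is a hypothetical value chosen for illustration, as are the names `WORDS_PER_LINE` and `is_violation`.

```python
WORDS_PER_LINE = 8   # hypothetical line size, in words


def is_violation(store_word, load_word, granularity):
    """Would an earlier epoch's store squash a later epoch's earlier load?"""
    if granularity == "line":
        # Any two words in the same cache line are treated as dependent.
        return store_word // WORDS_PER_LINE == load_word // WORDS_PER_LINE
    if granularity == "word":
        # Only a load of the stored word itself is a true dependence.
        return store_word == load_word
    raise ValueError(f"unknown granularity: {granularity}")


if __name__ == "__main__":
    # Words 8 and 9 share a line but are different words.
    print(is_violation(8, 9, "line"), is_violation(8, 9, "word"))
```

The first case is a false violation: per-line state squashes the later epoch even though the two epochs touched different words.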

Page 58: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Scaling Beyond Chip Boundaries

[Figure: two nodes, each with processors (P) and caches (C) joined by a crossbar, both attached to shared memory; the diagram is annotated "200 Cycles".]

simulate architectures with 1, 2 and 4 nodes

Page 59: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Scaling Beyond Chip Boundaries

[Figure: normalized region execution time on multi-chip configurations.]

multi-chip systems benefit from TLS

Page 60: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Scaling Beyond Chip Boundaries

[Figure: normalized region execution time on multi-chip configurations.]

our scheme scales well

Page 61: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Conclusions

The overheads of our scheme are low:
– mechanisms to squash or commit are not a bottleneck
– per-word speculative state is not always necessary

It offers compelling performance improvements:
– program speedups from 8% to 46% on a 4-processor CMP
– program speedups up to 75% on multi-chip architectures

It is scalable:
– coherence provides elegant data dependence tracking

seamless TLS on a wide range of architectures

Page 62: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Backup Slides

Page 63: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

The I-ACOMA Scalable Approach

The I-ACOMA approach is hierarchical
– Memory Disambiguation Table (MDT)
   • structure used to detect data dependence violations
– scalable hardware support using a hierarchy of MDTs
– hierarchical ordering of threads
   • one level inside each multiprocessor chip
   • another level across chips

Our approach is flat
– speculation occurs along a flat speculation level

Page 64: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Underlying Architecture

[Figure: processors (P) with caches (C) and memories (M) connected by an interconnection network; the speculation level is marked.]

focus on the level where coherence begins

Page 65: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Underlying Architecture

[Figure: four processors (P) and two caches (C) above shared memory; the speculation level is marked.]

focus on the level where coherence begins

Page 66: A Scalable Approach to  Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,

Speculation in a Shared Cache

Why?

1) Shared-cache multithreaded architectures
   • e.g., simultaneous multithreading
2) Context switch to another chain of speculation
3) Start new epoch while current epoch waits to commit

How?

replicate the speculative context


Support for Speculation in a Shared Cache

replicate the speculative context

[Figure: Processor and Cache; each cache line shows Tag, State, and Data fields together with SL and SM bits, and an ORB]


Support for Speculation in a Shared Cache

[Figure: the same Processor and Cache, now with two copies of the SL bits, SM bits, and ORB]

replicate the speculative context


Preserving Correctness

Speculation must fail whenever speculative state is lost
– e.g., replacement of a speculative line, ORB overflow

Any exceptions are suppressed until epoch is homefree
– e.g., divide by zero, segfault

Polling violation detection must avoid infinite looping
– requires a poll inside each loop

No system calls while speculative (for now)

ensures original sequential semantics are preserved


Epoch Numbers

Represent a partial ordering
– signed-compare sequence numbers if TIDs match
  • allows for wrap-around
– otherwise the epochs are unordered
  • from independent programs
  • from independent chains of speculation within one program

Epoch number fields: Thread Identifier (TID) | Sequence Number
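The comparison rule above can be sketched in a few lines of Python. This is only an illustration: the 8-bit sequence-number width (`SEQ_BITS`), the `(tid, seq)` tuple layout, and the function names are assumptions made for the example, not details from the slides.

```python
SEQ_BITS = 8  # assumed field width, chosen only for illustration


def signed_diff(a, b, bits=SEQ_BITS):
    """Return (a - b) interpreted as a signed two's-complement value of `bits` bits."""
    mask = (1 << bits) - 1
    d = (a - b) & mask
    if d >= 1 << (bits - 1):
        d -= 1 << bits
    return d


def compare_epochs(e1, e2):
    """Compare two (tid, seq) epoch numbers.

    Returns -1 if e1 is logically earlier, 1 if it is later, 0 if the two
    are equal, and None if the epochs are unordered (different TIDs).
    """
    tid1, seq1 = e1
    tid2, seq2 = e2
    if tid1 != tid2:
        return None  # independent programs or independent speculation chains
    d = signed_diff(seq1, seq2)
    return (d > 0) - (d < 0)
```

With these assumptions, `compare_epochs((1, 254), (1, 2))` reports that sequence number 254 is logically earlier than 2, which is the wrap-around case. The signed comparison only gives a sensible answer while the live epochs of one TID span less than half of the sequence-number space.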


Speculative Thread Model

Round-robin schedule of epochs to processors
– not a requirement of our scheme, just for convenience

Each epoch spawns the next
– through a lightweight fork instruction

Violations detected through polling
– each epoch runs to completion before detecting failed speculation and restarting

Violation chaining
– if an epoch suffers a violation, we squash all logically-later epochs

many possibilities to be evaluated in future work
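Two of the policies above, round-robin scheduling and violation chaining, are simple enough to sketch. The Python below is a toy model: epochs are plain integers in logical order, and the helper names `round_robin` and `handle_violation` are hypothetical, introduced only for this example.

```python
def round_robin(num_epochs, num_procs):
    """Map each epoch index to a processor index in round-robin order."""
    return {epoch: epoch % num_procs for epoch in range(num_epochs)}


def handle_violation(violated, in_flight):
    """Violation chaining: the violated epoch restarts and every
    logically-later in-flight epoch is squashed.

    Returns (epoch_to_restart, sorted_list_of_squashed_epochs).
    """
    later = sorted(e for e in in_flight if e > violated)
    return violated, later


schedule = round_robin(6, 4)          # epochs 0..5 on 4 processors
restart, squashed = handle_violation(3, [2, 3, 4, 5])
```

Here epoch 2 is logically earlier than the violated epoch 3, so it is left alone, while epochs 4 and 5 are squashed.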


Multiple Writers Example

             SM[]       Data
Original     0 0 0 0    A B C D
Epoch i      1 0 0 1    E B C F
Epoch i+1    1 0 1 0    G B H D
Committed    0 0 0 0    G B H F

combine speculatively modified lines at commit time
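The combining step in this example can be reproduced with a short Python sketch. It assumes one SM bit per word and that epochs commit in logical order, so a later epoch's write to the same word wins; the function name `commit` and the list-based line representation are illustrative choices, not the hardware mechanism itself.

```python
def commit(line, epochs):
    """Merge speculatively modified words into a cache line.

    `line` is the original list of words; `epochs` is a list of
    (sm_bits, data) pairs in logical order, earliest first. Only words
    whose SM bit is set are copied, so later epochs override earlier ones.
    """
    result = list(line)
    for sm_bits, data in epochs:
        for k, modified in enumerate(sm_bits):
            if modified:
                result[k] = data[k]
    return result


original = ["A", "B", "C", "D"]
epoch_i = ([1, 0, 0, 1], ["E", "B", "C", "F"])
epoch_i1 = ([1, 0, 1, 0], ["G", "B", "H", "D"])
committed = commit(original, [epoch_i, epoch_i1])  # ["G", "B", "H", "F"]
```

The result matches the Committed row: word 0 takes G from epoch i+1 (overriding E), word 2 takes H, and word 3 keeps F from epoch i.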


Pipeline Parameters

Issue Width            4
Functional Units       2 Int, 2 FP, 1 Mem, 1 Bra
Reorder Buffer Size    32
Integer Multiply       12 cycles
Integer Divide         76 cycles
All Other Integer      1 cycle
FP Divide              15 cycles
FP Square Root         20 cycles
All Other FP           2 cycles
Branch Prediction      GShare (16KB, 8 history bits)


Memory Parameters

Cache Line Size                              32B
Instruction Cache                            32KB, 4-way set-assoc
Data Cache                                   32KB, 2-way set-assoc, 2 banks
Unified Secondary Cache                      2MB, 4-way set-assoc, 4 banks
Miss Handlers                                8 for data, 2 for insts
Crossbar Interconnect                        8B per cycle per bank
Minimum Miss Latency to Secondary Cache      10 cycles
Minimum Miss Latency to Local Memory         75 cycles
Main Memory Bandwidth                        1 access per 20 cycles
Intra-Chip Communication Latency             10 cycles
Inter-Chip Communication Latency             200 cycles


Benchmark Details: Regions and Epochs

Application    Unrolling Factor    Avg. Insts. per Epoch    Parallel Coverage
buk            8                   81.0                     22.8%
               8                   135.0                    33.8%
compress95     1                   196.7                    24.6%
               1                   240.4                    22.7%
ijpeg          32                  1467.9                   8.2%
               1                   80.8                     2.2%
               1                   84.0                     5.0%
               1                   100.3                    6.7%
equake         1                   2925.5                   39.3%


Performance on a 4-Processor CMP

Application    Overall Region Speedup    Parallel Coverage    Program Speedup
buk            2.26                      56.6%                1.46
compress95     1.27                      47.3%                1.12
equake         1.77                      39.3%                1.21
ijpeg          1.94                      22.1%                1.08

program speedups are limited by coverage
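One way to see why coverage limits program speedup is an Amdahl's-law estimate: only the covered fraction of execution is accelerated by the region speedup. The sketch below applies that standard formula to the table's columns. Treating the columns as related in exactly this way is an assumption of the example, and `program_speedup` is a hypothetical helper name.

```python
def program_speedup(coverage, region_speedup):
    """Amdahl's-law estimate: the fraction `coverage` of execution runs
    `region_speedup` times faster; the remainder is unchanged."""
    return 1.0 / ((1.0 - coverage) + coverage / region_speedup)


buk = program_speedup(0.566, 2.26)       # about 1.46
equake = program_speedup(0.393, 1.77)    # about 1.21
compress95 = program_speedup(0.473, 1.27)  # about 1.11
ijpeg = program_speedup(0.221, 1.94)     # about 1.12
```

The estimate reproduces the reported program speedups for buk and equake and is within 0.01 for compress95. For ijpeg it gives about 1.12 against the reported 1.08, so this simple model should be read as a rough guide only.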


Varying the Number of Processors

[Figure: y-axis is Normalized Aggregate Cycles in Regions]


TLS Overheads

Application    Dynamic Instruction Overhead    Misses to Other Caches
buk            5.3%                            34.47%
compress95     30.6%                           3.02%
equake         3.7%                            1.67%
ijpeg          7.0%                            65.00%

buk and ijpeg can benefit greatly from improved locality


Violation Statistics

[Figure: y-axis is Percent of Violations]

speculative invalidation gives early notice of a violation


Impact of Communication Latency

[Figure: y-axis is Normalized Region Execution Time]

speedups still possible with higher latencies


Speculative Invalidation of Non-Speculative Cache Lines

[Figure: y-axis is Normalized Region Execution Time]

a worthwhile enhancement of our baseline scheme