
Page 1

Dynamic Performance Tuning for Speculative Threads

Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre†,

Ankit Tarkas†, Wei-Chung Hsu, and Antonia Zhai

Dept. of Computer Science and Engineering
†Dept. of Electrical and Computer Engineering

University of Minnesota – Twin Cities

Page 2

Motivation

Multicore processors offer a large amount of potential computing power

Exploiting this potential demands thread-level parallelism

Intel Core i7 die photo

Page 3

Exploiting Thread-Level Parallelism

Potentially more parallelism with speculation

[Figure: sequential execution vs. traditional parallelization vs. Thread-Level Speculation (TLS) for code with a store to *p followed by a load from *q. Traditional parallelization must prove p != q; when the dependence is unpredictable, the compiler gives up. TLS executes the threads in parallel speculatively: if p != q at runtime, the parallel execution succeeds; if p == q, the speculative load reads the wrong value, speculation fails, and the violating thread re-executes.]
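A minimal C sketch of the kind of loop this figure describes (the loop and array names are hypothetical, not from the paper):

    /* The compiler cannot prove that a[dst[i]] and a[src[j]] never alias
     * across iterations, so traditional parallelization gives up.  TLS runs
     * iterations as speculative threads and squashes/re-executes only the
     * iterations where the store (*p) and a later load (*q) actually touch
     * the same element. */
    void update(double *a, const int *dst, const int *src, int n)
    {
        for (int i = 0; i < n; i++) {
            double v = a[src[i]];   /* "Load *q": may read an element ...      */
            a[dst[i]] = v * 1.5;    /* "Store *p": ... that another iteration
                                       writes; p == q only for some inputs.    */
        }
    }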

Page 4

Addressing Unpredictability

State-of-the-art approach: profiling-directed optimization

[Figure: a typical compilation framework. Source code plus a profiling run produce profiling information; compiler analysis then performs region selection, speculative code optimization, and parallel code generation to emit machine code, with a feedback path back into the compiler.]

Page 5

Problem #1

[Figure: execution timeline of threads 0-4, illustrating the TLS overheads: spawning a thread, inter-thread synchronization (wait/signal around a dependent load), speculation failure, load imbalance, and waiting for permission to commit.]

Hard to model TLS overhead statically, even with profiling

Page 6

Problem #2

[Figure: the training input determines the profiling behavior, while the actual input determines the actual behavior; the two can differ.]

Profile-based decision may not be appropriate for actual input

Page 7

Problem #3

[Figure: a performance metric of the same code plotted over time, showing three distinct phases (phase1, phase2, phase3).]

Behavior of the same code changes over time

Page 8

Problem #4

[Figure: sequential execution on one core with a single L1 cache vs. parallel execution on four cores with private L1 caches. Speculative threads can prefetch for each other (Load *p=0x88 on one core followed by Load *p'=0x88 on another), but can also cause extra misses and cache pollution (Load 0x22 hits in the sequential cache yet misses in the parallel caches).]

How important are cache impacts?

Page 9

Impact on Cache Performance

[Chart: execution time of a major loop from benchmark art (scanner.c:594), normalized to SEQ and broken into Cache, Squash, and Other cycles, for SEQ vs. TLS-4core. The speculative loads of *p and *p' reference the same address (p == p'), so one thread prefetches for the other; the Cache component is annotated at 60%, and TLS-4core runs about 2.5X faster.]

Cache impact is difficult to model statically

Page 10

Our Proposal: a Dynamic Performance Tuning Framework

[Figure: the proposed framework splits work between compiler analysis and runtime analysis. The compiler still performs region selection, speculative code optimization, and parallel code generation from the source code and profiling information, and attaches compiler annotations to the machine code. At runtime, dynamic execution supplies runtime information to a performance evaluation stage backed by a performance profile table; the tuning policy uses this accurate timing and cache-impact information to adapt to actual inputs and phase changes.]

Page 11

Parallelization Decisions at Runtime

[Figure: the new execution model. A region starts in parallel execution; performance is evaluated at thread-spawning points and at the end of each parallel execution, with results recorded in the performance profile table. Based on that evaluation, the runtime either keeps spawning speculative threads (parallelize) or falls back to sequential execution (serialize).]

New execution model facilitates dynamic tuning
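A minimal C sketch of this decision loop, with illustrative names and fields that are not the paper's exact interface:

    /* One entry of a (hypothetical) performance profile table: the runtime
     * compares the measured parallel time of a region with its predicted
     * sequential time and decides whether to keep spawning threads. */
    enum decision { PARALLELIZE, SERIALIZE };

    struct region_profile {
        long parallel_cycles;       /* measured cycles of the last parallel run */
        long predicted_seq_cycles;  /* predicted sequential cycles (Page 13)    */
        enum decision current;
    };

    /* Called at spawn points and at the end of each parallel execution. */
    static enum decision evaluate(struct region_profile *p)
    {
        p->current = (p->parallel_cycles < p->predicted_seq_cycles)
                         ? PARALLELIZE : SERIALIZE;
        return p->current;
    }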

Page 12

Evaluating Performance Impact (Runtime Analysis)

Simple metric: cost of speculative failure

Comprehensive evaluation: sequential performance prediction

[Chart from Page 9 repeated: Cache / Squash / Other execution-time breakdown for SEQ vs. TLS-4core.]

[Figure: the framework diagram with the Performance Evaluation stage highlighted, connecting Runtime Information, Dynamic Execution, the Performance Profile Table, and the Tuning Policy.]

Page 13

Sequential Performance Prediction

[Figure: during speculative parallel execution, a programmable cycle classifier uses information from the reorder buffer (ROB) to tick hardware counters, classifying each cycle as Busy, iFetch, ExeStall, CacheBusy, or TLS overhead. Summing the non-overhead categories predicts the sequential execution time; comparing it with the measured parallel time gives the performance impact.]
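A sketch, in C, of how such counters could be combined; the field names are illustrative rather than the hardware's actual registers:

    /* Cycle-classifier counters, aggregated over all speculative threads. */
    struct cycle_counters {
        long busy, ifetch, exe_stall, cache_busy, tls_overhead;
    };

    /* Predicted sequential time: every classified cycle except TLS overhead
     * approximates work the sequential version would also have performed. */
    static long predicted_sequential(const struct cycle_counters *c)
    {
        return c->busy + c->ifetch + c->exe_stall + c->cache_busy;
    }

    /* Positive when speculative parallel execution is actually winning. */
    static long performance_impact(const struct cycle_counters *c, long parallel_cycles)
    {
        return predicted_sequential(c) - parallel_cycles;
    }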

Page 14

Cache Behavior: Scenario #1

[Figure: under TLS, thread T0 loads p (unmodified data) and misses, and is then squashed; when T0 re-executes, the same load hits because the squashed run already fetched the line. A prediction based only on the committed run would record a hit, yet sequential execution (SEQ) would have missed. Fix: count the miss observed in the squashed execution toward the predicted sequential behavior.]

Page 15

Cache Behavior: Scenario #2

[Figure: under TLS, T0 loads p and misses; a store to p by another thread invalidates the line in T0's cache, so T0's next load of p misses again. Counting both would predict two sequential misses, but sequential execution misses only once (load p miss, store p, load p hit). Fix: do not count a miss toward the predicted sequential behavior when the cache tag matches but the line is in the invalid state.]
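A short C sketch of this miss-accounting rule, covering both scenarios; the structure and names are illustrative, not the hardware's exact logic:

    #include <stdbool.h>

    struct miss_event {
        bool during_squashed_run;  /* miss taken in work that was later squashed */
        bool tag_matched_invalid;  /* line present but in Invalid state, i.e. an
                                      invalidation-induced coherence miss        */
    };

    /* Does this TLS miss count toward the predicted sequential miss count? */
    static bool counts_toward_sequential(const struct miss_event *m)
    {
        /* Scenario #2: coherence misses would not occur in sequential execution. */
        if (m->tag_matched_invalid)
            return false;
        /* Scenario #1: misses in squashed work still count -- the re-execution
         * only hits because the squashed run already fetched the line. */
        return true;
    }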

Page 16

Tuning the Performance (Runtime Analysis)

Search strategy: exhaustive search vs. prioritized search

[Figure: the framework diagram with the Tuning Policy stage highlighted, connecting Runtime Information, Dynamic Execution, the Performance Profile Table, and Performance Evaluation.]

Page 17

How to Prioritize the Search

Two design axes:
· Search order: Inside-Out, Compiler suggestion, or Outside-In
· Performance summary metric: "speedup or not?" vs. improvement in cycles

Resulting policies (increasingly aggressive):
· Simple: "speedup or not?" with Inside-Out search (suboptimal)
· Quantitative: improvement in cycles with Inside-Out search
· Quantitative+StaticHint: improvement in cycles with compiler suggestions
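A hypothetical illustration, in C, of how nested-loop candidates could be ordered for the prioritized search (the structure and function names are mine, not the paper's):

    #include <stdlib.h>

    struct loop_candidate {
        const char *name;   /* e.g., a source location such as "scanner.c:594" */
        int nest_depth;     /* larger = more deeply nested (inner) loop         */
        int compiler_hint;  /* nonzero if the compiler suggests this level      */
    };

    /* Compiler-suggested levels first, then the rest Inside-Out (deepest first). */
    static int by_priority(const void *a, const void *b)
    {
        const struct loop_candidate *x = a, *y = b;
        if (x->compiler_hint != y->compiler_hint)
            return y->compiler_hint - x->compiler_hint;
        return y->nest_depth - x->nest_depth;
    }

    /* Usage: qsort(cands, n, sizeof cands[0], by_priority); then try the levels
     * in order, keeping the first one whose measured improvement is positive. */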

Page 18

Evaluation Infrastructure

Benchmarks
· SPEC2000, written in C, -O3 optimization

Underlying architecture
· 4-core chip multiprocessor (CMP)
· speculation supported by the coherence protocol

Simulator
· superscalar with a detailed memory model
· simulates communication latency
· models bandwidth and contention

Detailed, cycle-accurate simulation

[Figure: a 4-core CMP; each processor (P) has a private cache (C), connected through an interconnect.]

Page 19

Comparing Tuning Policies

[Chart: speedup over SEQ for the SPEC2000 benchmarks (ammp, art, bzip2, crafty, equake, gap, gcc, gzip, mcf, mesa, parser, perlbmk, twolf, vortex, vpr-p, vpr-r) and their geometric mean (G.M.) under the three tuning policies. Geometric means: Simple 1.17x, Quantitative 1.23x, Quantitative+StaticHint 1.37x. The chart also marks parallel code overhead.]

Page 20

Profile-Based vs. Dynamic

[Chart: speedup over SEQ for the SPEC2000 benchmarks and their geometric mean, comparing the profile-based approach with the dynamic Quant+StaticHint policy, and each scheme normalized to its own sequential code (ProfileBased/ProfileBasedSEQ and (Quant+StaticHint)/dynamicSEQ).]

Dynamic tuning outperformed the profile-based approach by ≈10% (1.37x vs. 1.25x), potentially going up to ≈15% (1.46x vs. 1.27x).

Page 21

Related Work: Region Selection

Statically (lack of adaptability):
· Compiler Heuristics [Vijaykumar Micro'98]
· Simple Profiling Pass [Liu PPoPP'06] (POSH)
· Extensive Profiling [Wang LCPC'05]
· Balanced Min-Cut [Johnson PLDI'04]
· Profile-Based Search [Johnson PPoPP'07]

Dynamically (lack of high-level information):
· Runtime Loop Detection [Tubella HPCA'98] [Marcuello ICS'98]
· Saving Power from Useless Re-spawning [Renau ICS'05]

This work: Dynamic Performance Tuning [Luo ISCA'09]

Page 22

Conclusions

Deciding how to extract speculative threads at runtime:
· quantitatively summarize the performance profile
· compiler hints avoid suboptimal decisions

Compiler and hardware work together:
· 37% speedup over sequential execution
· ≈10% speedup over profile-based decisions

Dynamic performance tuning can improve TLS efficiency

Configurable hardware performance monitors (HPMs) enable efficient dynamic optimizations