
Page 1

Dynamic Performance Tuning for Speculative Threads

Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre†,

Ankit Tarkas†, Wei-Chung Hsu, and Antonia Zhai

Dept. of Computer Science and Engineering
†Dept. of Electrical and Computer Engineering

University of Minnesota – Twin Cities

Page 2

Motivation

Multicore processors offer a large amount of potential computing power

Exploiting this potential demands thread-level parallelism

Intel Core i7 die photo

Page 3

Exploiting Thread-Level Parallelism

Potentially more parallelism with speculation

[Figure: sequential execution vs. traditional parallelization vs. Thread-Level Speculation (TLS) for code with a store to *p followed by a load from *q. Traditional parallelization must prove p != q; when the dependence is unpredictable, the compiler gives up. TLS executes the threads in parallel speculatively: if p != q at runtime, the parallel execution succeeds; if p == q, the speculative load reads the wrong value, speculation fails, and the violating thread re-executes.]
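A minimal C sketch of the kind of loop this figure describes (the loop and array names are hypothetical, not from the paper):

    /* The compiler cannot prove that a[dst[i]] and a[src[j]] never alias
     * across iterations, so traditional parallelization gives up.  TLS runs
     * iterations as speculative threads and squashes/re-executes only the
     * iterations where the store (*p) and a later load (*q) actually touch
     * the same element. */
    void update(double *a, const int *dst, const int *src, int n)
    {
        for (int i = 0; i < n; i++) {
            double v = a[src[i]];   /* "Load *q": may read an element ...      */
            a[dst[i]] = v * 1.5;    /* "Store *p": ... that another iteration
                                       writes; p == q only for some inputs.    */
        }
    }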

Page 4

Addressing Unpredictability

State-of-the-art approach: profiling-directed optimization

[Figure: a typical compilation framework. Source code plus a profiling run produce profiling information; compiler analysis then performs region selection, speculative code optimization, and parallel code generation to emit machine code, with a feedback path back into the compiler.]

Page 5

Problem #1

[Figure: execution timeline of threads 0-4, illustrating the TLS overheads: spawning a thread, inter-thread synchronization (wait/signal around a dependent load), speculation failure, load imbalance, and waiting for permission to commit.]

Hard to model TLS overhead statically, even with profiling

Page 6

Problem #2

[Figure: the training input determines the profiling behavior, while the actual input determines the actual behavior; the two can differ.]

Profile-based decision may not be appropriate for actual input

Page 7

Problem #3

[Figure: a performance metric of the same code plotted over time, showing three distinct phases (phase1, phase2, phase3).]

Behavior of the same code changes over time

Page 8

Problem #4

[Figure: sequential execution on one core with a single L1 cache vs. parallel execution on four cores with private L1 caches. Speculative threads can prefetch for each other (Load *p=0x88 on one core followed by Load *p'=0x88 on another), but can also cause extra misses and cache pollution (Load 0x22 hits in the sequential cache yet misses in the parallel caches).]

How important are cache impacts?

Page 9

Impact on Cache Performance

[Chart: execution time of a major loop from benchmark art (scanner.c:594), normalized to SEQ and broken into Cache, Squash, and Other cycles, for SEQ vs. TLS-4core. The speculative loads of *p and *p' reference the same address (p == p'), so one thread prefetches for the other; the Cache component is annotated at 60%, and TLS-4core runs about 2.5X faster.]

Cache impact is difficult to model statically

Page 10

Our Proposal: a Dynamic Performance Tuning Framework

[Figure: the proposed framework splits work between compiler analysis and runtime analysis. The compiler still performs region selection, speculative code optimization, and parallel code generation from the source code and profiling information, and attaches compiler annotations to the machine code. At runtime, dynamic execution supplies runtime information to a performance evaluation stage backed by a performance profile table; the tuning policy uses this accurate timing and cache-impact information to adapt to actual inputs and phase changes.]

Page 11

Parallelization Decisions at Runtime

[Figure: the new execution model. A region starts in parallel execution; performance is evaluated at thread-spawning points and at the end of each parallel execution, with results recorded in the performance profile table. Based on that evaluation, the runtime either keeps spawning speculative threads (parallelize) or falls back to sequential execution (serialize).]

New execution model facilitates dynamic tuning
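A minimal C sketch of this decision loop, with illustrative names and fields that are not the paper's exact interface:

    /* One entry of a (hypothetical) performance profile table: the runtime
     * compares the measured parallel time of a region with its predicted
     * sequential time and decides whether to keep spawning threads. */
    enum decision { PARALLELIZE, SERIALIZE };

    struct region_profile {
        long parallel_cycles;       /* measured cycles of the last parallel run */
        long predicted_seq_cycles;  /* predicted sequential cycles (Page 13)    */
        enum decision current;
    };

    /* Called at spawn points and at the end of each parallel execution. */
    static enum decision evaluate(struct region_profile *p)
    {
        p->current = (p->parallel_cycles < p->predicted_seq_cycles)
                         ? PARALLELIZE : SERIALIZE;
        return p->current;
    }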

Page 12

Evaluating Performance Impact (Runtime Analysis)

Simple metric: cost of speculative failure

Comprehensive evaluation: sequential performance prediction

[Chart from Page 9 repeated: Cache / Squash / Other execution-time breakdown for SEQ vs. TLS-4core.]

[Figure: the framework diagram with the Performance Evaluation stage highlighted, connecting Runtime Information, Dynamic Execution, the Performance Profile Table, and the Tuning Policy.]

Page 13

Sequential Performance Prediction

[Figure: during speculative parallel execution, a programmable cycle classifier uses information from the reorder buffer (ROB) to tick hardware counters, classifying each cycle as Busy, iFetch, ExeStall, CacheBusy, or TLS overhead. Summing the non-overhead categories predicts the sequential execution time; comparing it with the measured parallel time gives the performance impact.]
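A sketch, in C, of how such counters could be combined; the field names are illustrative rather than the hardware's actual registers:

    /* Cycle-classifier counters, aggregated over all speculative threads. */
    struct cycle_counters {
        long busy, ifetch, exe_stall, cache_busy, tls_overhead;
    };

    /* Predicted sequential time: every classified cycle except TLS overhead
     * approximates work the sequential version would also have performed. */
    static long predicted_sequential(const struct cycle_counters *c)
    {
        return c->busy + c->ifetch + c->exe_stall + c->cache_busy;
    }

    /* Positive when speculative parallel execution is actually winning. */
    static long performance_impact(const struct cycle_counters *c, long parallel_cycles)
    {
        return predicted_sequential(c) - parallel_cycles;
    }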

Page 14

Cache Behavior: Scenario #1

[Figure: under TLS, thread T0 loads p (unmodified data) and misses, and is then squashed; when T0 re-executes, the same load hits because the squashed run already fetched the line. A prediction based only on the committed run would record a hit, yet sequential execution (SEQ) would have missed. Fix: count the miss observed in the squashed execution toward the predicted sequential behavior.]

Page 15

Cache Behavior: Scenario #2

[Figure: under TLS, T0 loads p and misses; a store to p by another thread invalidates the line in T0's cache, so T0's next load of p misses again. Counting both would predict two sequential misses, but sequential execution misses only once (load p miss, store p, load p hit). Fix: do not count a miss toward the predicted sequential behavior when the cache tag matches but the line is in the invalid state.]
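A short C sketch of this miss-accounting rule, covering both scenarios; the structure and names are illustrative, not the hardware's exact logic:

    #include <stdbool.h>

    struct miss_event {
        bool during_squashed_run;  /* miss taken in work that was later squashed */
        bool tag_matched_invalid;  /* line present but in Invalid state, i.e. an
                                      invalidation-induced coherence miss        */
    };

    /* Does this TLS miss count toward the predicted sequential miss count? */
    static bool counts_toward_sequential(const struct miss_event *m)
    {
        /* Scenario #2: coherence misses would not occur in sequential execution. */
        if (m->tag_matched_invalid)
            return false;
        /* Scenario #1: misses in squashed work still count -- the re-execution
         * only hits because the squashed run already fetched the line. */
        return true;
    }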

Page 16

Tuning the Performance (Runtime Analysis)

Search strategy: exhaustive search vs. prioritized search

[Figure: the framework diagram with the Tuning Policy stage highlighted, connecting Runtime Information, Dynamic Execution, the Performance Profile Table, and Performance Evaluation.]

Page 17

How to Prioritize the Search

Two design axes:
· Search order: Inside-Out, Compiler suggestion, or Outside-In
· Performance summary metric: "speedup or not?" vs. improvement in cycles

Resulting policies (increasingly aggressive):
· Simple: "speedup or not?" with Inside-Out search (suboptimal)
· Quantitative: improvement in cycles with Inside-Out search
· Quantitative+StaticHint: improvement in cycles with compiler suggestions
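A hypothetical illustration, in C, of how nested-loop candidates could be ordered for the prioritized search (the structure and function names are mine, not the paper's):

    #include <stdlib.h>

    struct loop_candidate {
        const char *name;   /* e.g., a source location such as "scanner.c:594" */
        int nest_depth;     /* larger = more deeply nested (inner) loop         */
        int compiler_hint;  /* nonzero if the compiler suggests this level      */
    };

    /* Compiler-suggested levels first, then the rest Inside-Out (deepest first). */
    static int by_priority(const void *a, const void *b)
    {
        const struct loop_candidate *x = a, *y = b;
        if (x->compiler_hint != y->compiler_hint)
            return y->compiler_hint - x->compiler_hint;
        return y->nest_depth - x->nest_depth;
    }

    /* Usage: qsort(cands, n, sizeof cands[0], by_priority); then try the levels
     * in order, keeping the first one whose measured improvement is positive. */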

Page 18

Evaluation Infrastructure

Benchmarks
· SPEC2000, written in C, -O3 optimization

Underlying architecture
· 4-core chip multiprocessor (CMP)
· speculation supported by the coherence protocol

Simulator
· superscalar with a detailed memory model
· simulates communication latency
· models bandwidth and contention

Detailed, cycle-accurate simulation

[Figure: a 4-core CMP; each processor (P) has a private cache (C), connected through an interconnect.]

Page 19

Comparing Tuning Policies

[Chart: speedup over SEQ for the SPEC2000 benchmarks (ammp, art, bzip2, crafty, equake, gap, gcc, gzip, mcf, mesa, parser, perlbmk, twolf, vortex, vpr-p, vpr-r) and their geometric mean (G.M.) under the three tuning policies. Geometric means: Simple 1.17x, Quantitative 1.23x, Quantitative+StaticHint 1.37x. The chart also marks parallel code overhead.]

Page 20

Profile-Based vs. Dynamic

[Chart: speedup over SEQ for the SPEC2000 benchmarks and their geometric mean, comparing the profile-based approach with the dynamic Quant+StaticHint policy, and each scheme normalized to its own sequential code (ProfileBased/ProfileBasedSEQ and (Quant+StaticHint)/dynamicSEQ).]

Dynamic tuning outperformed the profile-based approach by ≈10% (1.37x vs. 1.25x), potentially going up to ≈15% (1.46x vs. 1.27x).

Page 21

Related Work: Region Selection

Statically (lack of adaptability):
· Compiler Heuristics [Vijaykumar Micro'98]
· Simple Profiling Pass [Liu PPoPP'06] (POSH)
· Extensive Profiling [Wang LCPC'05]
· Balanced Min-Cut [Johnson PLDI'04]
· Profile-Based Search [Johnson PPoPP'07]

Dynamically (lack of high-level information):
· Runtime Loop Detection [Tubella HPCA'98] [Marcuello ICS'98]
· Saving Power from Useless Re-spawning [Renau ICS'05]

This work: Dynamic Performance Tuning [Luo ISCA'09]

Page 22

Conclusions

Deciding how to extract speculative threads at runtime:
· quantitatively summarize the performance profile
· compiler hints avoid suboptimal decisions

Compiler and hardware work together:
· 37% speedup over sequential execution
· ≈10% speedup over profile-based decisions

Dynamic performance tuning can improve TLS efficiency

Configurable hardware performance monitors (HPMs) enable efficient dynamic optimizations