Thread-Level Speculation on a CMP Can Be Energy Efficient

Jose Renau
University of California at Santa Cruz
http://masc.soe.ucsc.edu

Karin Strauss, Luis Ceze, Wei Liu, Smruti Sarangi, James Tuck, and Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu

The 19th ACM International Conference on Supercomputing

Challenges

Wire delay: Not a single monolithic processor
Power: Energy-efficient design (simple cores are efficient)
Complexity: Very large block reuse

[Figure: Clock Reach]

Chip Multiprocessor with Thread-Level Speculation?

Thread-Level Speculation (TLS)

Sequential:

    for (i = 0; i < n; i++) {
        X[Y[i]] = X[Z[i]];
        ...
    }

Compilers cannot parallelize
TLS: Assume no dependences, hardware verifies

TLS Task A:

    for (i = 0; i < n/2; i++) {
        X[Y[i]] = X[Z[i]];
        ...
    }

TLS Task B:

    for (i = n/2; i < n; i++) {
        X[Y[i]] = X[Z[i]];
        ...
    }

Thread-Level Speculation (TLS)

TLS Hardware:
  Tracks data accesses at run-time
  Detects dependence violations
  Kills and restarts tasks

[Figure: execution timelines of tasks A and B for Sequential, TLS (no dep violations), and TLS (dep violation)]
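
The hardware behavior described on the two TLS slides can be imitated in software for the two-task loop. The following is only a minimal sketch under stated assumptions, not the hardware mechanism from the talk: the function name `run_loop_tls`, the array-length parameter `m`, and the choice to run Task B entirely before Task A (a worst-case ordering) are all hypothetical. Task B works on a private copy of `X`, its reads are recorded, Task A's writes are checked against those reads, and Task B is squashed and re-executed when a write hits a location it already read.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Executes X[Y[i]] = X[Z[i]] for i in [lo, hi), in program order. */
static void run_range(int *X, const int *Y, const int *Z, size_t lo, size_t hi)
{
    for (size_t i = lo; i < hi; i++)
        X[Y[i]] = X[Z[i]];
}

/*
 * Software model of the two-task split (hypothetical helper, not from the talk).
 * X has m elements; every entry of Y and Z is assumed to be an index in [0, m).
 * Returns 1 if Task B was squashed and re-executed, 0 if it committed without
 * a violation, and -1 if memory allocation failed.
 */
int run_loop_tls(int *X, size_t m, const int *Y, const int *Z, size_t n)
{
    size_t half = n / 2;
    int *spec = malloc(m * sizeof *spec);                 /* Task B's private view of X */
    bool *read_by_b = calloc(m, sizeof *read_by_b);       /* B read X[j] before writing it */
    bool *written_by_b = calloc(m, sizeof *written_by_b); /* B wrote X[j] */
    if (!spec || !read_by_b || !written_by_b) {
        free(spec);
        free(read_by_b);
        free(written_by_b);
        return -1;
    }
    memcpy(spec, X, m * sizeof *spec);

    /* Task B runs speculatively, before Task A has produced its results. */
    for (size_t i = half; i < n; i++) {
        if (!written_by_b[Z[i]])
            read_by_b[Z[i]] = true;
        spec[Y[i]] = spec[Z[i]];
        written_by_b[Y[i]] = true;
    }

    /* Task A is non-speculative; each of its writes is checked against B's reads. */
    bool violation = false;
    for (size_t i = 0; i < half; i++) {
        X[Y[i]] = X[Z[i]];
        if (read_by_b[Y[i]])
            violation = true;
    }

    if (violation) {
        run_range(X, Y, Z, half, n);   /* squash B, then restart it on A's results */
    } else {
        for (size_t j = 0; j < m; j++) /* commit B's buffered writes */
            if (written_by_b[j])
                X[j] = spec[j];
    }
    free(spec);
    free(read_by_b);
    free(written_by_b);
    return violation ? 1 : 0;
}
```

In either outcome `X` ends up identical to what the sequential loop would produce. The squash path simply repeats Task B's work, which is the wasted energy the rest of the talk is concerned with.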

Contributions

Contrary to common wisdom, TLS CMP can be energy-efficient

Identify the sources of energy waste in TLS
Propose novel energy-centric optimizations
Design energy-efficient memory hierarchy for TLS CMP

Main Results

TLS: 27% faster and 28% more energy
6-issue: 23% faster and 52% more energy

[Figure: configurations compared: 1 CPU 3-issue, 1 CPU 6-issue, and 4 CPUs 3-issue with TLS (TLS CMP)]

Energy Cost of TLS

Source of energy waste                        Why?
Task squash                                   Dependence violation
Additional storage & logic in memory system   Need to version data
Additional traffic in memory system           TLS-aware cache coherence protocol
Additional instructions                       Compiler overhead




Additional Storage & Logic

Cache lines are associated with tasks
Cache line tags are extended with Version ID

[Figure: cache line layout: Version ID | Tag | Data]

Messages between caches compare Version IDs
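
As an illustration of why the extended tag costs extra storage and logic, the line layout from the slide can be written down as a data structure. This is only a sketch: the type name `tls_cache_line`, the field widths, and the 64-byte line size are assumptions made for the example, not values from the talk.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64 /* assumed line size, for illustration only */

/* One cache line as drawn on the slide: Version ID | Tag | Data. */
typedef struct {
    uint32_t version_id;       /* task that owns this version (assumed width) */
    uint64_t tag;              /* conventional address tag */
    bool     valid;
    uint8_t  data[LINE_BYTES];
} tls_cache_line;

/*
 * A conventional lookup compares only the tag. With versioning, the
 * version ID has to match as well, which is the extra comparison that
 * every access and every message between caches pays for.
 */
bool tls_line_matches(const tls_cache_line *line, uint64_t tag, uint32_t version_id)
{
    return line->valid && line->tag == tag && line->version_id == version_id;
}
```

The same address can therefore be present several times, once per task version, and only the comparison on `version_id` tells the copies apart.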








Additional Traffic

More cache misses:
  Cannot displace speculative data
Handling multiple versions:
  Example: Find the correct version on cache miss
Detect data dependence violations across tasks:
  Need extra checks
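
The "find the correct version on cache miss" example can be sketched as a selection over the versions that respond to a miss. The rule used here is an assumption, since the slide does not spell it out: version IDs are taken to be ordered like the tasks, and the requester reads the version written by its closest predecessor (or by itself). The function name `find_correct_version` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Given the version IDs of all cached copies of one line, returns the index
 * of the copy a task with ID requester_id should read: the largest version
 * ID that does not exceed requester_id. Returns -1 if no such copy exists,
 * in which case the data would have to come from the next level of memory.
 */
int find_correct_version(const uint32_t *version_ids, size_t count, uint32_t requester_id)
{
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (version_ids[i] <= requester_id &&
            (best < 0 || version_ids[i] > version_ids[best]))
            best = (int)i;
    }
    return best;
}
```

Every copy of the line has to be examined before the choice can be made, which gives a feel for why a miss in a versioned memory system generates more traffic than a conventional one.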

Energy-Centric Optimizations

Substantial energy reduction
Overlooked in performance-centric designs

Source of energy waste       Energy-centric optimization
Task squash                  Stall after second restart; energy-aware profiling
Additional storage & logic   Avoid walking the cache
Additional traffic           ---
Additional instructions      Energy-aware profiling
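
One possible reading of "stall after second restart" is a per-task restart budget. The sketch below assumes that a squashed task may restart speculatively at most twice and is then held until it becomes non-speculative. The type `tls_task`, the function `tls_on_squash`, and the `max_restarts` parameter are hypothetical names introduced for the example.

```c
#include <stdbool.h>

typedef struct {
    int  restarts; /* how many times this task has been restarted so far */
    bool stalled;  /* true once the task must wait until it is non-speculative */
} tls_task;

/*
 * Called when a task is squashed. Returns true if the task may restart
 * speculatively, false if it should stall instead of burning more energy
 * on work that is likely to be squashed again.
 */
bool tls_on_squash(tls_task *task, int max_restarts)
{
    if (task->restarts < max_restarts) {
        task->restarts++;
        return true;
    }
    task->stalled = true;
    return false;
}
```

With `max_restarts` set to 2, the first two squashes lead to a restart and the third one stalls the task.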

Additional Instructions

Spawn & commit instructions inserted by compiler
Conventional compiler optimizations not so effective due to:
  code partitioning into tasks
  Live-ins spilling

~15% additional instructions

Design Philosophy

Reduce number of checks
Reduce cost of each check
Eliminate low-return work

Example: Energy-aware profiling
  Prune tasks that are expected to give minor speedups at high energy cost
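
The pruning rule in the energy-aware profiling example can be expressed as a simple filter over profiled tasks. This is a hedged sketch only: the talk does not give the metric or threshold, so the fields of `task_profile`, the ratio used, and the `min_benefit_per_energy` parameter are all assumptions.

```c
#include <stdbool.h>

/* Hypothetical per-task profile gathered in a profiling run. */
typedef struct {
    double cycles_saved; /* estimated execution time saved by speculating on this task */
    double energy_cost;  /* estimated extra energy spent (squashes, overhead instructions) */
} task_profile;

/*
 * Keeps a task only if the time it saves per unit of extra energy reaches
 * the given threshold. Tasks with minor speedups at high energy cost are
 * pruned, so that code runs non-speculatively instead.
 */
bool keep_task(const task_profile *p, double min_benefit_per_energy)
{
    if (p->energy_cost <= 0.0)
        return p->cycles_saved > 0.0; /* no extra energy: keep if it helps at all */
    return p->cycles_saved / p->energy_cost >= min_benefit_per_energy;
}
```

A performance-centric profiler would keep any task with a positive `cycles_saved`; the energy term in the denominator is what makes the filter energy-aware.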

Simulation Environment

Configurations: Uni-4i, Uni-6i, TLS4-3i

[Figure: 1 CPU 3-issue, 1 CPU 6-issue, and 4 CPUs 3-issue with TLS]

70nm @ 5GHz (same area approx.)
All processors have same pipeline depth
16KB L1 cache (1 cycle slower in TLS due to versioning)
1MB L2 cache on-chip


Performance

[Figure]

Power

[Figure]

Cost of TLS

[Figure]

Conclusions: TLS Power is Promising

[Figure: Energy vs. Performance for 3issue, 6issue, and TLS]

3issue to 6issue: 23% speedup, +87% power
3issue to TLS: 27% speedup, +59% power
6issue to TLS: 3% speedup, -15% power

Results for single thread applications (SPECint2000)
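
The power figures on this slide and the energy figures on the Main Results slide can be related with a back-of-the-envelope check, assuming energy is average power multiplied by execution time and that all figures are relative to the same baseline. The helper name `relative_energy` is hypothetical.

```c
/*
 * Relative energy = relative power * relative execution time.
 * Execution time relative to the baseline is 1 / speedup.
 */
double relative_energy(double relative_power, double speedup)
{
    return relative_power / speedup;
}
```

For the 6-issue processor, `relative_energy(1.87, 1.23)` is about 1.52, in line with the 52% energy increase quoted earlier. For TLS, `relative_energy(1.59, 1.27)` is about 1.25, close to but not exactly the quoted 28%; a gap of this size would be expected if the slide numbers are averages over applications, since averages of ratios do not compose exactly.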

Questions?

Backup Slides


Processor
