
Thread-Level Speculation on a CMP Can Be Energy Efficient

Source: iacoma.cs.uiuc.edu/iacoma-papers/PRES/present_ics05_1.pdf


Page 1: Thread-Level Speculation on a CMP Can Be Energy Efficient

Jose Renau
University of California at Santa Cruz
http://masc.soe.ucsc.edu

Karin Strauss, Luis Ceze, Wei Liu, Smruti Sarangi, James Tuck, and Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu

Page 2: Challenges

Wire delay: Not a single monolithic processor
Power: Energy-efficient design (simple cores are efficient)
Complexity: Very large block reuse
[Figure: Clock Reach]
Chip Multiprocessor with Thread Level Speculation?

The 19th ACM International Conference on Supercomputing

Page 3: Thread Level Speculation (TLS)

Sequential:
    for(i=0; i<n; i++) {
      X[Y[i]] = X[Z[i]]...
    }

Compilers cannot parallelize
TLS: Assume no dependences, hardware verifies

TLS Task A:
    for(i=0; i<n/2; i++) {
      X[Y[i]] = X[Z[i]]...
    }

TLS Task B:
    for(i=n/2; i<n; i++) {
      X[Y[i]] = X[Z[i]]...
    }
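As a rough illustration of why this loop cannot be split at compile time, the Python sketch below models the slide's loop and checks, for concrete index arrays, whether Task B (second half) would read a location that Task A (first half) writes. The function names and sample arrays are assumptions made for this example; the slides give only the loop itself.

```python
def run_sequential(x, y, z):
    """Reference semantics of the slide's loop: X[Y[i]] = X[Z[i]], in order."""
    x = list(x)
    for i in range(len(y)):
        x[y[i]] = x[z[i]]
    return x


def cross_task_dependence(y, z):
    """True if Task B reads a location that Task A writes (a true dependence).

    Task A covers iterations [0, n/2) and Task B covers [n/2, n),
    mirroring the split shown on the slide.
    """
    half = len(y) // 2
    written_by_a = {y[i] for i in range(half)}
    read_by_b = {z[i] for i in range(half, len(y))}
    return bool(written_by_a & read_by_b)


# Hypothetical inputs: the answer depends only on run-time values of Y and Z.
independent = cross_task_dependence([0, 1, 2, 3], [4, 5, 6, 7])
dependent = cross_task_dependence([0, 1, 2, 3], [4, 5, 1, 7])
```

Because the result changes with the contents of Y and Z, a static compiler has to assume the worst case, whereas TLS runs both halves optimistically and lets the hardware check.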

Page 4: Thread Level Speculation (TLS)

TLS Hardware:
Tracks data accesses at run-time
Detects dependence violations
Kills and restarts tasks

[Figure: execution timelines of tasks A and B for Sequential, TLS (no dep violations), and TLS (dep violation)]
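The three hardware duties listed on this slide (track accesses, detect violations, kill and restart) can be sketched in software. The Python model below is a minimal sketch, assuming two tasks, a per-task write buffer, and a read set that is checked when the earlier task commits; none of these names or structures come from the slides, and real TLS hardware performs these checks in the cache hierarchy rather than in a loop.

```python
def run_task(committed, y, z, lo, hi):
    """Run iterations [lo, hi) speculatively.

    Writes go to a private buffer; reads that are not satisfied by the
    task's own buffer are logged so a later check can find violations.
    """
    writes, reads = {}, set()
    for i in range(lo, hi):
        src = z[i]
        if src in writes:
            value = writes[src]
        else:
            value = committed[src]
            reads.add(src)          # tracked access
        writes[y[i]] = value
    return writes, reads


def run_tls(x, y, z):
    """Model two tasks: A (first half) and B (second half, speculative)."""
    x = list(x)
    half = len(y) // 2
    writes_a, _ = run_task(x, y, z, 0, half)
    writes_b, reads_b = run_task(x, y, z, half, len(y))  # same starting state

    for addr, value in writes_a.items():                 # A commits first
        x[addr] = value

    squashes = 0
    if reads_b & set(writes_a):                          # B read stale data
        squashes = 1
        writes_b, _ = run_task(x, y, z, half, len(y))    # kill and restart B

    for addr, value in writes_b.items():
        x[addr] = value
    return x, squashes
```

With no cross-task dependence the second task's work is kept; with one, its first execution is discarded, which is the energy cost the later slides call task squash.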

Page 5: Contributions

Contrary to common wisdom, TLS CMP can be energy-efficient

Identify the sources of energy waste in TLS
Propose novel energy-centric optimizations
Design energy-efficient memory hierarchy for TLS CMP

Page 6: Main Results

TLS: 27% faster and 28% more energy
6-issue: 23% faster and 52% more energy

[Figure: configurations compared: 1 CPU 3-issue; 1 CPU 6-issue; TLS CMP (4 CPUs 3-issue with TLS)]

Page 7: Energy Cost of TLS

Source of energy waste: Why?
Task squash: Dependence violation
Additional storage & logic in memory system: Need to version data
Additional traffic in memory system: TLS-aware cache coherence protocol
Additional instructions: Compiler overhead


Page 10: Additional Storage & Logic

Cache lines are associated to tasks
Cache line tags are extended with Version ID
[Figure: cache line layout: Version ID | Tag | Data]
Messages between caches compare Version IDs
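A small data-structure sketch may make the extended tag concrete. In the Python below, a cache line carries the three fields shown on the slide, and a lookup hits only when both the address tag and the Version ID match. The class name, the `lookup` helper, and the use of integers for Version IDs are assumptions for illustration, not details taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    version_id: int  # task that owns this version of the line (assumed integer ID)
    tag: int         # conventional address tag
    data: bytes


def lookup(cache_set, tag, version_id):
    """Return the matching line, or None on a miss.

    The extra Version ID comparison is the additional logic (and energy)
    that a TLS cache pays on every access.
    """
    for line in cache_set:
        if line.tag == tag and line.version_id == version_id:
            return line
    return None


# Two versions of the same address can coexist in one set.
cache_set = [CacheLine(1, 0x40, b"old"), CacheLine(2, 0x40, b"new")]
```

Keeping several versions of one address side by side is what lets speculative tasks run ahead without overwriting each other's state.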


Page 12: Additional Traffic

More cache misses: Cannot displace speculative data
Handling multiple versions: Example: Find the correct version on cache miss
Detect data dependence violations across tasks: Need extra checks
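The "find the correct version on cache miss" example can be sketched as a selection over the versions that other caches report. The Python below assumes that Version IDs follow task order (lower means earlier) and that a reader wants the most recent version produced by itself or a predecessor. Both are assumptions for this sketch; the slides do not state the selection rule.

```python
def select_version(candidates, requester_id):
    """Pick the version a requesting task should observe.

    candidates: list of (version_id, data) pairs returned by other caches.
    requester_id: Version ID of the task that missed.
    Returns the chosen pair, or None when no suitable version exists and
    the request would fall through to the next level of the hierarchy.
    """
    eligible = [c for c in candidates if c[0] <= requester_id]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c[0])


# Hypothetical responses from three caches for one address.
responses = [(1, "a"), (3, "b"), (5, "c")]
```

Every such miss has to gather and compare versions from several caches, which is one source of the additional traffic listed on the slide.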

Page 13: Energy-Centric Optimizations

Substantial energy reduction
Overlooked in performance-centric designs

Source of energy waste: Energy-centric optimization
Task squash: Stall after second restart; Energy-aware profiling
Additional storage & logic: Avoid walking the cache
Additional traffic: ---
Additional instructions: Energy-aware profiling
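One possible reading of "stall after second restart" is that a task may restart speculatively twice and, if squashed again, waits until it is non-speculative before running once more. The Python sketch below counts the executions thrown away under that reading. The function name, the `max_restarts` parameter, and the input model are assumptions; the slides give only the name of the optimization.

```python
def run_with_stall(violations_if_speculative, max_restarts=2):
    """Count squashed (wasted) executions of one task.

    violations_if_speculative: how many times the task would be squashed
    if it always restarted speculatively.
    max_restarts: speculative restarts allowed before the task stalls
    (2 matches this sketch's reading of the slide).
    """
    restarts = 0
    wasted = 0
    remaining = violations_if_speculative
    while remaining > 0:
        wasted += 1                  # this execution is squashed
        remaining -= 1
        if restarts == max_restarts:
            # Stall: the next run is non-speculative and cannot be squashed.
            return {"wasted": wasted, "restarts": restarts, "stalled": True}
        restarts += 1
    return {"wasted": wasted, "restarts": restarts, "stalled": False}
```

Under this model a task that would otherwise be squashed five times wastes three executions instead of five, trading some lost overlap for less discarded work.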

Page 14: Additional Instructions

Spawn & commit instructions inserted by compiler
Conventional compiler optimizations not so effective due to code partitioning into tasks
Live-ins spilling

~15% additional instructions

Page 15: Design Philosophy

Reduce number of checks
Reduce cost of each check
Eliminate low-return work

Example: Energy-aware profiling
Prune tasks that are expected to give minor speedups at high energy cost
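The pruning rule in the energy-aware profiling example can be sketched as a filter over profiled tasks. In the Python below, the `TaskProfile` fields, the task names, and both thresholds are hypothetical values chosen for illustration; the slides say only that tasks with minor speedups and high energy cost are pruned.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    name: str
    speedup: float  # profiled speedup when the task is spawned (1.0 = no gain)
    energy: float   # profiled energy relative to not spawning it (1.0 = no extra)


def prune(tasks, min_speedup=1.05, max_energy=1.30):
    """Drop tasks that pair a minor speedup with a high energy cost.

    Both thresholds are assumed values, not figures from the slides.
    """
    return [
        t for t in tasks
        if not (t.speedup < min_speedup and t.energy > max_energy)
    ]


profiles = [
    TaskProfile("hot_loop", 1.40, 1.20),   # clear win: kept
    TaskProfile("tiny_call", 1.02, 1.50),  # minor gain, costly: pruned
    TaskProfile("cheap", 1.01, 1.05),      # minor gain, but cheap: kept
]
kept = [t.name for t in prune(profiles)]
```

A performance-only profiler would keep `tiny_call` because it still yields a small gain; adding the energy term is what removes it.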

Page 16: Simulation Environment

Uni-4i: 1 CPU 3-issue
Uni-6i: 1 CPU 6-issue
TLS4-3i: 4 CPUs 3-issue with TLS

70nm @ 5GHz (same area approx.)
All processors have same pipeline depth
16KB L1 cache (1 cycle slower in TLS due to versioning)
1MB L2 cache on-chip

Page 17: Performance

[Figure: performance results; chart content not preserved]

Page 18: Power

[Figure: power results; chart content not preserved]

Page 19: Cost of TLS

[Figure: cost of TLS; chart content not preserved]


Page 25: Conclusions: TLS Power is Promising

[Figure: Energy versus Performance for the 3issue, 6issue, and TLS configurations]
6issue: 23% speedup, +87% power
TLS: 27% speedup, +59% power
TLS compared with 6issue: 3% speedup, -15% power

Results for single thread applications (SPECint2000)
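The annotations on this slide appear to be related by simple ratios. Assuming the 23% / +87% figures describe 6issue and the 27% / +59% figures describe TLS, both relative to 3issue, dividing one by the other gives approximately the third annotation (3% speedup, -15% power). The Python below is a quick arithmetic check under that assumption; the dictionary names are made up for the example.

```python
# Factors relative to the 3issue baseline, as read from the slide.
SPEEDUP = {"6issue": 1.23, "TLS": 1.27}
POWER = {"6issue": 1.87, "TLS": 1.59}


def tls_vs_6issue():
    """Percent change in speed and power when moving from 6issue to TLS."""
    speedup_pct = (SPEEDUP["TLS"] / SPEEDUP["6issue"] - 1) * 100
    power_pct = (POWER["TLS"] / POWER["6issue"] - 1) * 100
    return round(speedup_pct), round(power_pct)
```

The result agrees with the slide's claim that TLS on four 3-issue cores is slightly faster than a single 6-issue core while drawing less power.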

Page 26: Questions?

Page 27: Backup Slides

Page 28: Processor

[Figure: processor; content not preserved]