Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin— Madison http://www.ece.wisc.edu/~pharm


Lazy Logic

Mikko H. Lipasti, Associate Professor

Department of Electrical and Computer Engineering

University of Wisconsin—Madison

http://www.ece.wisc.edu/~pharm


July 6, 2007. Mikko Lipasti, University of Wisconsin. Seminar, University of Toronto.

CMOS History
- CMOS has been a faithful servant
  - 40+ years since invention
  - Tremendous advances
    - Device size, integration level
    - Voltage scaling
    - Yield, manufacturability, reliability
  - Nearly 20 years now as high-performance workhorse
- Result: life has been easy for architects
  - Ease leads to complacency & laziness

CMOS Futures
- "The reports of my demise are greatly exaggerated." – Mark Twain
- CMOS has some life left in it
  - Device scaling will continue
  - What comes after CMOS…
- Many new challenges
  - Process variability
  - Device reliability
  - Leakage power
  - Dynamic power: focus of this talk

Dynamic Power
- Static CMOS: current flows when transistors switch
  - Combinational logic evaluates new inputs
  - Flip-flop, latch captures new value (clock edge)
- Terms
  - C: capacitance of circuit (wire length, number and size of transistors)
  - V: supply voltage
  - A: activity factor
  - f: frequency
- Architects can/should focus on Ci x Ai
  - Reduce capacitance of each unit
  - Reduce activity of each unit

$$P_{dyn} = k \sum_{i \in \text{units}} C_i V_i^2 A_i f$$
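The dynamic power expression on this slide can be sketched numerically. The following is a minimal illustration, not a power model from the talk: the `Unit` class, the unit names, and every numeric value are made up, and treating voltage as a per-unit quantity with a single shared frequency is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    capacitance: float  # C_i, switched capacitance in farads (illustrative)
    voltage: float      # V_i, supply voltage in volts (illustrative)
    activity: float     # A_i, fraction of cycles in which the unit switches


def dynamic_power(units, frequency_hz, k=1.0):
    """Return k * sum_i(C_i * V_i^2 * A_i * f) in watts."""
    return k * sum(u.capacitance * u.voltage ** 2 * u.activity * frequency_hz
                   for u in units)


# Two hypothetical units; the numbers are not taken from the talk.
units = [
    Unit("alu", capacitance=2e-10, voltage=1.0, activity=0.5),
    Unit("fetch", capacitance=4e-10, voltage=1.0, activity=0.9),
]
baseline = dynamic_power(units, frequency_hz=2e9)

# Halving the fetch unit's activity factor lowers only that unit's term.
units[1].activity = 0.45
reduced = dynamic_power(units, frequency_hz=2e9)
print(f"baseline={baseline:.3f} W, reduced={reduced:.3f} W")
```

Because each unit contributes its own Ci x Ai term, reducing the activity of one unit leaves the other terms unchanged, which is why the slide singles out those two factors as the architect's levers.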

Design Objective Inversion
- Historically, hardware was expensive
  - Every gate, wire, cable, unit mattered
  - Squeeze maximum utilization from each
- Now, power is expensive
  - On-chip devices & wires, not so much
  - Should minimize Ci x Ai
- Logic should be simple, infrequently used
  - Both sequential and combinational
- Lazy Logic

Talk Outline
- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic
  - Circuit-switched coherence
  - Stall-cycle redistribution
  - Dynamic scheduling
- Conclusions
- Research Group Overview

What is Lazy Logic?
- Design philosophy
- Some overall principles
  - Minimize unit utilization
  - Minimize unit complexity
  - OK to increase number of units/wires/devices
    - As long as reduced Ai (activity) compensates
    - Don't forget leakage
- Result
  - Reject conventional "good ideas"
  - Reduce power without loss of performance
  - Sometimes improve performance
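The principle that adding units is acceptable as long as reduced activity compensates can be checked with the Ci x Ai product. The sketch below uses made-up numbers in arbitrary units; the function names and the flat per-unit leakage charge are assumptions for illustration, not a model from the talk.

```python
def switched_capacitance(units):
    """Sum of C_i * A_i over (capacitance, activity) pairs."""
    return sum(c * a for c, a in units)


def replication_pays_off(shared, replicated, leakage_per_unit=0.0):
    """True if the replicated design has the lower C*A plus leakage charge.

    leakage_per_unit is a hypothetical flat penalty per unit, in the same
    arbitrary units, standing in for "don't forget leakage".
    """
    cost_shared = switched_capacitance(shared) + leakage_per_unit * len(shared)
    cost_repl = (switched_capacitance(replicated)
                 + leakage_per_unit * len(replicated))
    return cost_repl < cost_shared


# One complex, busy unit versus four simple, mostly idle units (made-up values).
shared = [(10.0, 0.8)]
replicated = [(3.0, 0.2)] * 4

print(replication_pays_off(shared, replicated))
print(replication_pays_off(shared, replicated, leakage_per_unit=2.0))
```

With these values the replicated design wins on switched capacitance alone, but a large enough leakage charge per unit reverses the result, which is the caveat the slide raises.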

Lazy Logic Applications
- CMP interconnection networks
  - Old: Packet-switched, store-and-forward
  - New: Circuit-switched, reconfigurable
- Stall cycle redistribution
  - Transparent pipelines want fine-grained stalls
  - Redistribute coarse stalls into fine stalls
- High-performance dynamic scheduling
  - Cycle time goal achieved by replicating ALUs

CMP Interconnection Networks
- Options
  - Buses don't scale
  - Crossbars are too expensive
  - Rings are too slow
  - Packet-switched mesh
    - Attractive for all the DSM reasons
    - Scalable
    - Low latency
    - High link utilization

CMP Interconnection Networks
- But…
  - Cables/traces are now on-chip wires
    - Fast, cheap, plentiful
    - Short: 1 cycle per hop
  - Router latency adds up
    - 3-4 cpu cycles per hop
  - Store-and-forward
    - Lots of activity/power
- Is this the right answer?
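A back-of-the-envelope comparison shows why per-hop router latency dominates on chip. This sketch takes the 1-cycle link and 3-cycle router figures from the slide (the low end of the 3-4 cycle range) and assumes a simple additive model with no contention and no path setup cost; the function names are illustrative.

```python
def packet_switched_latency(hops, router_cycles=3, link_cycles=1):
    """Cycles to cross `hops` hops when every hop pays router plus link delay."""
    return hops * (router_cycles + link_cycles)


def circuit_switched_latency(hops, link_cycles=1):
    """Cycles when switches along an established path behave like wires."""
    return hops * link_cycles


for hops in (1, 3, 6):
    print(hops, packet_switched_latency(hops), circuit_switched_latency(hops))
```

Under these assumptions the packet-switched path costs four to five times as many cycles as the wire delay alone, independent of hop count.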

Circuit-switched Interconnects
- Communication patterns
  - Spatial locality to memory
  - Pairwise communication
- Circuit-switched links
  - Avoid switching/routing
  - Reduce latency
  - Save power?

Router Design
- Switches can be logically configured to appear as wires (no routing overhead)
- Can also act as packet-switched network
- Can switch back and forth very easily
- Detailed router design not presented here
[Figure: switch with ports N, S, E, W, and P]

Dirty Miss coverage
[Figure: % of dirty misses (y-axis, 40% to 100%) vs. number of circuit-switched connections per processor (x-axis, 1 to 15); series: SPECjbb, SPECweb, TPC-H, TPC-W]

Directory Protocol
- Initial 3-hop miss establishes CS path
- Subsequent miss requests
  - Sent directly on CS path to predicted owner
  - Also in parallel to home node
  - Predicted owner sources data early
  - Directory acks update to sharing list
- Benefits
  - Reduced 3-hop latency
  - Less activity, less power

Circuit-switched Performance
[Figure: normalized cycle count (y-axis, 0 to 1.2) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, Radiosity; series: Base; Fully connected, Oracle; Limit 1, Oracle; Limit 1, Region Prediction]

Link Activity
[Figure: normalized link activity (y-axis, 0% to 100%) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, Radiosity; series: Limit 1, Oracle; Limit 1, Region Prediction]

Buffer Activity
[Figure: normalized input buffer activity (y-axis, 0% to 100%) for TPC-H, SPECjbb2000, SPECweb99, TPC-W, Barnes-Hut, Ocean, Radiosity; series: Limit 1, Oracle; Limit 1, Region Prediction]

Circuit-switched Coherence Summary
- Reconfigurable interconnect
  - Circuit-switched links
  - Some performance benefit
  - Substantial reduction in activity
- Current status (slides are out of date)
  - Router design and physical/area models
  - Protocol tuning and tweaks, etc.
  - Initial results in CA Letters paper

Talk Outline
- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic
  - Circuit-switched coherence
  - Stall-cycle redistribution
  - Dynamic scheduling
- Conclusions
- Research Group Overview


April 21, 2023 Eric L. Hill – Preliminary Exam 20

Pipeline Clocking Revisited
[Figure: pipeline carrying work units A and B. Two units of work, 10 clock pulses. Latches clocked to propagate data.]
- Conventional pipeline clock gating
  - Each valid work unit gets clocked into each latch
  - This is needlessly conservative

Transparent Pipeline Gating
[Figure: pipeline carrying work units A and B. Two units of work, 5 clock pulses.]
- Transparent pipelining: novel approach to clocking [Jacobson 2004, 2005]
  - Both master and slave latch can remain transparent
  - Gating logic ensures no races
  - Pipeline registers are clocked lazily, only when a race would occur
- Quite effective for low utilization pipelines
  - Gaps between valid work units enable transparent mode

Applications
- Best suited for low utilization pipelines
  - E.g. FP, media processing functional units
- High utilization pipelines see least benefit
  - E.g. instruction fetch pipelines
- To benefit from transparent approach:
  - Valid data items need fine-grained gaps (stalls)
  - 1-cycle gap provides lion's share (50%)

Application: Front-end Pipelines
- Provide back-end with sufficient supply of instructions to find ILP
  - High branch prediction accuracy
  - Low instruction cache miss rates
- Little opportunity for clock gating
  - Designed to feed peak demand
- Poor match for transparent pipeline gating

In-Order Execution Model
- In-order cores
  - Power efficient
  - Low design complexity
  - Throughput oriented
- CMP systems trending towards simple cores (e.g. Sun Niagara)
- Data dependences cause fine-grained stalls at dispatch
- Can we project these back to fetch?
  - Exploit fetch slack
[Figure: pipeline timing diagram with a time axis]
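The idea of projecting dispatch stalls back to fetch can be illustrated with a toy timeline. This is a sketch under strong assumptions, not the mechanism evaluated in the talk: the per-group stall counts are supplied by hand, the issue buffer, branch mispredictions, and cache misses are ignored, and the function names are made up.

```python
def fetch_schedule(stalls_after_group, redistribute):
    """Return a fetch timeline: 'F' for a fetched group, '-' for a bubble.

    stalls_after_group[i] is the number of dispatch stall cycles that
    follow group i (hypothetical values supplied by the caller).
    """
    if redistribute:
        # Place each group's stall cycles right behind it in the fetch stream.
        timeline = []
        for stalls in stalls_after_group:
            timeline.append("F")
            timeline.extend("-" * stalls)
        return "".join(timeline)
    # Baseline: fetch back to back, then sit idle in one coarse stall.
    total = sum(stalls_after_group)
    return "F" * len(stalls_after_group) + "-" * total


def isolated_groups(timeline):
    """Count fetched groups that have a bubble immediately behind them."""
    return sum(1 for i, slot in enumerate(timeline)
               if slot == "F" and i + 1 < len(timeline)
               and timeline[i + 1] == "-")


stalls = [1, 1, 2, 1]
coarse = fetch_schedule(stalls, redistribute=False)
fine = fetch_schedule(stalls, redistribute=True)
print(coarse, isolated_groups(coarse))
print(fine, isolated_groups(fine))
```

Both timelines have the same length, so the toy front end delivers the same groups in the same number of cycles; only the placement of the bubbles changes, and the fine-grained version leaves a gap behind every group.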

Pipeline Diagram
[Figure: front-end pipeline with labels Bpred, PC, bpred update, 0x0, RP, Instruction Fetch, Issue Buffer, clock vector, Execution Core]

Available Fetch Slack
[Figure: fraction of instruction groups observed (y-axis, 0.00 to 1.00); legend categories: 0, 1, 2, 3, 4, 5, 6, 7+]

Implementation
- Stall cycle bits embedded in BTB
  - EPIC ISAs (IA64) could use stop bits
- Verify prediction by observing unperturbed groups
  - Let high confidence groups periodically execute unperturbed
  - Observe overall increase in execution time
- Modeled Cell PPU-like PowerPC core with aggressive clock gating
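One way to picture the predict-and-verify loop above is a small table of predicted stall cycles guarded by a confidence counter. The sketch below borrows the 4 confidence bits, the ">4" high-confidence threshold, and the 10% random check rate from the experimental setup slide in the backup material; everything else (the class and method names, the table organization, and the update rule) is an assumption made for illustration and is not taken from the talk.

```python
import random


class StallPredictor:
    """Per-group predicted stall cycles guarded by a saturating confidence counter."""

    def __init__(self, conf_bits=4, threshold=4, check_rate=0.10, rng=None):
        self.conf_max = (1 << conf_bits) - 1
        self.threshold = threshold
        self.check_rate = check_rate
        self.rng = rng or random.Random(0)
        self.table = {}  # group PC -> [predicted_stall, confidence]

    def stall_to_insert(self, pc):
        """Stall cycles to inject at fetch; 0 means run the group unperturbed."""
        stall, conf = self.table.get(pc, (0, 0))
        if conf <= self.threshold:
            return 0  # not confident yet: leave the group alone and observe it
        if self.rng.random() < self.check_rate:
            return 0  # periodic check: let a confident group run unperturbed
        return stall

    def observe_unperturbed(self, pc, observed_stall):
        """Train on the dispatch stall seen when the group ran unperturbed."""
        entry = self.table.setdefault(pc, [observed_stall, 0])
        if entry[0] == observed_stall:
            entry[1] = min(entry[1] + 1, self.conf_max)
        else:
            # Hypothetical policy: a mismatch replaces the prediction and
            # resets confidence.
            entry[0], entry[1] = observed_stall, 0
```

A group only has stalls injected after its observed stall has repeated often enough to exceed the threshold, and even then it is occasionally run unperturbed so the prediction can be rechecked.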

Latch Activity Reduction
[Figure: normalized latch activity factor (y-axis, 0 to 1.2); series: scr, scr+tcg]

FE Energy Delay Product
[Figure: normalized front-end energy-delay product (J*s) (y-axis, 0 to 1.2); components: fe_latch, bpred, icache; configurations: base, scr, scr+tpg]

Stall Cycle Redistribution Summary [ISLPED 2006]
- Transparent pipelines reduce latch activity
  - Not effective in pipelines with coarse-grained stalls (e.g. fetch)
- Coarse-grained stalls can be redistributed without affecting performance (fetch slack)
- Benefits
  - Equivalent performance, lower power
  - Transparent fetch pipeline now attractive

Talk Outline
- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic
  - Circuit-switched coherence
  - Stall-cycle redistribution
  - Dynamic scheduling
- Conclusions
- Research Group Overview

A Brief Scheduler Overview
[Figure (animation): pipeline diagrams for three scheduler organizations.
  Atomic Sched/Exe: Fetch, Decode, Sched/Exe, Writeback, Commit.
  Wakeup/Select: Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback, Commit.
  Spec wakeup/select: Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback/Recover, Commit. Speculatively issued instructions; re-schedule when latency mispredicted ("Latency Changed!!", invalid input value).]
- Data capture / non-data capture scheduler
  - Data capture scheduler desirable for many reasons
  - Cycle time is not competitive because of data path delay
- Speculative scheduling
  - Current machines use speculative scheduling
  - Misscheduled/replayed instructions burn power
  - Depending on recovery policy, up to 17% issued insts need to replay

Slicing the Core
- Bitslice the core: narrow (16b) and wide (64b)
- Narrow core can be full data capture
  - Still makes aggressive cycle time (with lazy logic)
  - Completely nonspeculative, virtually no replays
  - Further power benefits (not in this talk)
[Figure: Front-End, OoO Core, Back-End]

Dynamic Scheduling with Partial Operand Values
- Narrow core
  - Computes partial operand
  - Determines load latency
  - Avoids misscheduling
- Wide core
  - Computes the rest of the operand (if needed)
[Figure (animation): pipeline Fetch, Decode, Sched & Nrw Exe (wakeup/select), Dispatch, RF, Exe, Writeback/Recover, Commit; label: the rest of the data]

Scheduler w/ Narrow Data-Path
- Non-data capture scheduler
  - Select – mux – tag bcast & compare – ready wr
- Naïve narrow data capture scheduler
  - Select – mux – tag bcast & compare – ready wr
  - Select – mux – narrow ALU – data bcast – data wr
  - Increased cycle time
[Figure (a): scheduler entries (ROB ID, Data1, Tag1, Data2, Tag2) with tag comparators and select logic; Dest; paths (1) and (2); Int ALU, LSQ, Cache, Adder; To Wide Data Path]

Scheduler w/ Embedded ALUs
- With embedded ALUs
  - Select – mux – tag bcast & compare – ready wr
  - Max(select, data bcast – mux – narrow ALU) – mux – latch setup
- Lazy Logic
  - Replicated ALUs
  - Low utilization
  - Off critical delay path
[Figure (b): scheduler entries (ROB ID, Data1, Tag1, R, Data2, Tag2, R) with tag comparators, embedded Int ALUs, select logic, muxes (M), and latch; Dest; paths (1) and (2); LSQ, Cache; To Wide Data Path]

Cycle Time, Area, Energy
- 32 entries, implemented using Verilog
- Synthesized using Synopsys Design Compiler and LSI Logic's gflxp 0.11um
Cycle Time (ns): Full-Data Capture 2.04; Non-Data Capture 1.28; Narrow-Data Capture w/ ALUs 1.28; Narrow-Data Capture 1.71
Area (mm2), row order not preserved: 1.43, 1.53, 1.49, 1.98
Energy (nJ), row order not preserved: 1.54, 1.48, 1.46, 1.40

Dynamic Scheduling Summary
- Benefits: [JILP 2007]
  - Save 25-30% of total OoO window energy => 12-18% total dynamic chip power
  - Reduce misspeculated loads by 75%-80%
  - Slightly improved IPC
  - Comparable cycle time
- Enabled by:
  - Lazy narrow ALUs
  - ALUs are cheap, so compute in parallel with scheduling select logic

Talk Outline
- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic
  - Circuit-switched coherence
  - Stall-cycle redistribution
  - Dynamic scheduling
- Conclusions
- Research Group Overview

Conclusions
- Lazy Logic
  - Promising new design philosophy
- Some overall principles
  - Minimize unit utilization
  - Minimize unit complexity
  - OK to increase number of units/wires/devices
- Initial Results
  - Circuit-switched CMP interconnects
  - Stall cycle redistribution
  - Dynamic Scheduling

Who Are We?
- Faculty: Mikko Lipasti
- Current Ph.D. students:
  - Profligate execution: Gordie Bell (joining IBM in 2006)
  - Coarse-grained coherence: Jason Cantin (joining IBM in 2006)
  - Lazy Logic
    - Circuit-switched coherence: Natalie Enright
    - Stall cycle redistribution: Eric Hill
    - Dynamic scheduling: Erika Gunadi
  - Dynamic code optimization: Lixin Su
  - SMT/CMP scheduling/resource allocation: Dana Vantrease
- Pharmed out:
  - IBM: Trey Cain, Brian Mestan
  - AMD: Kevin Lepak
  - Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
  - Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka

Research Group Overview
- Faculty: Mikko Lipasti, since 1999
- Current MS/PhD students
  - Gordie Bell, Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
- Graduates, current employment:
  - AMD: Kevin Lepak
  - IBM: Trey Cain, Jason Cantin, Brian Mestan
  - Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
  - Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka

Current Focus Areas
- Multiprocessors
  - Coherence protocol optimization
  - Interconnection network design
  - Fairness issues in hierarchical systems
- Microprocessor design
  - Complexity-effective microarchitecture
  - Scalable dynamic scheduling hardware
  - Speculation reduction for power savings
  - Transparent clock gating
  - Domain-specific ISA extensions
- Software
  - Java Virtual Machine run-time optimization
  - Workload development and characterization

Funding
- IBM
  - Faculty Partnership Awards
  - Shared University Research equipment
- Intel
  - Research council support
  - Equipment donations
- National Science Foundation
  - CSA, ITR, NGS, CPA
  - Career Award
- Schneider ECE Faculty Fellowship
- UW Graduate School

Questions?
http://www.ece.wisc.edu/~pharm



Backup slides

Technology Parameters
- 65 nm technology generation
- 16 tiled processors
  - Approximately 4 mm x 4 mm
  - Signal can travel approximately 4 mm/cycle
- Circuit switched interconnect consists of 5 mm unidirectional links

Broadcast Protocol
- Broadcast to all nodes
- Establish Circuit-Switched path with owner of data
- Future broadcasts will use Circuit-Switched path to reduce power
  - Predict when CS path will suffice
- Use LRU information for paths to tear down old paths when resources need to be claimed by new path
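The LRU teardown policy can be sketched as a small table of established paths per processor. This is an illustrative model only: the `CircuitPathTable` class, its `send` method, and the assumption that a broadcast always ends with a path to the owner are hypothetical, and the prediction of when a circuit-switched path will suffice is not modeled.

```python
from collections import OrderedDict


class CircuitPathTable:
    """Tracks established circuit-switched paths and evicts the least recently used."""

    def __init__(self, max_paths):
        self.max_paths = max_paths
        self.paths = OrderedDict()  # destination node -> True, oldest first

    def send(self, dest):
        """Return 'circuit' if a path to dest exists, else 'broadcast' and set one up."""
        if dest in self.paths:
            self.paths.move_to_end(dest)  # mark as most recently used
            return "circuit"
        if len(self.paths) >= self.max_paths:
            self.paths.popitem(last=False)  # tear down the LRU path
        self.paths[dest] = True
        return "broadcast"


table = CircuitPathTable(max_paths=2)
for owner in (3, 5, 3, 7, 5):
    print(owner, table.send(owner))
```

With two paths per processor, the request to node 7 tears down the path to node 5 (the least recently used), so the later request to node 5 has to broadcast again.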

Switch Design from paper
[Figure: switch with N, S, E, W ports and a Processor port, five configuration memories (CM), and a Buffer. CM = Configuration Memory]

Race example from paper (1 of 2)
[Figure: messages among P0, P1, P2, and Dir3, with labels: 1a. CS Req; 1b. CS Notify; 2. Upgrade; 3.; 4. CS Resp (S); 5. Invalidate; 6. Inval Resp; 7.; Downgrade]

Race example (2 of 2)
[Figure: messages among P0, P1, P2, and Dir3, with labels: 1a. CS Req; 1b. CS Notify; 2. Upgrade; 3.; 4a. CS Resp (S); 4b. Nack; 5. Invalidate; 6. Inval Resp]

LRU pairs for Dirty Misses
- 23 or fewer pairs capture >80% of dirty misses for 3 out of 4 benchmarks (16p)
[Figure: y-axis 0% to 100%; x-axis 1 to 235; series: Specjbb, specweb, tpch, tpcw]

Local LRU pairs
- 2 Circuit-Switched Paths per processor cover between 55% and 85% of dirty misses
[Figure: Miss Rate (Local LRU); y-axis 0% to 70%; x-axis 1 to 15; series: Specjbb, specweb, tpch, tpcw]

Concurrent Links
- 5 concurrent links cover 90% necessary pairs
- Captures 50%-77% of overall opportunity
[Figure: 2 Circuit-Switched Paths per Processor; y-axis 0% to 110%; x-axis 1 to 9; series: SpecJBB, Specweb, TPC-H, TPC-W]

Experimental Setup
- PHARMsim
  - Activity-based power model based on Wattch added
- InOrder issue
  - 4/2/2 fetch/issue/commit (based on Cell PPU)
  - 10 stage transparent front-end pipeline (conventional latches at endpoints)
  - Gshare (8k entry) branch predictor, 1024 set, 4-way BTB
  - 32KB I/D cache (1/4), 512KB L2 cache (12)
  - 4 confidence bits / >4 high conf threshold / predictions checked randomly 10% of the time
- Benchmarks simulated for 250M instructions

Branch Predictor Activity
[Figure: normalized bpred activity (y-axis, 0 to 1.2); series: scr_extra, normal]

Related Work
- Removing Wrong Path Instructions [Manne 1998]
- Flow Based Throttling Techniques [Baniasadi 2001, Karkhanis 2002]

Future Work
- Explore performance of other fetch gating schemes with transparent pipelining
- Explore dependence driven gating on Itanium machine model
- Explore latch soft error vulnerability (TVF) when lazy clocking is used
- Explore change in AVF when fetch gating is used
  - Less ACE state in-flight

Scheduling Replay Example
- Squashing/non-selective replay – Alpha 21264
  - Replays all dependent and independent instructions issued under load shadow
  - Analogous to squashing recovery in branch misprediction
  - Simple but high performance penalty
    - Independent instructions are unnecessarily replayed
[Figure (animation): LD, ADD, OR, AND, BR moving through Sched, Disp, RF, Exe, Retire; LD takes a cache miss; invalidate & replay ALL instructions in the load shadow; instructions reissue once the miss is resolved]
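The cost of non-selective replay can be seen by comparing it with a dependence-aware alternative on the five instructions in the figure. This is a sketch under stated assumptions: the issue cycles, the two-cycle load shadow, and the dependence graph are made up for illustration (the slide does not give them), and a consumer is assumed to issue at a later cycle than its producer.

```python
def non_selective_replay(issued, load, shadow_cycles):
    """Instructions replayed when `load` misses: everything issued in its shadow.

    issued maps instruction name -> issue cycle.
    """
    start = issued[load]
    return sorted(name for name, cycle in issued.items()
                  if name != load and start < cycle <= start + shadow_cycles)


def selective_replay(issued, load, shadow_cycles, deps):
    """Only shadow instructions that depend, directly or transitively, on the load.

    deps maps instruction name -> names of its producers.
    """
    shadow = set(non_selective_replay(issued, load, shadow_cycles))
    tainted = {load}
    # Visit in issue order so producers are classified before their consumers.
    for name in sorted(shadow, key=issued.get):
        if tainted.intersection(deps.get(name, ())):
            tainted.add(name)
    return sorted(tainted - {load})


# Hypothetical issue cycles and dependences for the instructions in the figure.
issued = {"LD": 0, "ADD": 1, "OR": 1, "AND": 2, "BR": 2}
deps = {"ADD": ["LD"], "AND": ["ADD"], "BR": ["OR"]}

print(non_selective_replay(issued, "LD", shadow_cycles=2))
print(selective_replay(issued, "LD", shadow_cycles=2, deps=deps))
```

Under these assumptions the squashing policy replays all four younger instructions, while only ADD and AND actually consumed the missing load's result; OR and BR are the unnecessarily replayed independent instructions the slide refers to.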

Page 61: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Narrow Core

Narrow Scheduler
  Captures partial operands
  Determines load latency (hit/miss)
Narrow Data-Path
  Narrow ALU: provides partial data to consumers
  Narrow LSQ and partial tag cache
    Finds only possible load data source
Uses least significant 16 bits
  Large enough to help predict load latency
  Small enough to achieve fast cycle time

Page 62: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

L/S Disambiguation & Partial Tag Matching

Exploits operand significance [Brooks et al. 1999, Canal et al. 2000]
Load/store disambiguation
  10 bits finds 99% of matching stores
Partial tag match
  16 bits for 97% (mcf) - 99% (bzip2) accuracy
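A minimal sketch of partial tag matching, assuming the 16KB, 4-way, 64B-block data cache configuration quoted later in the talk. With that geometry, the low 16 address bits cover the block offset (6 bits), the set index (6 bits), and 4 bits of tag. The function names and the dictionary-based cache model are hypothetical.

```python
OFFSET_BITS = 6    # 64B blocks
INDEX_BITS = 6     # 16KB / 4 ways / 64B = 64 sets
NARROW_BITS = 16   # low-order address bits available early

# Tag bits that are visible inside the low 16 address bits (4 here).
PARTIAL_TAG_BITS = NARROW_BITS - OFFSET_BITS - INDEX_BITS
PARTIAL_MASK = (1 << PARTIAL_TAG_BITS) - 1


def split(addr):
    """Return (set index, partial tag, full tag) for a byte address."""
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    full_tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return index, full_tag & PARTIAL_MASK, full_tag


def partial_lookup(cache, addr):
    """Predict hit/miss from the low 16 address bits only.

    cache maps set index -> list of full tags, one per valid way.
    Returns (predicted_hit, actual_hit).
    """
    index, partial_tag, full_tag = split(addr)
    ways = cache.get(index, [])
    predicted_hit = any((tag & PARTIAL_MASK) == partial_tag for tag in ways)
    actual_hit = full_tag in ways
    return predicted_hit, actual_hit


# One resident block whose full tag is 0x12345 in set 1.
cache = {1: [0x12345]}
same_block = partial_lookup(cache, 0x12345044)   # same block, different offset
alias = partial_lookup(cache, 0xABCD5040)        # same low 16 bits, other tag
other = partial_lookup(cache, 0x12346040)        # partial tag differs
print(same_block, alias, other)
```

A partial-tag mismatch is always a true miss. A partial-tag match is only a prediction, and the full tag comparison still has to confirm it, which is why accuracy stays below 100%.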

Page 63: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Outline

Motivation
Dynamic Scheduling with Narrow Values
  Scheduler with Narrow Data-Path
  Pipelined Data Cache
  Pipeline Integration
Implementation and Experiments
Conclusions and Future Work

Page 64: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Dynamic Scheduling with Partial Operands

Stores a subset of operands in scheduler
Exploits partial operand knowledge
  Load-store disambiguation
  Partial tag match

[Figure: block diagram with Front-End, OoO Core, and Back-End.]

Page 65: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Pipelined Cache w/ Early Bits

[Figure: block diagram of a two-bank cache. Labels: Narrow Bank, Wide Bank, Row Decoder, Subarray Decoder, Tag Array, Data Array, Tag Subarray, Data Sub-array, Comparator, Muxes, Latch, Partial Bits, Full Bits, To Narrow Data Path, To Wide Data Path. Pipeline stages: Disp1, Disp2, Agen.]

Narrow bank for partial access, wide bank for the rest
  Uses partial tag match in narrow bank
  Saves power in wide bank
  Hide wide cache bank latency by starting early

Page 66: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Narrow LSQ

Stores partial addresses of stores
Used for partial load-store disambiguation
Accessed in parallel with narrow bank
Saves power in the wide LSQ
  Cheaper direct mapped access rather than full associative search
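One way to picture the direct-mapped lookup is the sketch below. It is an illustration under assumptions, not the design from the talk: the `NarrowStoreTable` class, its method names, and the choice of indexing by the low 10 address bits (the figure quoted for load/store disambiguation) are all hypothetical.

```python
PARTIAL_BITS = 10                  # low-order store address bits kept
MASK = (1 << PARTIAL_BITS) - 1


class NarrowStoreTable:
    """Direct-mapped table of in-flight stores, indexed by partial address."""

    def __init__(self):
        # One slot per partial address; holds the youngest store's sequence
        # number, or None when no in-flight store maps to the slot.
        self.slots = [None] * (1 << PARTIAL_BITS)

    def insert_store(self, addr, seq):
        self.slots[addr & MASK] = seq

    def retire_store(self, addr, seq):
        # Only clear the slot if a younger store has not overwritten it.
        if self.slots[addr & MASK] == seq:
            self.slots[addr & MASK] = None

    def lookup_load(self, addr):
        """Return the sequence number of a possibly matching store, or None."""
        return self.slots[addr & MASK]


table = NarrowStoreTable()
table.insert_store(0x10000040, seq=7)
exact = table.lookup_load(0x10000040)     # same address: candidate store 7
aliased = table.lookup_load(0x20000040)   # same low 10 bits: also store 7
no_match = table.lookup_load(0x10000044)  # different low bits: no candidate
table.retire_store(0x10000040, seq=7)
after_retire = table.lookup_load(0x10000040)
print(exact, aliased, no_match, after_retire)
```

A load reads a single slot instead of comparing against every store queue entry. A returned candidate may be a false match on the upper address bits, so the wide LSQ still has to confirm it.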

Page 67: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Pipeline Integration

[Figure: pipeline diagram. Stage labels: Fetch, Decode, Rename, Queue, Sched, Disp, Disp, Partial Load, Int ALU, Mult/Div (three stages), Agen, Cache, Cache, WB, Commit.]

Simple ALU insts link dependences in back-to-back cycle
Complex ALU insts link dependences non-speculatively
Load insts need another cycle to schedule dependences

Page 68: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Pipelined Data Cache & LSQ

Modeled using modified CACTI 3.0
Configuration: 16KB, 4-way, 64B blocks

                                         Conventional Data Cache   Pipelined Data Cache
Access Latency, Narrow Bank              N/A                       0.80 ns
Access Latency, Wide Bank                1.24 ns                   0.60 ns
Total Energy Consumption (Cache + LSQ)   (0.62 + 0.11) nJ          (0.37 + 0.08) nJ
Total Area                               (1.21 + 0.40) mm²         (1.50 + 0.40) mm²
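The totals implied by these figures can be checked with a few lines of arithmetic. The sketch assumes the first value in each row belongs to the conventional cache and the second to the pipelined cache, and that the area row, like the energy row, is split as cache + LSQ. The variable and function names are hypothetical.

```python
# (cache, LSQ) pairs transcribed from the slide; the column assignment and the
# cache + LSQ split of the area row are assumptions.
conventional = {"energy_nj": (0.62, 0.11), "area_mm2": (1.21, 0.40), "wide_ns": 1.24}
pipelined = {"energy_nj": (0.37, 0.08), "area_mm2": (1.50, 0.40), "wide_ns": 0.60}


def pct_change(new, old):
    """Percentage change from old to new (negative means a reduction)."""
    return 100.0 * (new - old) / old


energy_change = pct_change(sum(pipelined["energy_nj"]), sum(conventional["energy_nj"]))
area_change = pct_change(sum(pipelined["area_mm2"]), sum(conventional["area_mm2"]))
wide_latency_change = pct_change(pipelined["wide_ns"], conventional["wide_ns"])

print(f"energy:            {energy_change:+.1f}%")
print(f"area:              {area_change:+.1f}%")
print(f"wide-bank latency: {wide_latency_change:+.1f}%")
```

Under those assumptions, total energy per access falls from 0.73 nJ to 0.45 nJ (roughly 38% lower), while total area grows from 1.61 mm² to 1.90 mm² (roughly 18% larger).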

Page 69: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Experiments

SimpleScalar / Alpha 3.0 tool set
Machine Model
  64-entry ROB
  4-wide fetch/issue/commit
  16-entry SQ, 16-entry LQ
  32-entry scheduler
  13-stage pipeline
  64KB I-Cache (2-cyc), 16KB D-Cache (2-cyc)
  2-cycle store to load forwarding

Page 70: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Energy Dissipation

On average, narrow captured scheduling consumes 25% less energy than non-data captured scheduling

[Figure: bar chart. X-axis: Benchmarks (bzip2, mcf, parser, vpr, avg). Y-axis: Total Energy (0 to 1). Series: narrow_refetch, narrow_squash, squash, parallel_selective.]

Page 71: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Mispredicted Load Instructions

Reduce misspeculated loads by 75%-80%

[Figure: stacked bar chart. X-axis: Benchmarks (bzip2, mcf, parser, vpr). Y-axis: Number of Misscheduled Load Instructions (millions), 0 to 14. Series: miss-forward, store no-data, misalign store, cache alias, cache miss.]

Page 72: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Optimized model

Using refetch replay scheme to reduce replay complexity
Clear the scheduler entries once instructions are issued
  Decreases scheduler occupancy
  Instructions enter OoO window sooner
Reduce L1 cache latency from 2-cycle to 1-cycle

Page 73: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Optimized Model Performance

Small variations
Always performs as well or better

[Figure: bar chart. X-axis: Benchmarks (bzip2, mcf, parser, vpr, avg). Y-axis: Speed Up (0.5 to 2). Series: improved narrow_refetch, narrow_refetch, narrow_squash, squash, selective.]

Page 74: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Future Work

Implement a more accurate dynamic power model
Study custom design vs. synthesized model
Study opportunities for leakage power reduction

Page 75: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Delay Model

Processor 0 can reach Processor 15 in 9 fewer cycles

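Circuit Switched Interconnect (cycles from Processor 0; 4x4 grid laid out with Processor 0 at top left and Processor 15 at bottom right)

 --   2   3   4
  2   3   4   6
  3   4   6   7
  4   6   7   9

Baseline Store and Forward Mesh (cycles from Processor 0; same layout)

 --   3   6   9
  3   6   9  12
  6   9  12  15
  9  12  15  18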
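The baseline figures are consistent with a fixed cost per hop on a 4x4 mesh. The sketch below assumes 3 cycles per hop (inferred from the baseline values, not stated on the slide), row-major processor numbering, and dimension-ordered hop counting. The circuit-switched delays are transcribed from the slide rather than modeled, with 0 standing in for the "--" entry. All names are hypothetical.

```python
MESH_DIM = 4        # 4x4 mesh, processors numbered 0..15 in row-major order
CYCLES_PER_HOP = 3  # assumption inferred from the baseline values


def store_and_forward_delay(src, dst):
    """Delay in cycles between two processors on the baseline mesh."""
    src_row, src_col = divmod(src, MESH_DIM)
    dst_row, dst_col = divmod(dst, MESH_DIM)
    hops = abs(src_row - dst_row) + abs(src_col - dst_col)
    return hops * CYCLES_PER_HOP


# Circuit-switched delays from Processor 0, transcribed from the slide
# (0 replaces the "--" entry for Processor 0 itself).
CIRCUIT_SWITCHED_FROM_0 = [
    [0, 2, 3, 4],
    [2, 3, 4, 6],
    [3, 4, 6, 7],
    [4, 6, 7, 9],
]

baseline_table = [
    [store_and_forward_delay(0, row * MESH_DIM + col) for col in range(MESH_DIM)]
    for row in range(MESH_DIM)
]
baseline_to_15 = store_and_forward_delay(0, 15)
circuit_to_15 = CIRCUIT_SWITCHED_FROM_0[3][3]
savings = baseline_to_15 - circuit_to_15

for row in baseline_table:
    print(row)
print("cycles saved reaching Processor 15:", savings)
```

With these assumptions the corner-to-corner path takes 6 hops, or 18 cycles, on the baseline mesh versus 9 cycles on the circuit-switched interconnect, which matches the "9 fewer cycles" on the slide.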

Page 76: Lazy Logic Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison pharm

Pipeline Unrolling