Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

Pipelining

1

Changelog

Changes made in this version not seen in first lecture:5 October 2017: put addq timing slide before critical path slides5 October 2017: slide 25: arrows to fetch/decode registers point tooutputs, not inputs5 October 2017: slide 27: sum with no pipelining was 550 ps, not 500 ps5 October 2017: slide 33: e_dstE and W_dstE labels were swapped5 October 2017: slide 34: rA should have been D_rA, e_valA shouldhave been d_valA

1

logistics

exam graded — scores on gradebookkeys on Collabview exam, submit regrade requests via TPEGSnote: two questions dropped outside of TPEGS

median: 83.5; 25th percentile: 77.3; 75th percentile: 90.7

HCL homework due next Wednesday

2

Human pipeline: laundry

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

3

Human pipeline: laundry

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

3

Waste (1)

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

wasted time!wasted time!

4

Waste (1)

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

wasted time!wasted time!

4

Waste (2)

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

5

Latency — Time for One

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

pipelined latency (2.1 h)

colors colors colors

normal latency (1.8 h)

6


Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets




6


Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets




6

Throughput — Rate of Many

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

time between finishes (0.83 h)

1 load0.83h = 1.2 loads/h

time between starts (0.83 h)

7


Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets




7


Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets




7

times three circuit

AADDADD

ADDADD

2 × A

3 × A

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

21

0 ps 50 ps 100 ps100 ps latency =⇒ 10 results/ns throughput

8

times three circuit

AADDADD

ADDADD

2 × A

3 × A

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

21

0 ps 50 ps 100 ps

100 ps latency =⇒ 10 results/ns throughput

8

times three circuit

AADDADD

ADDADD

2 × A

3 × A

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

21

0 ps 50 ps 100 ps100 ps latency =⇒ 10 results/ns throughput

8

times three and repeat

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

17

34

4

8

1

2

23

46

0 ps 100 ps 200 ps 300 ps 400 ps 500 ps

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

21

17

34

51

4

8

12

1

2

3

23

46

69

0 ps 100 ps 200 ps 300 ps 400 ps 500 ps

9

times three and repeat

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

17

34

4

8

1

2

23

46

0 ps 100 ps 200 ps 300 ps 400 ps 500 ps

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

21

17

34

51

4

8

12

1

2

3

23

46

69

0 ps 100 ps 200 ps 300 ps 400 ps 500 ps

9

pipelined times three

A (t + 2)

ADDADD

ADDADD2 × A (t + 1)

A (t + 1) 3 × A (t + 0)

A (t + 2)A (t + 1)

2 × A (t + 1)3 × A (t + 0)

7714

21

171734

10

pipelined times three

A (t + 2)

ADDADD


A (t + 1) 3 × A (t + 0)

A (t + 2)A (t + 1)

2 × A (t + 1)3 × A (t + 0)

7714

21

171734

10

register tolerances

register outputregister input

outputchanges

inputmust notchange

register delay

11

register tolerances


outputchanges

inputmust notchange

register delay

11

register tolerances


outputchanges

inputmust notchange

register delay

11

times three pipeline timing

A (t + 2)

ADDADD


A (t + 1) 3 × A (t + 0)

10 ps 50 ps 10 ps 50 ps 10 ps

throughput: 160 ps ≈ 16 G operations/sec

12

times three pipeline timing

A (t + 2)

ADDADD


A (t + 1) 3 × A (t + 0)

10 ps 50 ps 10 ps 50 ps 10 ps

throughput: 160 ps ≈ 16 G operations/sec

12

deeper pipeline

A (t + 4)

ADDADD


A (t + 2) 3 × A (t + 0)

10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps

exercise: throughput now?throughput: 135 ps ≈ 28 G ops/sec

30 ps 20 ps

exercise: throughput now? (didn’t split second add evenly)throughput: 140 ps ≈ 25 G ops/sec

A (t + 3)

2 × Apartial results


Problem: How much faster can we get?

Problem: Can we even do this?

13

deeper pipeline

A (t + 4)

ADDADD


A (t + 2) 3 × A (t + 0)



30 ps 20 ps


A (t + 3)





13

deeper pipeline

A (t + 4)

ADDADD


A (t + 2) 3 × A (t + 0)


exercise: throughput now?

throughput: 135 ps ≈ 28 G ops/sec

30 ps 20 ps


A (t + 3)





13

deeper pipeline

A (t + 4)

ADDADD


A (t + 2) 3 × A (t + 0)


exercise: throughput now?


30 ps 20 ps


A (t + 3)





13

deeper pipeline

A (t + 4)

ADDADD


A (t + 2) 3 × A (t + 0)



30 ps 20 ps


A (t + 3)





14

diminishing returns: register delays

logic (all)

100 ps

110 psper cycle

10 ps

logic (1/2)

50 ps

60 psper cycle

10 ps

logic (2/2)

50 ps 10 ps

logic (1/3)

33 ps

43 psper cycle

10 ps

logic (2/3)

33 ps 10 ps

logic (3/3)

33 ps 10 ps... ... ... ...

1 ps

11 psper cycle

10 ps 1 ps 10 ps 1 ps 10 ps 1 ps 10 ps

…

15


2 4 6 8 10 12 140

20

40

60

80

100

120

number of stages

time

perc

ompl

etio

n(p

s)

16


2 4 6 8 10 12 140

20

40

60

80

100

120

register delay

number of stages

time

perc

ompl

etio

n(p

s)

16


2 4 6 8 10 12 140

20

40

60

80

100

120

register delay

1.83x speedup

1.02x speedup

number of stages

time

perc

ompl

etio

n(p

s)

16


2 4 6 8 10 12 140

20

40

60

80

100

1.83x throughput

1.02x throughput

number of stages

thro

ughp

ut(o

ps/n

s)

17


2 4 6 8 10 12 140

20

40

60

80

100

1.83x throughput

1.02x throughput

max. rate of register updates

number of stages

thro

ughp

ut(o

ps/n

s)

17

deeper pipeline

A (t + 4)

ADDADD


A (t + 2) 3 × A (t + 0)



30 ps 20 ps


A (t + 3)





18

deeper pipeline

A (t + 4)

ADDADD


A (t + 2) 3 × A (t + 0)



30 ps 20 ps

exercise: throughput now? (didn’t split second add evenly)


A (t + 3)





19

deeper pipeline

A (t + 4)

ADDADD


A (t + 2) 3 × A (t + 0)



30 ps 20 ps

exercise: throughput now? (didn’t split second add evenly)


A (t + 3)





19

diminishing returns: uneven split

Can we split up some logic (e.g. adder) arbitrarily?Probably not...

logic (all)

100 ps

110 psper cycle

10 ps

logic (1/2)

60 ps

70 psper cycle

10 ps

logic (2/2)

45 ps 10 ps

logic(1/3)

40 ps

50 psper cycle

10 ps

logic(2/3)

40 ps 10 ps

logic(3/3)

30 ps 10 ps... ... ... ... 20



logic (all)

100 ps

110 psper cycle

10 ps

logic (1/2)

60 ps

70 psper cycle

10 ps

logic (2/2)

45 ps 10 ps

logic(1/3)

40 ps

50 psper cycle

10 ps

logic(2/3)

40 ps 10 ps

logic(3/3)

30 ps 10 ps... ... ... ... 20



logic (all)

100 ps

110 psper cycle

10 ps

logic (1/2)

60 ps

70 psper cycle

10 ps

logic (2/2)

45 ps 10 ps

logic(1/3)

40 ps

50 psper cycle

10 ps

logic(2/3)

40 ps 10 ps

logic(3/3)

30 ps 10 ps... ... ... ... 20

textbook SEQ ‘stages’

conceptual order only

Fetch: read instruction memory

Decode: read register file

Execute: arithmetic (ALU)

Memory: read/write data memory

Writeback: write register file

PC Update: write PC register

writes happenat end of cycle

reads — “magic”like combinatorial logicas values available

21











21











21

textbook stages

conceptual order only pipeline stages

Fetch/PC Update: read instruction memory;compute next PC





5 stagesone instruction in eachcompute next to start immediatelly

22

textbook stages

conceptual order only pipeline stages

Fetch/PC Update: read instruction memory;compute next PC





5 stagesone instruction in eachcompute next to start immediatelly

22

addq CPU

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

decode execute

writeback

signal skips two stages

fetch andPC update

23

addq CPU

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

decode execute

writeback


fetch andPC update

23

addq CPU

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

decode execute

writeback


fetch andPC update

23

addq CPU

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

decode execute

writeback


fetch andPC update

23

pipelined addq processor

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2fetch andPC update

decode execute

writeback

fetch/decode

decode/execute

execute/writeback

fetch/fetch

24


PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch andPC update

decode execute

writeback

fetch/decode

decode/execute

execute/writeback

fetch/fetch

24


PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch andPC update

decode execute

writeback

fetch/decode

decode/execute

execute/writeback

fetch/fetch

24


PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch andPC update

decode execute

writeback

fetch/decode

decode/execute

execute/writeback

fetch/fetch

24

addq execution

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch/decode

decode/execute

execute/writeback

fetch/fetch

addq %r8, %r9 // (1)addq %r10, %r11 // (2)

addq %r8, %r9 //(1)

address of (2)

addq %r10, %r11 //(2)

reg #s 8, 9 from (1)reg #s 10, 11 from (2)reg # 9,values for (1)

25

addq execution

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch/decode

decode/execute

execute/writeback

fetch/fetch

addq %r8, %r9 // (1)addq %r10, %r11 // (2)

addq %r8, %r9 //(1)

address of (2)

addq %r10, %r11 //(2)

reg #s 8, 9 from (1)reg #s 10, 11 from (2)reg # 9,values for (1)

25

addq execution

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch/decode

decode/execute

execute/writeback

fetch/fetch

addq %r8, %r9 // (1)addq %r10, %r11 // (2)

addq %r8, %r9 //(1)

address of (2)

addq %r10, %r11 //(2)

reg #s 8, 9 from (1)

reg #s 10, 11 from (2)reg # 9,values for (1)

25

addq execution

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch/decode

decode/execute

execute/writeback

fetch/fetch

addq %r8, %r9 // (1)addq %r10, %r11 // (2)

addq %r8, %r9 //(1)

address of (2)

addq %r10, %r11 //(2)

reg #s 8, 9 from (1)

reg #s 10, 11 from (2)reg # 9,values for (1)

25

addq processor timing

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

// initially %r8 = 800,// %r9 = 900, etc.addq %r8, %r9addq %r10, %r11addq %r12, %r13addq %r9, %r8

fetch rA rB R[srcA] R[srcB] dstE next R[dstE] dstEcycle PC rA rB R[srcA] R[srcB] dstE next R[dstE] dstE0 0x01 0x2 8 92 0x4 10 11 800 900 93 0x6 12 13 1000 1100 11 1700 94 9 8 1200 1300 13 2100 115 1700 800 8 2500 136 2500 8

fetch/decode decode/execute execute/writeback

26


PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2




26


PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2




26


PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2




26


PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2




26

addq processor performance

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

example delays:path timeadd 2 80 psinstruction memory 200 psregister file read 125 psadd 100 psregister file write 125 ps

no pipelining: 1 instruction per 550 psadd up everything but add 2 (critical (slowest) path)

pipelining: 1 instruction per 200 ps + pipeline register delaysslowest path through stage + pipeline register delayslatency: 800 ps + pipeline register delays (4 cycles)

27

critical path

every path from state output to state input needs enough timeoutput — may change on rising edge of clockinput — must be stable sufficiently before rising edge of clock

critical path: slowest of all these paths — determines cycle timetimes three: slowest part of ALU ended up mattering

matters with or without pipelining

28

SEQ paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

path 1: 25 picoseconds path 2: 50 picosecondspath 3: 400 picoseconds path 4: 900 picoseconds… …

…and many, many more paths

29

SEQ paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0


write?

functionof opcode

PC+9

instr.length+

path 1: 25 picoseconds

path 2: 50 picosecondspath 3: 400 picoseconds path 4: 900 picoseconds… …


29

SEQ paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0


write?

functionof opcode

PC+9

instr.length+

path 1: 25 picoseconds path 2: 50 picoseconds

path 3: 400 picoseconds path 4: 900 picoseconds… …


29

SEQ paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0


write?

functionof opcode

PC+9

instr.length+

path 1: 25 picoseconds path 2: 50 picosecondspath 3: 400 picoseconds

path 4: 900 picoseconds… …


29

SEQ paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0


write?

functionof opcode

PC+9

instr.length+



29

SEQ paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0


write?

functionof opcode

PC+9

instr.length+



29

sequential addq paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path 1: 25 picosecondspath 2: 375 picosecondspath 3: 500 picosecondspath 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)

30


PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path 1: 25 picoseconds

path 2: 375 picosecondspath 3: 500 picosecondspath 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)

30


PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path 1: 25 picosecondspath 2: 375 picoseconds

path 3: 500 picosecondspath 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)

30


PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path 1: 25 picosecondspath 2: 375 picosecondspath 3: 500 picoseconds

path 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)

30


PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path 1: 25 picosecondspath 2: 375 picosecondspath 3: 500 picosecondspath 4: 500 picoseconds

overall cycle time: 500 picoseconds (longest path)

30


PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path 1: 25 picosecondspath 2: 375 picosecondspath 3: 500 picosecondspath 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)

30

pipelined addq paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2 path 1: 80 picosecondspath 2: 210 picosecondspath 3: 210 picosecondspath 4: 135 picosecondspath 5: 110 picosecondspath 6: 135 picoseconds…overall cycle time: 210 picoseconds

31

pipelined addq paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2 path 1: 80 picosecondspath 2: 210 picosecondspath 3: 210 picosecondspath 4: 135 picosecondspath 5: 110 picosecondspath 6: 135 picoseconds…overall cycle time: 210 picoseconds

31

exercise

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path timeadd 2 50 psinstruction memory 200 psregister file read 125 psadd 100 psregister file write 125 ps

pipeline register delay: 10ps

how will throughput improve if we double the speed of theinstruction memory?A. 2.00x B. 1.70x to 1.99xC. 1.60x to 1.69x D. 1.50x to 1.59xE. less than 1.50x

32

exercise

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path timeadd 2 50 psinstruction memory 200 psregister file read 125 psadd 100 psregister file write 125 ps

pipeline register delay: 10ps

how will throughput improve if we double the speed of theinstruction memory?A. 2.00x B. 1.70x to 1.99xC. 1.60x to 1.69x D. 1.50x to 1.59xE. less than 1.50x

1135

÷ 1210

= 1.56x — D

32

pipeline register naming convention

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

f_rA D_rA

d_dstE E_dstE

e_dstEW_dstE33

pipeline register naming convention

f — fetch sends values here

D — decode receives values here

d — decode sends values here

…

34

addq HCL

.../* f: from fetch */f_rA = i10bytes[12..16];f_rB = i10bytes[12..16];

/* fetch to decode *//* f_rA -> D_rA, etc. */register fD {

rA : 4 = REG_NONE;rB : 4 = REG_NONE;

}

/* D: to decoded: from decode */

d_dstE = D_rB;/* use register file: */reg_srcA = D_rA;d_valA = reg_outputA;...

/* decode to execute */register dE {

dstE : 4 = REG_NONE;valA : 64 = 0;valB : 64 = 0;

}

...35

SEQ without stages

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0


write?

functionof opcode

PC+9

instr.length+

36

SEQ with stages

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0


write?

functionof opcode

PC+9

instr.length+

fetch

decode executememory

writeback

rule: signal to next stage (except flow control)

37

SEQ with stages

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0


write?

functionof opcode

PC+9

instr.length+

fetch


writeback


37

SEQ with stages

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0


write?

functionof opcode

PC+9

instr.length+

fetch


writeback


37

SEQ with stages (actually sequential)

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0


write?

functionof opcode

PC+9

instr.length+

fetch


writeback38

adding pipeline registers

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0


write?

functionof opcode

PC+9

instr.length+

fetch


writeback

not shown — control logic

39

adding pipeline registers

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0


write?

functionof opcode

PC+9

instr.length+

fetch


writeback

not shown — control logic

39

Documents

Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps