24
Pipelining 1 Changelog Changes made in this version not seen in first lecture: 5 October 2017: put addq timing slide before critical path slides 5 October 2017: slide 25: arrows to fetch/decode registers point to outputs, not inputs 5 October 2017: slide 27: sum with no pipelining was 550 ps, not 500 ps 5 October 2017: slide 33: e_dstE and W_dstE labels were swapped 5 October 2017: slide 34: rA should have been D_rA, e_valA should have been d_valA 1 logistics exam graded — scores on gradebook keys on Collab view exam, submit regrade requests via TPEGS note: two questions dropped outside of TPEGS median: 83.5; 25th percentile: 77.3; 75th percentile: 90.7 HCL homework due next Wednesday 2 Human pipeline: laundry Washer Dryer Folding Table 11:00 12:00 13:00 14:00 whites whites whites colors colors colors 3

Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

Pipelining

1

Changelog

Changes made in this version not seen in first lecture:5 October 2017: put addq timing slide before critical path slides5 October 2017: slide 25: arrows to fetch/decode registers point tooutputs, not inputs5 October 2017: slide 27: sum with no pipelining was 550 ps, not 500 ps5 October 2017: slide 33: e_dstE and W_dstE labels were swapped5 October 2017: slide 34: rA should have been D_rA, e_valA shouldhave been d_valA

1

logistics

exam graded — scores on gradebookkeys on Collabview exam, submit regrade requests via TPEGSnote: two questions dropped outside of TPEGS

median: 83.5; 25th percentile: 77.3; 75th percentile: 90.7

HCL homework due next Wednesday

2

Human pipeline: laundry

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

3

Page 2: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

Human pipeline: laundry

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

3

Waste (1)

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

wasted time!wasted time!

4

Waste (1)

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

wasted time!wasted time!

4

Waste (2)

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

5

Page 3: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

Latency — Time for One

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

pipelined latency (2.1 h)

colors colors colors

normal latency (1.8 h)

6

Latency — Time for One

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

pipelined latency (2.1 h)

colors colors colors

normal latency (1.8 h)

6

Latency — Time for One

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

pipelined latency (2.1 h)

colors colors colors

normal latency (1.8 h)

6

Throughput — Rate of Many

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

time between finishes (0.83 h)

1 load0.83h = 1.2 loads/h

time between starts (0.83 h)

7

Page 4: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

Throughput — Rate of Many

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

time between finishes (0.83 h)

1 load0.83h = 1.2 loads/h

time between starts (0.83 h)

7

Throughput — Rate of Many

Washer

Dryer

FoldingTable

11:00 12:00 13:00 14:00

whites

whites

whites

colors

colors

colors

sheets

sheets

sheets

time between finishes (0.83 h)

1 load0.83h = 1.2 loads/h

time between starts (0.83 h)

7

times three circuit

AADDADD

ADDADD

2 × A

3 × A

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

21

0 ps 50 ps 100 ps100 ps latency =⇒ 10 results/ns throughput

8

times three circuit

AADDADD

ADDADD

2 × A

3 × A

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

21

0 ps 50 ps 100 ps

100 ps latency =⇒ 10 results/ns throughput

8

Page 5: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

times three circuit

AADDADD

ADDADD

2 × A

3 × A

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

21

0 ps 50 ps 100 ps100 ps latency =⇒ 10 results/ns throughput

8

times three and repeat

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

17

34

4

8

1

2

23

46

0 ps 100 ps 200 ps 300 ps 400 ps 500 ps

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

21

17

34

51

4

8

12

1

2

3

23

46

69

0 ps 100 ps 200 ps 300 ps 400 ps 500 ps

9

times three and repeat

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

17

34

4

8

1

2

23

46

0 ps 100 ps 200 ps 300 ps 400 ps 500 ps

Aadd A + A

2 × Aadd 2A + A

3 × A

7

14

21

17

34

51

4

8

12

1

2

3

23

46

69

0 ps 100 ps 200 ps 300 ps 400 ps 500 ps

9

pipelined times three

A (t + 2)

ADDADD

ADDADD2 × A (t + 1)

A (t + 1) 3 × A (t + 0)

A (t + 2)A (t + 1)

2 × A (t + 1)3 × A (t + 0)

7714

21

171734

10

Page 6: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

pipelined times three

A (t + 2)

ADDADD

ADDADD2 × A (t + 1)

A (t + 1) 3 × A (t + 0)

A (t + 2)A (t + 1)

2 × A (t + 1)3 × A (t + 0)

7714

21

171734

10

register tolerances

register outputregister input

outputchanges

inputmust notchange

register delay

11

register tolerances

register outputregister input

outputchanges

inputmust notchange

register delay

11

register tolerances

register outputregister input

outputchanges

inputmust notchange

register delay

11

Page 7: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

times three pipeline timing

A (t + 2)

ADDADD

ADDADD2 × A (t + 1)

A (t + 1) 3 × A (t + 0)

10 ps 50 ps 10 ps 50 ps 10 ps

throughput: 160 ps ≈ 16 G operations/sec

12

times three pipeline timing

A (t + 2)

ADDADD

ADDADD2 × A (t + 1)

A (t + 1) 3 × A (t + 0)

10 ps 50 ps 10 ps 50 ps 10 ps

throughput: 160 ps ≈ 16 G operations/sec

12

deeper pipeline

A (t + 4)

ADDADD

ADDADD2 × A (t + 2)

A (t + 2) 3 × A (t + 0)

10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps

exercise: throughput now?throughput: 135 ps ≈ 28 G ops/sec

30 ps 20 ps

exercise: throughput now? (didn’t split second add evenly)throughput: 140 ps ≈ 25 G ops/sec

A (t + 3)

2 × Apartial results

3 × Apartial results

Problem: How much faster can we get?

Problem: Can we even do this?

13

deeper pipeline

A (t + 4)

ADDADD

ADDADD2 × A (t + 2)

A (t + 2) 3 × A (t + 0)

10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps

exercise: throughput now?throughput: 135 ps ≈ 28 G ops/sec

30 ps 20 ps

exercise: throughput now? (didn’t split second add evenly)throughput: 140 ps ≈ 25 G ops/sec

A (t + 3)

2 × Apartial results

3 × Apartial results

Problem: How much faster can we get?

Problem: Can we even do this?

13

Page 8: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

deeper pipeline

A (t + 4)

ADDADD

ADDADD2 × A (t + 2)

A (t + 2) 3 × A (t + 0)

10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps

exercise: throughput now?

throughput: 135 ps ≈ 28 G ops/sec

30 ps 20 ps

exercise: throughput now? (didn’t split second add evenly)throughput: 140 ps ≈ 25 G ops/sec

A (t + 3)

2 × Apartial results

3 × Apartial results

Problem: How much faster can we get?

Problem: Can we even do this?

13

deeper pipeline

A (t + 4)

ADDADD

ADDADD2 × A (t + 2)

A (t + 2) 3 × A (t + 0)

10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps

exercise: throughput now?

throughput: 135 ps ≈ 28 G ops/sec

30 ps 20 ps

exercise: throughput now? (didn’t split second add evenly)throughput: 140 ps ≈ 25 G ops/sec

A (t + 3)

2 × Apartial results

3 × Apartial results

Problem: How much faster can we get?

Problem: Can we even do this?

13

deeper pipeline

A (t + 4)

ADDADD

ADDADD2 × A (t + 2)

A (t + 2) 3 × A (t + 0)

10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps

exercise: throughput now?throughput: 135 ps ≈ 28 G ops/sec

30 ps 20 ps

exercise: throughput now? (didn’t split second add evenly)throughput: 140 ps ≈ 25 G ops/sec

A (t + 3)

2 × Apartial results

3 × Apartial results

Problem: How much faster can we get?

Problem: Can we even do this?

14

diminishing returns: register delays

logic (all)

100 ps

110 psper cycle

10 ps

logic (1/2)

50 ps

60 psper cycle

10 ps

logic (2/2)

50 ps 10 ps

logic (1/3)

33 ps

43 psper cycle

10 ps

logic (2/3)

33 ps 10 ps

logic (3/3)

33 ps 10 ps... ... ... ...

1 ps

11 psper cycle

10 ps 1 ps 10 ps 1 ps 10 ps 1 ps 10 ps

15

Page 9: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

diminishing returns: register delays

2 4 6 8 10 12 140

20

40

60

80

100

120

number of stages

time

perc

ompl

etio

n(p

s)

16

diminishing returns: register delays

2 4 6 8 10 12 140

20

40

60

80

100

120

register delay

number of stages

time

perc

ompl

etio

n(p

s)

16

diminishing returns: register delays

2 4 6 8 10 12 140

20

40

60

80

100

120

register delay

1.83x speedup

1.02x speedup

number of stages

time

perc

ompl

etio

n(p

s)

16

diminishing returns: register delays

2 4 6 8 10 12 140

20

40

60

80

100

1.83x throughput

1.02x throughput

number of stages

thro

ughp

ut(o

ps/n

s)

17

Page 10: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

diminishing returns: register delays

2 4 6 8 10 12 140

20

40

60

80

100

1.83x throughput

1.02x throughput

max. rate of register updates

number of stages

thro

ughp

ut(o

ps/n

s)

17

deeper pipeline

A (t + 4)

ADDADD

ADDADD2 × A (t + 2)

A (t + 2) 3 × A (t + 0)

10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps

exercise: throughput now?throughput: 135 ps ≈ 28 G ops/sec

30 ps 20 ps

exercise: throughput now? (didn’t split second add evenly)throughput: 140 ps ≈ 25 G ops/sec

A (t + 3)

2 × Apartial results

3 × Apartial results

Problem: How much faster can we get?

Problem: Can we even do this?

18

deeper pipeline

A (t + 4)

ADDADD

ADDADD2 × A (t + 2)

A (t + 2) 3 × A (t + 0)

10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps

exercise: throughput now?throughput: 135 ps ≈ 28 G ops/sec

30 ps 20 ps

exercise: throughput now? (didn’t split second add evenly)

throughput: 140 ps ≈ 25 G ops/sec

A (t + 3)

2 × Apartial results

3 × Apartial results

Problem: How much faster can we get?

Problem: Can we even do this?

19

deeper pipeline

A (t + 4)

ADDADD

ADDADD2 × A (t + 2)

A (t + 2) 3 × A (t + 0)

10 ps 25 ps 25 ps10 ps 10 ps 25 ps 10 ps 25 ps 10 ps

exercise: throughput now?throughput: 135 ps ≈ 28 G ops/sec

30 ps 20 ps

exercise: throughput now? (didn’t split second add evenly)

throughput: 140 ps ≈ 25 G ops/sec

A (t + 3)

2 × Apartial results

3 × Apartial results

Problem: How much faster can we get?

Problem: Can we even do this?

19

Page 11: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

diminishing returns: uneven split

Can we split up some logic (e.g. adder) arbitrarily?Probably not...

logic (all)

100 ps

110 psper cycle

10 ps

logic (1/2)

60 ps

70 psper cycle

10 ps

logic (2/2)

45 ps 10 ps

logic(1/3)

40 ps

50 psper cycle

10 ps

logic(2/3)

40 ps 10 ps

logic(3/3)

30 ps 10 ps... ... ... ... 20

diminishing returns: uneven split

Can we split up some logic (e.g. adder) arbitrarily?Probably not...

logic (all)

100 ps

110 psper cycle

10 ps

logic (1/2)

60 ps

70 psper cycle

10 ps

logic (2/2)

45 ps 10 ps

logic(1/3)

40 ps

50 psper cycle

10 ps

logic(2/3)

40 ps 10 ps

logic(3/3)

30 ps 10 ps... ... ... ... 20

diminishing returns: uneven split

Can we split up some logic (e.g. adder) arbitrarily?Probably not...

logic (all)

100 ps

110 psper cycle

10 ps

logic (1/2)

60 ps

70 psper cycle

10 ps

logic (2/2)

45 ps 10 ps

logic(1/3)

40 ps

50 psper cycle

10 ps

logic(2/3)

40 ps 10 ps

logic(3/3)

30 ps 10 ps... ... ... ... 20

textbook SEQ ‘stages’

conceptual order only

Fetch: read instruction memory

Decode: read register file

Execute: arithmetic (ALU)

Memory: read/write data memory

Writeback: write register file

PC Update: write PC register

writes happenat end of cycle

reads — “magic”like combinatorial logicas values available

21

Page 12: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

textbook SEQ ‘stages’

conceptual order only

Fetch: read instruction memory

Decode: read register file

Execute: arithmetic (ALU)

Memory: read/write data memory

Writeback: write register file

PC Update: write PC register

writes happenat end of cycle

reads — “magic”like combinatorial logicas values available

21

textbook SEQ ‘stages’

conceptual order only

Fetch: read instruction memory

Decode: read register file

Execute: arithmetic (ALU)

Memory: read/write data memory

Writeback: write register file

PC Update: write PC register

writes happenat end of cycle

reads — “magic”like combinatorial logicas values available

21

textbook stages

conceptual order only pipeline stages

Fetch/PC Update: read instruction memory;compute next PC

Decode: read register file

Execute: arithmetic (ALU)

Memory: read/write data memory

Writeback: write register file

5 stagesone instruction in eachcompute next to start immediatelly

22

textbook stages

conceptual order only pipeline stages

Fetch/PC Update: read instruction memory;compute next PC

Decode: read register file

Execute: arithmetic (ALU)

Memory: read/write data memory

Writeback: write register file

5 stagesone instruction in eachcompute next to start immediatelly

22

Page 13: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

addq CPU

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

decode execute

writeback

signal skips two stages

fetch andPC update

23

addq CPU

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

decode execute

writeback

signal skips two stages

fetch andPC update

23

addq CPU

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

decode execute

writeback

signal skips two stages

fetch andPC update

23

addq CPU

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

decode execute

writeback

signal skips two stages

fetch andPC update

23

Page 14: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

pipelined addq processor

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2fetch andPC update

decode execute

writeback

fetch/decode

decode/execute

execute/writeback

fetch/fetch

24

pipelined addq processor

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch andPC update

decode execute

writeback

fetch/decode

decode/execute

execute/writeback

fetch/fetch

24

pipelined addq processor

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch andPC update

decode execute

writeback

fetch/decode

decode/execute

execute/writeback

fetch/fetch

24

pipelined addq processor

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch andPC update

decode execute

writeback

fetch/decode

decode/execute

execute/writeback

fetch/fetch

24

Page 15: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

addq execution

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch/decode

decode/execute

execute/writeback

fetch/fetch

addq %r8, %r9 // (1)addq %r10, %r11 // (2)

addq %r8, %r9 //(1)

address of (2)

addq %r10, %r11 //(2)

reg #s 8, 9 from (1)reg #s 10, 11 from (2)reg # 9,values for (1)

25

addq execution

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch/decode

decode/execute

execute/writeback

fetch/fetch

addq %r8, %r9 // (1)addq %r10, %r11 // (2)

addq %r8, %r9 //(1)

address of (2)

addq %r10, %r11 //(2)

reg #s 8, 9 from (1)reg #s 10, 11 from (2)reg # 9,values for (1)

25

addq execution

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch/decode

decode/execute

execute/writeback

fetch/fetch

addq %r8, %r9 // (1)addq %r10, %r11 // (2)

addq %r8, %r9 //(1)

address of (2)

addq %r10, %r11 //(2)

reg #s 8, 9 from (1)

reg #s 10, 11 from (2)reg # 9,values for (1)

25

addq execution

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

fetch/decode

decode/execute

execute/writeback

fetch/fetch

addq %r8, %r9 // (1)addq %r10, %r11 // (2)

addq %r8, %r9 //(1)

address of (2)

addq %r10, %r11 //(2)

reg #s 8, 9 from (1)

reg #s 10, 11 from (2)reg # 9,values for (1)

25

Page 16: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

addq processor timing

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

// initially %r8 = 800,// %r9 = 900, etc.addq %r8, %r9addq %r10, %r11addq %r12, %r13addq %r9, %r8

fetch rA rB R[srcA] R[srcB] dstE next R[dstE] dstEcycle PC rA rB R[srcA] R[srcB] dstE next R[dstE] dstE0 0x01 0x2 8 92 0x4 10 11 800 900 93 0x6 12 13 1000 1100 11 1700 94 9 8 1200 1300 13 2100 115 1700 800 8 2500 136 2500 8

fetch/decode decode/execute execute/writeback

26

addq processor timing

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

// initially %r8 = 800,// %r9 = 900, etc.addq %r8, %r9addq %r10, %r11addq %r12, %r13addq %r9, %r8

fetch rA rB R[srcA] R[srcB] dstE next R[dstE] dstEcycle PC rA rB R[srcA] R[srcB] dstE next R[dstE] dstE0 0x01 0x2 8 92 0x4 10 11 800 900 93 0x6 12 13 1000 1100 11 1700 94 9 8 1200 1300 13 2100 115 1700 800 8 2500 136 2500 8

fetch/decode decode/execute execute/writeback

26

addq processor timing

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

// initially %r8 = 800,// %r9 = 900, etc.addq %r8, %r9addq %r10, %r11addq %r12, %r13addq %r9, %r8

fetch rA rB R[srcA] R[srcB] dstE next R[dstE] dstEcycle PC rA rB R[srcA] R[srcB] dstE next R[dstE] dstE0 0x01 0x2 8 92 0x4 10 11 800 900 93 0x6 12 13 1000 1100 11 1700 94 9 8 1200 1300 13 2100 115 1700 800 8 2500 136 2500 8

fetch/decode decode/execute execute/writeback

26

addq processor timing

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

// initially %r8 = 800,// %r9 = 900, etc.addq %r8, %r9addq %r10, %r11addq %r12, %r13addq %r9, %r8

fetch rA rB R[srcA] R[srcB] dstE next R[dstE] dstEcycle PC rA rB R[srcA] R[srcB] dstE next R[dstE] dstE0 0x01 0x2 8 92 0x4 10 11 800 900 93 0x6 12 13 1000 1100 11 1700 94 9 8 1200 1300 13 2100 115 1700 800 8 2500 136 2500 8

fetch/decode decode/execute execute/writeback

26

Page 17: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

addq processor timing

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

// initially %r8 = 800,// %r9 = 900, etc.addq %r8, %r9addq %r10, %r11addq %r12, %r13addq %r9, %r8

fetch rA rB R[srcA] R[srcB] dstE next R[dstE] dstEcycle PC rA rB R[srcA] R[srcB] dstE next R[dstE] dstE0 0x01 0x2 8 92 0x4 10 11 800 900 93 0x6 12 13 1000 1100 11 1700 94 9 8 1200 1300 13 2100 115 1700 800 8 2500 136 2500 8

fetch/decode decode/execute execute/writeback

26

addq processor performance

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

example delays:path timeadd 2 80 psinstruction memory 200 psregister file read 125 psadd 100 psregister file write 125 ps

no pipelining: 1 instruction per 550 psadd up everything but add 2 (critical (slowest) path)

pipelining: 1 instruction per 200 ps + pipeline register delaysslowest path through stage + pipeline register delayslatency: 800 ps + pipeline register delays (4 cycles)

27

critical path

every path from state output to state input needs enough timeoutput — may change on rising edge of clockinput — must be stable sufficiently before rising edge of clock

critical path: slowest of all these paths — determines cycle timetimes three: slowest part of ALU ended up mattering

matters with or without pipelining

28

SEQ paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

path 1: 25 picoseconds path 2: 50 picosecondspath 3: 400 picoseconds path 4: 900 picoseconds… …

…and many, many more paths

29

Page 18: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

SEQ paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

path 1: 25 picoseconds

path 2: 50 picosecondspath 3: 400 picoseconds path 4: 900 picoseconds… …

…and many, many more paths

29

SEQ paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

path 1: 25 picoseconds path 2: 50 picoseconds

path 3: 400 picoseconds path 4: 900 picoseconds… …

…and many, many more paths

29

SEQ paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

path 1: 25 picoseconds path 2: 50 picosecondspath 3: 400 picoseconds

path 4: 900 picoseconds… …

…and many, many more paths

29

SEQ paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

path 1: 25 picoseconds path 2: 50 picosecondspath 3: 400 picoseconds path 4: 900 picoseconds… …

…and many, many more paths

29

Page 19: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

SEQ paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

path 1: 25 picoseconds path 2: 50 picosecondspath 3: 400 picoseconds path 4: 900 picoseconds… …

…and many, many more paths

29

sequential addq paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path 1: 25 picosecondspath 2: 375 picosecondspath 3: 500 picosecondspath 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)

30

sequential addq paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path 1: 25 picoseconds

path 2: 375 picosecondspath 3: 500 picosecondspath 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)

30

sequential addq paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path 1: 25 picosecondspath 2: 375 picoseconds

path 3: 500 picosecondspath 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)

30

Page 20: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

sequential addq paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path 1: 25 picosecondspath 2: 375 picosecondspath 3: 500 picoseconds

path 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)

30

sequential addq paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path 1: 25 picosecondspath 2: 375 picosecondspath 3: 500 picosecondspath 4: 500 picoseconds

overall cycle time: 500 picoseconds (longest path)

30

sequential addq paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path 1: 25 picosecondspath 2: 375 picosecondspath 3: 500 picosecondspath 4: 500 picosecondsoverall cycle time: 500 picoseconds (longest path)

30

pipelined addq paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2 path 1: 80 picosecondspath 2: 210 picosecondspath 3: 210 picosecondspath 4: 135 picosecondspath 5: 110 picosecondspath 6: 135 picoseconds…overall cycle time: 210 picoseconds

31

Page 21: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

pipelined addq paths

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2 path 1: 80 picosecondspath 2: 210 picosecondspath 3: 210 picosecondspath 4: 135 picosecondspath 5: 110 picosecondspath 6: 135 picoseconds…overall cycle time: 210 picoseconds

31

exercise

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path timeadd 2 50 psinstruction memory 200 psregister file read 125 psadd 100 psregister file write 125 ps

pipeline register delay: 10ps

how will throughput improve if we double the speed of theinstruction memory?A. 2.00x B. 1.70x to 1.99xC. 1.60x to 1.69x D. 1.50x to 1.59xE. less than 1.50x

32

exercise

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

path timeadd 2 50 psinstruction memory 200 psregister file read 125 psadd 100 psregister file write 125 ps

pipeline register delay: 10ps

how will throughput improve if we double the speed of theinstruction memory?A. 2.00x B. 1.70x to 1.99xC. 1.60x to 1.69x D. 1.50x to 1.59xE. less than 1.50x

1135

÷ 1210

= 1.56x — D

32

pipeline register naming convention

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

split

0xF

ADDADD

add 2

f_rA D_rA

d_dstE E_dstE

e_dstEW_dstE33

Page 22: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

pipeline register naming convention

f — fetch sends values here

D — decode receives values here

d — decode sends values here

34

addq HCL

.../* f: from fetch */f_rA = i10bytes[12..16];f_rB = i10bytes[12..16];

/* fetch to decode *//* f_rA -> D_rA, etc. */register fD {

rA : 4 = REG_NONE;rB : 4 = REG_NONE;

}

/* D: to decoded: from decode */

d_dstE = D_rB;/* use register file: */reg_srcA = D_rA;d_valA = reg_outputA;...

/* decode to execute */register dE {

dstE : 4 = REG_NONE;valA : 64 = 0;valB : 64 = 0;

}

...35

SEQ without stages

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

36

SEQ with stages

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

fetch

decode executememory

writeback

rule: signal to next stage (except flow control)

37

Page 23: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

SEQ with stages

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

fetch

decode executememory

writeback

rule: signal to next stage (except flow control)

37

SEQ with stages

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

fetch

decode executememory

writeback

rule: signal to next stage (except flow control)

37

SEQ with stages (actually sequential)

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

fetch

decode executememory

writeback38

adding pipeline registers

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

fetch

decode executememory

writeback

not shown — control logic

39

Page 24: Changelogcr4bd/3330/F2017/notes/... · 2017-10-05 · timesthreecircuit A ADD ADD ADD ADD 2×A 3×A A addA+A 2×A add2A+A 3×A 7 14 21 100pslatency0ps 50ps=⇒ 10results/nsthroughput100ps

adding pipeline registers

PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

fetch

decode executememory

writeback

not shown — control logic

39