
Chapter 1 Solutions
Chapter 2 Solutions
Chapter 3 Solutions
Chapter 4 Solutions
Chapter 5 Solutions
Chapter 6 Solutions


Solutions to Alternate Case Study Exercises

Chapter 1 Solutions

Case Study 1: Chip Fabrication Cost

1.1 a. Yield = (1 + (0.30 × 3.89)/4.0)^(–4) = 0.36

b. It is fabricated in a larger technology, which is an older plant. As plants age, their process gets tuned, and the defect rate decreases.

1.2 a. Dies per wafer = [π × (30/2)^2]/1.5 – [π × 30]/sqrt(2 × 1.5) = 471 – 54.4 = 416
   Yield = (1 + (0.30 × 1.5)/4.0)^(–4) = 0.65
   Profit = 416 × 0.65 × $20 = $5408

b. Dies per wafer = [π × (30/2)^2]/2.5 – [π × 30]/sqrt(2 × 2.5) = 283 – 42.1 = 240
   Yield = (1 + (0.30 × 2.5)/4.0)^(–4) = 0.50
   Profit = 240 × 0.50 × $25 = $3000

c. The Woods chip

d. Woods chips: 50,000/416 = 120.2 wafers needed
   Markon chips: 25,000/240 = 104.2 wafers needed
   Therefore, the most lucrative split is 120 Woods wafers, 30 Markon wafers.

1.3 a. Defect-free single core = (1 + (0.75 × 1.99/2)/4.0)^(–4) = 0.28
   No defects = 0.28^2 = 0.08
   One defect = 0.28 × 0.72 × 2 = 0.40
   No more than one defect = 0.08 + 0.40 = 0.48

b. $20 = Wafer size/(old dpw × 0.28)
   x = Wafer size/(1/2 × old dpw × 0.48) = ($20 × 0.28)/(1/2 × 0.48) = $23.33
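The two formulas above are mechanical to evaluate, so here is a minimal C sketch of them (the function names and printout are ours, not part of the original solution; the printed dies-per-wafer values differ by a die or so from the hand calculation because the solution truncates intermediate results):

#include <math.h>
#include <stdio.h>

static const double PI = 3.14159265358979323846;

/* Dies per wafer = pi*(d/2)^2/area - pi*d/sqrt(2*area) */
static double dies_per_wafer(double wafer_diam_cm, double die_area_cm2) {
    double r = wafer_diam_cm / 2.0;
    return PI * r * r / die_area_cm2
         - PI * wafer_diam_cm / sqrt(2.0 * die_area_cm2);
}

/* Yield = (1 + defect_density*area/alpha)^(-alpha), with alpha = 4.0 here */
static double die_yield(double defects_per_cm2, double die_area_cm2, double alpha) {
    return pow(1.0 + defects_per_cm2 * die_area_cm2 / alpha, -alpha);
}

int main(void) {
    /* Woods: 1.5 cm^2 dies; Markon: 2.5 cm^2 dies; 30 cm wafers, 0.30 defects/cm^2 */
    printf("Woods:  %.0f dies/wafer, yield %.2f\n",
           dies_per_wafer(30.0, 1.5), die_yield(0.30, 1.5, 4.0));
    printf("Markon: %.0f dies/wafer, yield %.2f\n",
           dies_per_wafer(30.0, 2.5), die_yield(0.30, 2.5, 4.0));
    return 0;
}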


Case Study 2: Power Consumption in Computer Systems

1.4 a.

.80

x

= 66 + 2

×

2.3 + 7.9;

x

= 99

b.

.6

×

4 W + .4

×

7.9 = 5.56

c.

Solve the following four equations:

seek7200 = .75

×

seek5400seek7200 + idle7200 = 100seek5400 + idle5400 = 100seek7200

×

7.9 + idle7200

×

4 = seek5400

×

7 + idle5400

×

2.9

idle7200 = 29.8%

1.5 a.

b.

c.

200 W

×

11 = 2200 W2200 / (76.2) = 28 racksOnly 1 cooling door is required.

1.6 a.

The IBMx346 could take less space, which would save money in real estate.The racks might be better laid out. It could also be much cheaper. In addition,if we were running applications that did not match the characteristics of thesebenchmarks, the IBM x346 might be faster. Finally, there are no reliabilitynumbers shown. Although we do not know that the IBM x346 is better in anyof these areas, we do not know it is worse, either.

1.7 a.

(1 – 8) + .8/2 = .2 + .4 = .6

b.

c.

;

x

= 50%

d.
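Since 1.7(b) and 1.7(d) both reduce to the same dynamic-power ratio, a small C sketch may be useful (the helper name is ours; the ratio is simply (V × sv)^2 × (F × sf)/(V^2 × F) = sv^2 × sf):

#include <stdio.h>

/* Dynamic power scales with V^2 * F, so scaling V by sv and F by sf
 * scales power by sv^2 * sf. */
static double power_ratio(double volt_scale, double freq_scale) {
    return volt_scale * volt_scale * freq_scale;
}

int main(void) {
    printf("1.7b: %.3f\n", power_ratio(0.60, 0.60));  /* 0.60^3       = 0.216 */
    printf("1.7d: %.3f\n", power_ratio(0.75, 0.60));  /* 0.75^2 * 0.6 = 0.338 */
    return 0;
}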

Case Study 3: The Cost of Reliability (and Failure) in Web Servers

1.8 a. This date represents 3.6/108 = 3.3% of that quarter's sales.
   Percentage of online sales is 106 million/3.9 billion = 2.7%.
   Therefore, 3.3% of the 2.7% of the $4.8 billion dollars in sales for the 4th quarter in 2005 is $4.3 million for that one day.

b. $130 million × .20 × .01 = $260,000



c. Assuming the 4.2 million visitors are not unique, but are actually the unique visitors each day summed across a month:
   4.2 million × 8.9 = 37.4 million transactions per month
   $5.38 × 37.4 million = $201 million per month

1.9 a. FIT = 10^9/MTTF
   MTTF = 10^9/FIT = 10^9/150 = 6,666,667

b. Availability = MTTF/(MTTF + MTTR) = 6.7 × 10^6/(6.7 × 10^6 + 48) = about 100%

1.10 MTTF = 6,666,667/100 = 66,667

1.11 a. Assuming that we do not repair the computers, we wait for how long it takes for 20,000 × 2/5 = 8000 computers to fail.
   8000 × 6,666,667 = 5.3 × 10^10 hours

b. Each computer is responsible for ($2.98 billion/31)/20,000 = $4806 in sales per day. The mean time to failure for a single computer of the 20,000 is 6,666,667/20,000 = 333 hours. Therefore, a computer fails every 333 hours/24 = 13.9 days. It requires 2 days to fix it, so a single computer is down for approximately 2 out of every 16 days. This means that they lose, on average, $4806/8 = $600 per day to computers being down.

c. It would cost 20,000 × $500 = $10,000,000 to upgrade the computers. This would take 10,000,000/600 = 16,666 days to recoup the costs. This would not be worth it.
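The reliability relations used in 1.9 through 1.11 are short enough to check in a few lines of C (a minimal sketch; the variable names are ours, and the 48-hour MTTR is taken from the availability expression in 1.9(b)):

#include <stdio.h>

int main(void) {
    double fit  = 150.0;                 /* failures per 10^9 hours         */
    double mttf = 1e9 / fit;             /* 6,666,667 hours (1.9a)          */
    double mttr = 48.0;                  /* hours (1.9b)                    */
    double availability = mttf / (mttf + mttr);   /* about 100% (1.9b)      */
    double system_mttf  = mttf / 100.0;  /* 100 machines (1.10): 66,667 h   */

    printf("MTTF         = %.0f hours\n", mttf);
    printf("Availability = %.6f\n", availability);
    printf("System MTTF  = %.0f hours\n", system_mttf);
    return 0;
}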

Case Study 4: Performance

1.12 a. See Figure L.1.

Chip Memory performance Dhrystone performance

Athlon 64 X2 4800+ 0.49 1.00

Pentium EE 840 0.46 0.91

Pentium D 820 0.43 0.73

Athlon 64 X2 3800+ 0.42 0.83

Pentium 4 0.39 0.37

Athlon 64 3000+ 0.42 0.37

Pentium 4 570 0.50 0.54

Processor X 1.00 0.24

Figure L.1 Performance of several processors normalized to the fastest one.



b. Dual processors: .449
   Single processors: .536

c. 11210 × x + 3501 × (1 – x) = 15220 × x + 3000 × (1 – x)
   501 = 4511x
   x = 11.1%

1.13 a. Pentium 4: .3 × 2731 + .7 × 7621 = 6154
   Athlon 64 X2 3800+: .3 × 2941 + .7 × 17129 = 12873

b. 2941/2731 = 1.08x speedup

c. This is a trade-off between memory operations and processor operations. The advantage of the look-up table is that it requires fewer instructions to retrieve the data. The disadvantage is that it requires both more memory and more memory operations. By taking more memory, it might displace items in the cache that will be used, thus reducing the hit rate in the cache. In addition, memory operations are difficult for the computer to optimize, and they can be very high latency, so we are trading a few memory operations for many processor operations. It is not clear which is the best choice.

1.14 a. Amdahl's Law: 1/(.4 + .6/2) = 1.43x speedup

b. Amdahl's Law: 1/(.05 + .95/2) = 1.90x speedup

c. Amdahl's Law: 1/(.25 + .75 × (.4 + .6/2)) = 1.29x speedup

d. Amdahl's Law: 1/(.25 × (.05 + .95/2) + .75 × (.4 + .6/2)) = 1.84x speedup
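The Amdahl's Law evaluations above follow the same pattern, so a tiny C helper makes them easy to reproduce (a sketch with our own function name; part (c) composes the two fractions exactly as written above):

#include <stdio.h>

/* speedup = 1 / ((1 - f) + f/s) for a fraction f sped up by a factor s */
static double amdahl(double fraction, double speedup) {
    return 1.0 / ((1.0 - fraction) + fraction / speedup);
}

int main(void) {
    printf("1.14a: %.2fx\n", amdahl(0.60, 2.0));   /* 1.43 */
    printf("1.14b: %.2fx\n", amdahl(0.95, 2.0));   /* 1.90 */
    /* 1.14c: 25% of the time is unaffected, 75% sees the part-(a) behavior */
    printf("1.14c: %.2fx\n", 1.0 / (0.25 + 0.75 * (0.40 + 0.60 / 2.0)));   /* 1.29 */
    return 0;
}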

Chapter 2 Solutions

Case Study 1: Exploring the Impact of Microarchitectural Techniques

2.1 The baseline performance (in cycles, per loop iteration) of the code sequence in Figure 2.35, if no new instruction's execution could be initiated until the previous instruction's execution had completed, is 40. How did I come up with that number? Each instruction requires one clock cycle of execution (a clock cycle in which that instruction, and only that instruction, is occupying the execution units; since every instruction must execute, the loop will take at least that many clock cycles). To that base number, we add the extra latency cycles. Don't forget the branch shadow cycle.

2.2 How many cycles would the loop body in the code sequence in Figure 2.35 require if the pipeline detected true data dependencies and only stalled on those, rather than blindly stalling everything just because one functional unit is busy? The answer is 25, as shown in Figure L.3. Remember, the point of the extra latency cycles is to allow an instruction to complete whatever actions it needs, in order to produce its correct output. Until that output is ready, no dependent instructions can be executed. So the first LD must stall the next instruction for three clock cycles. The MULTD produces a result for its successor, and therefore must stall 4 more clocks, and so on.

Figure L.2 Baseline performance (in cycles, per loop iteration) of the code sequence in Figure 2.35.

Loop: LD F2,0(Rx) 1 + 4

DIVD F8,F2,F0 1 + 12

MULTD F2,F6,F2 1 + 5

LD F4,0(Ry) 1 + 4

ADDD F4,F0,F4 1 + 1

ADDD F10,F8,F2 1 + 1

ADDI Rx,Rx,#8 1

ADDI Ry,Ry,#8 1

SD F4,0(Ry) 1 + 1

SUB R20,R4,Rx 1

BNZ R20,Loop 1 + 1 ____

cycles per loop iter 40
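The 40-cycle figure is just the sum of one execution cycle plus the extra latency cycles of each instruction; here is a minimal C sketch of that bookkeeping (the array values are copied from Figure L.2):

#include <stdio.h>

int main(void) {
    /* extra latency cycles for LD, DIVD, MULTD, LD, ADDD, ADDD,
     * ADDI, ADDI, SD, SUB, BNZ as listed in Figure L.2 */
    int extra[] = { 4, 12, 5, 4, 1, 1, 0, 0, 1, 0, 1 };
    int n = (int)(sizeof(extra) / sizeof(extra[0]));
    int cycles = 0;
    for (int i = 0; i < n; i++)
        cycles += 1 + extra[i];      /* 1 execution cycle + latency */
    printf("baseline cycles per loop iteration = %d\n", cycles);   /* 40 */
    return 0;
}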


Figure L.3 Number of cycles required by the loop body in the code sequence in Figure 2.35.

Loop: LD     F2,0(Rx)    1 + 4
      <stall>
      <stall>
      <stall>
      <stall>
      DIVD   F8,F2,F0    1 + 12
      MULTD  F2,F6,F2    1 + 5
      LD     F4,0(Ry)    1 + 4
      <stall due to LD latency>
      <stall due to LD latency>
      <stall due to LD latency>
      <stall due to LD latency>
      ADDD   F4,F0,F4    1 + 1
      <stall due to ADDD latency>
      <stall due to DIVD latency>
      <stall due to DIVD latency>
      <stall due to DIVD latency>
      <stall due to DIVD latency>
      ADDD   F10,F8,F2   1 + 1
      ADDI   Rx,Rx,#8    1
      ADDI   Ry,Ry,#8    1
      SD     F4,0(Ry)    1 + 1
      SUB    R20,R4,Rx   1
      BNZ    R20,Loop    1 + 1
      <stall branch delay slot>
                         ------
      cycles per loop iter  25


2.3 Consider a multiple-issue design. Suppose you have two execution pipelines, each capable of beginning execution of one instruction per cycle, and enough fetch/decode bandwidth in the front end so that it will not stall your execution. Assume results can be immediately forwarded from one execution unit to another, or to itself. Further assume that the only reason an execution pipeline would stall is to observe a true data dependency. Now how many cycles does the loop require? The answer is 22, as shown in Figure L.4. The LD goes first, as before, and the DIVD must wait for it through 4 extra latency cycles. After the DIVD comes the MULTD, which can run in the second pipe along with the DIVD, since there's no dependency between them. (Note that they both need the same input, F2, and they must both wait on F2's readiness, but there is no constraint between them.) The LD following the MULTD does not depend on the DIVD nor the MULTD, so had this been a superscalar-order-3 machine, that LD could conceivably have been executed concurrently with the DIVD and the MULTD. Since this problem posited a two-execution-pipe machine, the LD executes in the cycle following the DIVD/MULTD. The loop overhead instructions at the loop's bottom also exhibit some potential for concurrency because they do not depend on any long-latency instructions.

Execution pipe 0 Execution pipe 1

Loop: LD F2,0(Rx) ; <nop>

<stall for LD latency> ; <nop>

<stall for LD latency> ; <nop>

<stall for LD latency> ; <nop>

<stall for LD latency> ; <nop>

DIVD F8,F2,F0 ; MULTD F2,F6,F2

LD F4,0(Ry) ; <nop>

<stall for LD latency> ; <nop>

<stall for LD latency> ; <nop>

<stall for LD latency> ; <nop>

<stall for LD latency> ; <nop>

ADDD F4,F0,F4 ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

ADDD F10,F8,F2 ; ADDI Rx,Rx,#8

ADDI Ry,Ry,#8 ; SD F4,0(Ry)

SUB R20,R4,Rx ; BNZ R20,Loop

<nop> ; <stall due to BNZ>

cycles per loop iter 22

Figure L.4 Number of cycles required per loop.


2.4 Possible answers:

1. If an interrupt occurs between N and N + 1, then N + 1 must not have been allowed to write its results to any permanent architectural state. Alternatively, it might be permissible to delay the interrupt until N + 1 completes.

2. If N and N + 1 happen to target the same register or architectural state (say, memory), then allowing N to overwrite what N + 1 wrote would be wrong.

3. N might be a long floating-point op that eventually traps. N + 1 cannot be allowed to change arch state in case N is to be retried.

Long-latency ops are at highest risk of being passed by a subsequent op. The DIVD instr will complete long after the LD F4,0(Ry), for example.

2.5 Figure L.5 demonstrates one possible way to reorder the instructions to improve the performance of the code in Figure 2.35. The number of cycles that this reordered code takes is 20.

Execution pipe 0 Execution pipe 1

Loop: LD F2,0(Rx) ; LD F4,0(Ry)

<stall for LD latency> ; <stall for LD latency>

<stall for LD latency> ; <stall for LD latency>

<stall for LD latency> ; <stall for LD latency>

<stall for LD latency> ; <stall for LD latency>

DIVD F8,F2,F0 ; ADDD F4,F0,F4

MULTD F2,F6,F2 ; <stall due to ADDD latency>

<stall due to DIVD latency> ; SD F4,0(Ry)

<stall due to DIVD latency> ; <nop> #ops: 11

<stall due to DIVD latency> ; <nop> #nops: (20 × 2) – 11 = 29

<stall due to DIVD latency> ; ADDI Rx,Rx,#8

<stall due to DIVD latency> ; ADDI Ry,Ry,#8

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; SUB R20,R4,Rx

ADDD F10,F8,F2 ; BNZ R20,Loop

<nop> ; <stall due to BNZ>

cycles per loop iter 20

Figure L.5 Number of cycles taken by reordered code.


2.6 a. Fraction of all cycles, counting both pipes, wasted in the reordered code shown in Figure L.5:
   11 ops out of 2 × 20 opportunities.
   1 – 11/40 = 1 – 0.275 = 0.725

b. Results of hand-unrolling two iterations of the loop from code shown in Figure L.6:

c. Speedup = exec time w/o enhancement / exec time with enhancement
   Speedup = 20/(22/2)
   Speedup = 1.82

Execution pipe 0 Execution pipe 1

Loop: LD F2,0(Rx) ; LD F4,0(Ry)

LD F2,0(Rx) ; LD F4,0(Ry)

<stall for LD latency> ; <stall for LD latency>

<stall for LD latency> ; <stall for LD latency>

<stall for LD latency> ; <stall for LD latency>

DIVD F8,F2,F0 ; ADDD F4,F0,F4

DIVD F8,F2,F0 ; ADDD F4,F0,F4

MULTD F2,F0,F2 ; SD F4,0(Ry)

MULTD F2,F6,F2 ; SD F4,0(Ry)

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; ADDI Rx,Rx,#16

<stall due to DIVD latency> ; ADDI Ry,Ry,#16

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

<stall due to DIVD latency> ; <nop>

ADDD F10,F8,F2 ; SUB R20,R4,Rx

ADDD F10,F8,F2 ; BNZ R20,Loop

<nop> ; <stall due to BNZ>

cycles per loop iter 22

Figure L.6 Hand-unrolling two iterations of the loop from code shown in Figure L.5.



2.7 Consider the code sequence in Figure 2.36. Every time you see a destination register in the code, substitute the next available T, beginning with T9. Then update all the src (source) registers accordingly, so that true data dependencies are maintained. Show the resulting code. (Hint: See Figure 2.37.)

Loop: LD T9,0(Rx)

I0: MULTD T10,F0,T2

I1: DIVD T11,T9,T10

I2: LD T12,0(Ry)

I3: ADDD T13,F0,T12

I4: SUBD T14,T11,T13

I5: SD T14,0(Ry)

Figure L.7 Register renaming.


2.8 See Figure L.8. The rename table has arbitrary values at clock cycle N – 1. Look at the next two instructions (I0 and I1): I0 targets the F1 register, and I1 will write the F4 register. This means that in clock cycle N, the rename table will have had its entries 1 and 4 overwritten with the next available Temp register designators. I0 gets renamed first, so it gets the first T reg (9). I1 then gets renamed to T10. In clock cycle N, instructions I2 and I3 come along; I2 will overwrite F6, and I3 will write F0. This means the rename table's entry 6 gets 11 (the next available T reg), and rename table entry 0 is written to the T reg after that (12). In principle, you don't have to allocate T regs sequentially, but it's much easier in hardware if you do.
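The renaming step described above is easy to express in code; the following C sketch is ours (not from the text) and simply walks a rename table and a next-free counter in the same order as Figure L.8:

#include <stdio.h>

#define NUM_ARCH_REGS 64

static int rename_table[NUM_ARCH_REGS];   /* architectural reg -> current T reg */
static int next_t_reg = 9;                /* first free temp, as in 2.7 and 2.8 */

/* Rename one instruction "dst <- src1 op src2": sources are read through the
 * table first, then the destination claims the next free T register. */
static int rename_instr(int dst, int *src1, int *src2) {
    *src1 = rename_table[*src1];
    *src2 = rename_table[*src2];
    rename_table[dst] = next_t_reg;
    return next_t_reg++;
}

int main(void) {
    for (int i = 0; i < NUM_ARCH_REGS; i++)
        rename_table[i] = i;              /* identity mapping at cycle N - 1 */

    /* I0: SUBD F1,F2,F3 and I1: ADDD F4,F1,F2 from Figure 2.38 */
    int s1 = 2, s2 = 3;
    int t0 = rename_instr(1, &s1, &s2);
    printf("I0: F1 -> T%d\n", t0);                              /* T9       */
    s1 = 1; s2 = 2;
    int t1 = rename_instr(4, &s1, &s2);
    printf("I1: F4 -> T%d, source F1 read as T%d\n", t1, s1);   /* T10, T9  */
    return 0;
}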

2.9 See Figure L.9.

Figure L.8 Cycle-by-cycle state of the rename table for every instruction of the code in Figure 2.38.

ADD R1, R1, R1; 5 + 5 −> 10

ADD R1, R1, R1; 10 + 10 −> 20

ADD R1, R1, R1; 20 + 20 −> 40

Figure L.9 Value of R1 when the sequence has been executed.

(Figure L.8 content: at cycle N – 1 the rename table maps each architectural register to itself; in cycle N entries 1 and 4 are overwritten with T9 and T10 for I0: SUBD F1,F2,F3 and I1: ADDD F4,F1,F2; in cycle N + 1 entries 6 and 0 are overwritten with T11 and T12 for I2: MULTD F6,F4,F1 and I3: DIVD F0,F2,F6. The next available T regs advance from 9 up through 13.)


2.10 An example of an event that, in the presence of self-draining pipelines, could disrupt the pipelining and yield wrong results is shown in Figure L.10.

2.11 See Figure L.11. The convention is that an instruction does not enter the execution phase until all of its operands are ready. So the first instruction, LW R3,0(R0), marches through its first three stages (F, D, E), but the M stage that comes next requires the usual cycle plus two more for latency. Until the data from a LD is available at the execution unit, any subsequent instructions (especially that ADDI R1, R1, #1, which depends on the 2nd LW) cannot enter the E stage, and must therefore stall at the D stage.

Figure L.10 Example of an event that yields wrong results. What could go wrong with this? If an interrupt is taken between clock cycles 1 and 4, then the results of the LW at cycle 2 will end up in R1, instead of the LW at cycle 1. Bank stalls and ECC stalls will cause the same effect; pipes will drain, and the last writer wins, a classic WAW hazard. All other "intermediate" results are lost.

Figure L.11 Phases of each instruction per clock cycle for one iteration of the loop.

(Figure L.10 content: a seven-cycle issue diagram with alu0, alu1, ld/st, and br slots. The ops shown include ADDI R10, R4, #1; ADDI R11, R3, #2; ADDI R2, R2, #16; ADDI R20, R0, #2; SUB R4, R3, R2; LW R4, 0(R0); LW R5, 8(R1); SW R9, 8(R8); SW R7, 0(R6); and BNZ R4, Loop, with several ops issued twice as the self-draining pipes refill.)

(Figure L.11 content: the F/D/E/M/W stages, cycle by cycle, for one iteration of the loop below; the stage-by-stage grid is not reproduced here.)

Loop: LW   R3,0(R0)
      LW   R1,0(R3)
      ADDI R1,R1,#1
      SUB  R4,R3,R2
      SW   R1,0(R3)
      BNZ  R4, Loop
      LW   R3,0(R0)

(2.11a) 4 cycles lost to branch overhead
(2.11b) 2 cycles lost with static predictor
(2.11c) No cycles lost with correct dynamic prediction


a. 4 cycles lost to branch overhead. Without bypassing, the results of the SUB instruction are not available until the SUB's W stage. That tacks on an extra 4 clock cycles at the end of the loop, because the next loop's LW R1 can't begin until the branch has completed.

b. 2 cycles lost with a static predictor. A static branch predictor may have a heuristic like "if branch target is a negative offset, assume it's a loop edge, and loops are usually taken branches." But we still had to fetch and decode the branch to see that, so we still lose 2 clock cycles here.

c. No cycles lost with correct dynamic prediction. A dynamic branch predictor remembers that when the branch instruction was fetched in the past, it eventually turned out to be a branch, and this branch was taken. So a "predicted taken" will occur in the same cycle as the branch is fetched, and the next fetch after that will be to the presumed target. If correct, we've saved all of the latency cycles seen in 2.11 (a) and 2.11 (b). If not, we have some cleaning up to do.

2.12 a. See Figure L.12.

Figure L.12 Instructions in code where register renaming improves performance.

LD    F2,0(Rx)    ; reg renaming doesn't really help here, due to
DIVD  F8,F2,F0    ; true data dependencies on F8 and F2
MULTD F2,F8,F2
LD    F4,0(Ry)    ; this LD is independent of the previous 3
                  ; instrs and can be performed earlier than
                  ; pgm order. It feeds the next ADDD, and ADDD
                  ; feeds the SD below. But there's a true data
                  ; dependency chain through all, so no benefit
ADDD  F4,F0,F4
ADDD  F10,F8,F2   ; This ADDD still has to wait for DIVD latency,
                  ; no matter what you call their rendezvous reg
ADDI  Rx,Rx,#8    ; rename for next loop iteration
ADDI  Ry,Ry,#8    ; rename for next loop iteration
SD    F4,0(Ry)    ; This SD can start when the ADDD's latency has
                  ; transpired. With reg renaming, doesn't have
                  ; to wait until the LD of (a different) F4 has
                  ; completed.
SUB   R20,R4,Rx
BNZ   R20,Loop


b. See Figure L.13. The number of clock cycles taken by the code sequence is 25.

Figure L.13 Number of clock cycles taken by the code sequence.

(Figure L.13 content: dispatch diagram over clock cycles 1 through 25. The two ALUs dispatch ADDI Rx,Rx,#8; SUB R20,R4,Rx; MULTD F2,F8,F2; DIVD F8,F2,F0; BNZ R20,Loop and its branch shadow; ADDI Ry,Ry,#8; ADDD F4,F0,F4; and ADDD F10,F8,F2, while the ld/st unit handles LD F2,0(Rx); LD F4,0(Ry); and SD F4,0(Ry). The LD, DIVD, MULTD, and ADDD latencies account for the gaps between dispatches. Note: these ADDIs are generating Rx and Ry for the next loop iteration, not this one.)


c. See Figures L.14 and L.15. The bold instructions are those instructions that are present in the RS, and ready for dispatch. Think of this exercise from the Reservation Station's point of view: at any given clock cycle, it can only "see" the instructions that were previously written into it, that have not already dispatched. From that pool, the RS's job is to identify and dispatch the two eligible instructions that will most boost machine performance.

Figure L.14 Candidates for dispatch.

Figure L.15 Number of clock cycles required.

(Figure L.14 content: the 11-instruction loop body, LD F2,0(Rx) through BNZ R20,Loop, repeated once per clock cycle 1 through 6, with the instructions currently in the RS and eligible for dispatch shown in bold; the first 2 instructions appear in the RS in the first cycle.)

(Figure L.15 content: the resulting dispatch schedule over clock cycles 1 through 25. The ALUs dispatch ADDI Rx,Rx,#8; SUB R20,R4,Rx; MULTD F2,F8,F2; BNZ R20,Loop and its branch shadow; ADDI Ry,Ry,#8; ADDD F4,F0,F4; DIVD F8,F2,F0; and ADDD F10,F8,F2, while the ld/st unit handles LD F2,0(Rx); LD F4,0(Ry); and SD F4,0(Ry). 25 clock cycles total.)


d. See Figure L.16.

1. Another ALU: 0% improvement

2. Another LD/ST unit: 0% improvement

3. Full bypassing: critical path is LD -> DIVD -> MULTD -> ADDD. Bypassing would save 1 cycle from the latency of each, so 4 cycles total

4. Cutting longest latency in half: the divider is longest at 12 cycles. This would save 6 cycles total.

Figure L.16 Speedup is (execution time without enhancement)/(execution time with enhancement) = 25/(25 – 6) = 1.316.

(Figure L.16 content: the same dispatch diagram as Figure L.15, over clock cycles 1 through 25, showing where the LD, DIVD, MULTD, and ADDD dispatches fall; it is the baseline against which the 6-cycle savings above is measured.)


e. See Figure L.17.

Figure L.17 Number of clock cycles required to do two loops' worth of work. Critical path is LD -> DIVD -> MULTD -> ADDD. If the RS schedules the 2nd loop's critical LD in cycle 2, then loop 2's critical dependency chain will be the same length as loop 1's. Since we're not functional-unit-limited for this code, only one extra clock cycle is needed.

(Figure L.17 content: the two-iteration dispatch diagram over clock cycles 1 through 26, with two copies of LD F2,0(Rx); DIVD F8,F2,F0; MULTD F2,F8,F2; ADDD F10,F8,F2; and the loop-overhead instructions spread across the ALUs and the ld/st unit. 26 clock cycles total.)

Case Study 2: Modeling a Branch Predictor

2.13 For this exercise, please download the file "branch_predict.zip" from the instructor Web site. The archive includes a php_out text file, which is the expected output of the C program the reader is asked to write for this exercise.
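The exercise files are not reproduced here, so purely as an illustrative sketch (the table size, trace format, and all names are our assumptions, not the actual assignment), this is the core of the kind of 2-bit saturating-counter predictor such a model usually contains:

#include <stdint.h>
#include <stdio.h>

#define TABLE_BITS 10
static uint8_t counters[1 << TABLE_BITS];   /* 0-1 predict not taken, 2-3 predict taken */

static int predict(uint32_t pc) {
    return counters[pc & ((1u << TABLE_BITS) - 1)] >= 2;
}

static void update(uint32_t pc, int taken) {
    uint8_t *c = &counters[pc & ((1u << TABLE_BITS) - 1)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void) {
    /* Tiny made-up trace for one loop branch: taken four times, then not taken. */
    uint32_t pc = 0x400;
    int outcomes[] = { 1, 1, 1, 1, 0 };
    int n = (int)(sizeof(outcomes) / sizeof(outcomes[0]));
    int correct = 0;
    for (int i = 0; i < n; i++) {
        correct += (predict(pc) == outcomes[i]);
        update(pc, outcomes[i]);
    }
    printf("correct predictions: %d of %d\n", correct, n);
    return 0;
}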

Chapter 3 Solutions

Case Study: Dependences and Instruction-Level Parallelism

3.1 a. Figure L.18 shows the dependence graph for the C code in Figure 3.14. Each node in Figure L.18 corresponds to a line of C statement in Figure 3.14. Note that each node 6 in Figure L.18 starts an iteration of the for loop in Figure 3.14.

Since we are assuming that each line in Figure 3.14 corresponds to one machine instruction, Figure L.18 can be viewed as the instruction-level dependence graph. A true data dependence exists between line 6 and line 9. Line 6 increments the value of i, and line 9 uses the value of i to index into the element array. This is shown as an arc from node 6 to node 9 in Figure L.18. Line 9 of Figure 3.14 calculates the hash_index value that is used by lines 10 and 11 to index into the bucket array, causing true dependences from line 9 to line 10 and line 11. This is reflected by arcs going from node 9 to node 10 and node 11 in Figure L.18. Line 11 in Figure 3.14 initializes ptrCurr, which is used by line 12. This is reflected as a true dependence arc from node 11 to node 12 in Figure L.18.

Figure L.18 Dynamic dependence graph for six insertions under the ideal case.

(Figure L.18 content: six copies of the per-iteration dependence subgraph over nodes 6, 9, 10, 11, 12, 17, and 18, one per insertion, chained from one iteration to the next only through node 6.)


Note that node 15 and node 16 are not reflected in Figure L.18. Recall that all buckets are initially empty and each element is being inserted into a different bucket. Therefore, the while loop body is never entered in the ideal case.

Line 12 of Figure 3.14 enforces a control dependence over line 17 and line 18. The execution of line 17 and line 18 hinges upon the test that skips the while loop body. This is shown as control dependence arcs from node 12 to node 17 and node 18.

Note that we have omitted some data dependence arcs from Figure L.18. For example, there should be a true dependence arc from node 10 to node 17 and node 18. This is because line 10 of Figure 3.14 initializes ptrUpdate, which is used by lines 17 and 18. These dependences, however, do not impose any more constraints than what is already imposed by the control dependence arcs from node 12 to node 17 and node 18. Therefore, we omitted these dependence arcs from Figure L.18 in favor of simplicity.

In the ideal case, all for loop iterations are independent of each other once the for loop header (node 6) generates the i value needed for the iteration. Node 6 of one iteration generates the i value needed by the next iteration. This is reflected by the dependence arc going from node 6 of one iteration to node 6 of the next iteration. There are no other dependence arcs going from any node in a for loop iteration to subsequent iterations. This is because each for loop iteration is working on a different bucket. The changes made by line 18 (*ptrUpdate=) to the pointers in each bucket will not affect the insertion of data into other buckets. This allows for a great deal of parallelism.

Recall that we assume that each statement in Figure 3.14 corresponds to one machine instruction and takes 1 clock cycle to execute. This makes the latency of nodes in Figure L.18 1 cycle each. Therefore, each horizontal row of Figure L.18 represents the instructions that are ready to execute at a clock cycle. The width of the graph at any given point corresponds to the amount of instruction-level parallelism available during that clock cycle.

b. As shown in Figure L.18, each iteration of the outer for loop has 7 instructions. It iterates 512 times. Thus, 3584 instructions are executed.

The for loop takes 4 cycles to enter steady state. After that, one iteration is completed every clock cycle. Thus the loop takes 4 + 512 = 516 cycles to execute.

c. 3584 instructions are executed in 516 cycles. The average level of ILP available is 3584/516 = 6.946 instructions per cycle.

d. Figure L.19 shows the unrolled code with transformations. Note that the cross-iteration dependence on the i value calculation can easily be removed by unrolling the loop. By unrolling the loop once and changing the array index usage of the unrolled iteration to element[i+1], one can eliminate the sequential increment of i from even to odd iterations. Note that the two resulting parts of the for loop body after the unrolling transformation are completely independent of each other. This doubles the amount of parallelism


available. The amount of parallelism is proportional to the number of unrolls performed. Basically, with the ideal case, a compiler can easily transform the code to expose a very large amount of parallelism.

The for loop will now only execute Line 6 once for every two original iterations (see the transformed code). Each unrolled loop iteration executes 13 instructions and there are 256 iterations. Therefore, a total of 3328 instructions are executed.

It still takes 4 cycles to enter steady state and each unrolled iteration will complete every clock cycle starting with the fifth cycle. Therefore, the loop will require 4 + 256 = 260 cycles to execute.

Figure L.19 Hash table code example.

6   for (i = 0; i < N_ELEMENTS; i+=2) {
7       Element *ptrCurr, **ptrUpdate;
8       int hash_index;
        /* Find the location at which the new element is to be inserted. */
9       hash_index = element[i].value & 1023;
10      ptrUpdate = &bucket[hash_index];
11      ptrCurr = bucket[hash_index];
        /* Find the place in the chain to insert the new element. */
12      while (ptrCurr &&
13             ptrCurr->value <= element[i].value)
14      {
15          ptrUpdate = &ptrCurr->next;
16          ptrCurr = ptrCurr->next;
        }
        /* Update pointers to insert the new element into the chain. */
17      element[i].next = *ptrUpdate;
18      *ptrUpdate = &element[i];

9'      hash_index = element[i+1].value & 1023;
10'     ptrUpdate = &bucket[hash_index];
11'     ptrCurr = bucket[hash_index];
12      while (ptrCurr &&
13             ptrCurr->value <= element[i+1].value)
14      {
15          ptrUpdate = &ptrCurr->next;
16          ptrCurr = ptrCurr->next;
        }
        /* Update pointers to insert the new element into the chain. */
17      element[i+1].next = *ptrUpdate;
18      *ptrUpdate = &element[i+1];
    }


The average amount of ILP is 3328/260 = 12.8. Note that the unrolling transformation does not quite double the ILP. This is not necessarily a bad thing. It improves efficiency by eliminating half of the executions of Line 6. In a more resource-limited processor, this improved efficiency can help improve performance.

e. Figure L.20 shows the time frame in which each of these variables needs to occupy a register. The reader should be able to tell that some variables can occupy a register that is no longer needed by another variable. For example, the hash_index of iteration 2 can occupy the same register occupied by the hash_index of iteration 1. Therefore, the overlapped execution of the next iteration uses only 2 additional registers.

Similarly, the third and the fourth iterations each require another register. Each additional iteration requires another register. By the time the fifth iteration starts execution, it does not add any more register usage since the register for the i value in the first iteration is no longer needed. As long as the hardware has no fewer than 8 registers, the parallelism shown in Figure 3.15 can be fully realized. However, if the hardware provides fewer than 8 registers, one or more of the iterations will need to be delayed until some of the registers are freed up. This would result in a reduced amount of parallelism.

Figure L.20 Register lifetime graph for the ideal case.

(Figure L.20 content: register lifetime bars for i/element[], hash_index, ptrUpdate, and ptrCurr across the first few overlapped iterations, each lifetime spanning nodes 6 through 18 of its own iteration.)


f. See Figure L.21. Each iteration of the for loop has 7 instructions. In a processor with an issue rate of 2 instructions per cycle, it takes about 2 cycles for the processor to issue one iteration. Thus, the earliest time the instructions in the second iteration can even be considered for execution is 3 or 4 cycles after the previous iteration.

Figure L.22 shows the instruction issue timing of the 2-issue processor. Note that the limited issue rate causes iterations 2 and 3 to be delayed by 2 and 3 clock cycles. It causes iteration 4, however, to be delayed by 2 clock cycles. This is a repeating pattern.

Cycle  Instructions issued
  1    6   9
  2    10  11
  3    12  17
  4    18  6
  5    9   10
  6    11  12
  7    17  18
  8    6   9

Figure L.21 Instruction issue timing.

Figure L.22 Instruction issue timing.

(Figure L.22 content: the dynamic dependence graph of Figure L.18 redrawn with each iteration's nodes 6, 9, 10, 11, 12, 17, and 18 shifted down to the cycle in which the 2-issue processor can issue them.)


The reduction of parallelism due to the limited instruction issue rate can be calculated based on the number of clock cycles needed to execute the for loop. Since the number of instructions in the for loop remains the same, any increase in execution cycles results in decreased parallelism. It takes 5 cycles for the first iteration to complete. After that, one iteration completes at cycles 8, 12, 15, 19, 22, 26, . . . . Thus the for loop completes in 5 + 3 × 512 + 4 × 511 = 5 + 1536 + 2044 = 3585 cycles. When compared to part (b), limiting the issue rate to 2 instructions per cycle reduces the amount of parallelism to less than one third of the original level!

g. In order to achieve the level of parallelism shown in Figure 3.15, we must assume that the instruction window is large enough to hold all instructions that have been executed from the previous iterations when a new iteration begins. This means that the processor must hold instructions 17, 18 from the first iteration; 12, 17, 18 from the second iteration; 10, 11, 12, and 18 of the third iteration; as well as instructions 9, 10, 11, 12, 17, 18 of the fourth iteration when instruction 6 of the fifth iteration is considered for execution. If the instruction window is not large enough, the processor would be stalled before instruction 6 of the fifth iteration can be considered for execution. This would increase the number of clocks required to execute the for loop, thus reducing the parallelism. The minimal instruction window size for the maximal ILP is thus 16 instructions. Note that this is a small number. Part of the reason is that we picked a scenario where there is no dependence across for loop iterations and there are no other realistic resource constraints. The reader should verify that with more realistic execution constraints, much larger instruction windows will be needed in order to support the available ILP.

3.2 a. Refer to Figures 3.14 and 3.15. The insertion of 1024 is made into the same bucket as element 0. The execution of lines 17 and 18 for element 0 can affect the execution of lines 12, 13, 15, and 16 for element 1024. That is, the new element 0 inserted into bucket 0 will affect the execution of the linked list traversal when inserting element 1024 into the same bucket. In Figure L.23, we show a dependence arc going from node 18 of the first iteration to node 11 of the third iteration. These dependences did not exist in the ideal case shown in Figure L.18. We show only the dependence from node 18 of one iteration to node 11 of a subsequent iteration for simplicity. There are similar dependence arcs to nodes 12, 13, 15, and 16, but they do not add more constraints than the ones we are showing. The reader is nevertheless encouraged to draw all the remaining dependence arcs for a complete study.

b. The number of instructions executed for each for loop iteration increases as the number of data elements hashed into bucket 0 increases. Iteration 1 has 7 instructions. Iteration 2 has 11 instructions. The reader should be able to verify that iteration 3 of the for loop has 15 instructions. Each successive iteration will have 4 more instructions than the previous one. This will continue to iteration 512. The total number of instructions can be expressed as the following series:


= 7 + 11 + 15 + . . . + (7 + (N – 1) × 4)
= N/2 × (14 + (N – 1) × 4)
= N/2 × (4 × N + 10)
= 2N^2 + 5N

where N is the number of pairs of data elements.

In our case, N is 512, so the total number of dynamic instructions executed is 2 × 2^18 + 5 × 2^9 = 524,288 + 2,560 = 526,848. Note that there are many more instructions executed when the data elements are hashed into the same bucket, rather than evenly spread over the 1024 buckets. This is why it is important to design the hash functions so that the data elements are hashed evenly into all the buckets.
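The closed form above can be checked by direct summation; a minimal C sketch (our own code, not part of the solution) adds the per-iteration counts 7, 11, 15, . . . for 512 iterations:

#include <stdio.h>

int main(void) {
    long total = 0;
    int iterations = 512;
    for (int k = 0; k < iterations; k++)
        total += 7 + 4L * k;       /* iteration k executes 7 + 4k instructions */
    printf("total dynamic instructions = %ld\n", total);   /* 526,848 */
    return 0;
}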

c. The number of clock cycles required to execute all the instructions can be derived by examining the critical path in Figure L.23. The critical path goes

Figure L.23 Dynamic dependence graph of the hash table code when the data element values are 0, 1, 1024, 1025, 2048, 2049, 3072, 3073, . . . .

(Figure L.23 content: the dynamic dependence graph for the first four insertions. The first two iterations contain only nodes 6, 9, 10, 11, 12, 17, and 18, while later iterations that hash into an occupied bucket also contain the while loop nodes 13, 15, and 16, with cross-iteration arcs from node 18 into the traversal of a later iteration.)


from node 6 to nodes 9, 11, 12, 18 of iteration 1, and then to nodes 11, 12, 13, 16, 12, 18 of iteration 2. Note that the critical path length contributed by each iteration forms the following series:

5, 3 + 3, 3 + 2 × 3, . . . , 3 + (N – 1) × 3

The total length of the critical path is the sum of the contributions:

= 5 + (3 + 3) + (3 + 2 × 3) + (3 + 3 × 3) + . . . + (3 + (N – 1) × 3)
= 6 + ((N – 2)/2) × (6 + (N + 1) × 3)

where N is the number of data elements.

In our case, N is 512, so the total number of clock cycles for executing all the instructions in the dynamic dependence graph is

11 + 255 × (6 + 513 × 3)
= 11 + 255 × 1545
= 11 + 393,975
= 393,986

d. The amount of instruction-level parallelism available is the total number of instructions executed divided by the critical path length. The answer is

526,848/393,986 = 1.33

Note that the level of instruction-level parallelism has been reduced from 6.946 in the ideal case to 1.33. This is due to the additional dependences you observed in part (a). There is an interesting double penalty when elements are hashed into the same bucket: the total number of instructions executed increases and the amount of instruction-level parallelism decreases. In a processor, these double penalties will likely interact and reduce the performance of the hash table code much more than the programmer expected. This demonstrates a phenomenon in parallel processing machines. The algorithm designer often needs to pay attention not only to the effect of algorithms on the total number of instructions executed but also to their effect on the parallelism available to the hardware.

e. What we have in part (b) is indeed the worst case. In the worst case, all new data elements are entered into the same bucket and they come in ascending order. One such sequence would be 0, 1024, 2048, 3072, 4096, . . . The level of serialization depends on the accuracy of the memory disambiguation mechanism. In the worst case, the linked-list traversal of an iteration will not be able to start until the linked-list updates of all previous iterations are complete. Such serialization will essentially eliminate any overlap between the while loop portions across for loop iterations. Note also that this sequence will cause the largest number of instructions to be executed.

f. With perfect memory disambiguation, node 18 of one for loop iteration will only affect the execution of node 12 of the last while loop iteration of the next for loop iteration. This can greatly increase the overlap of successive for loop iterations. Also, the number of critical path clock cycles contributed


by each for loop iteration becomes constant: 6 cycles (node 11 → 12 → 13 → 16 → 12 → 18). This makes the total clock cycles, as determined by the critical path, equal to

5 + 6 × 1023 = 6143 cycles

Speculation allows much more substantial overlap of loop iterations and confers a degree of immunity to increased memory latency, as subsequent operations can be initiated while previous stores remain to be resolved. In cases such as this, where the addresses of stores depend on the traversal of large data structures, this can have a substantial effect. Speculation always brings with it the cost of maintaining speculative information and the risk of misspeculation (with the cost of recovery). These are factors that should not be ignored in selecting an appropriate speculation strategy.

g. The total number of instructions executed is

7 + 11 + . . . + (7 + 4 × (N – 1))
= (N/2) × (14 + 4 × (N – 1))
= N × (7 + 2 × (N – 1))

where N is the number of data elements.

In our case, N is 512. The total number of instructions is 1,051,136.

h. The level of instruction-level parallelism is

1,051,136/6143 = 171.1

Beyond perfect memory disambiguation, the keys to achieving such a high level of instruction-level parallelism are (1) a large instruction window and (2) perfect branch prediction.

i. The key to achieving the level of instruction-level parallelism in part (h) is to overlap the while loop execution of one for loop iteration with those of many subsequent for loop iterations. This requires that the instruction windows be large enough to contain instructions from many for loop iterations. The number of instructions in each for loop iteration increases very quickly as the number of elements increases. For example, by the time the last element is inserted into the hash table, the number of instructions in the for loop iteration will reach

7 + 4 × (N – 1) = 2051, where N is 512

Since each iteration of the for loop provides only a small amount of parallelism, it is natural to conclude that many for loop iterations must overlap in order to achieve instruction-level parallelism of 171.1. Any instruction window less than 2051 will likely cut down the instruction-level parallelism to less than, say, 10 instructions per cycle.

j. The exit branch of the while loop will likely cause branch prediction misses since the number of iterations taken by the while loop changes with every for loop iteration. Each such branch prediction miss disrupts the overlapped


execution across for loop iterations. This means that the execution must reenter the steady state after the branch prediction miss is handled. It will introduce at least three extra cycles into total execution time, thus reducing the average level of ILP available. Assuming that the mispredictions happen in all for loop iterations, they will essentially bring the level of instruction-level parallelism back down to that of a single for loop iteration, which will be somewhere around 1.5.

Aggressive but inaccurate branch prediction can lead to the initiation of many instruction executions that will be squashed when the misprediction is detected. This can reduce the efficiency of execution, which has implications for power consumed and for total performance in a multithreaded environment. Sometimes the off-path instructions can initiate useful memory subsystem operations early, resulting in a small performance improvement.

k. A static data dependence graph is constructed with nodes representing static instructions and arcs representing control flows and data dependences. Figure L.24 shows a simplified static data dependence graph for the hash table code. The heavy arcs represent the control flows that correspond to the iteration control of the while loop and the for loop. These loop control flows allow the compiler to represent a very large number of dynamic instructions with a small number of static instructions.

Figure L.24 (Simplified) static dependence graph of the hash table code with control flow and worst-case dependence arcs.

(Figure L.24 content: static dependence graph with nodes 6, 9, 10, 11, 12, 13, 15, 16, 17, and 18, heavy back arcs for the for and while loop control, and a worst-case memory dependence arc from node 18 back to node 11 marked with a dependence distance of 1.)


The worst-case dependences from one iteration to a future iteration are expressed as dependence arcs marked with "dependence distance," a value indicating that the dependences go across one or more iterations. For example, there is a dependence from node 6 of one iteration of the for loop to itself in the next iteration. This is expressed with a "1" value on the dependence arc to indicate that the dependence goes from one iteration to the next.

The memory dependence from node 18 of one iteration of the for loop to node 11 of a future iteration is shown as an arc of dependence distance 1. This is conservative since the dependence may not be on the immediately next iteration. This, however, gives the worst-case constraint so that the compiler will constrain its actions to be on the safe side.

The static dependence graph shown in Figure L.24 is simplified in that it does not contain all the dependence arcs. Those dependence arcs that do not impose any further scheduling constraints than the ones shown are omitted for clarity. One such example is the arc from node 10 to nodes 17 and 18. The dependence does not impose any additional scheduling constraints beyond what the arcs 11 → 12 → 13 → 16 do. They are shown as dashed lines for illustration purposes. The reader should compare the static dependence graph in Figure L.24 with the worst-case dynamic dependence graph since they should capture the same constraints.

Chapter 4 Solutions

Case Study 1: Simple, Bus-Based Multiprocessor

4.1 a. P15.B3: (S, 118, 00, 18), read returns 18

b. P15.B0: (M, 100, 00, 48)

c. P15.B3: (M, 118, 00, 80), P1.B3: (I, 118, 00, 18)

d. P15.B1: (M, 108, 00, 80), P0.B1: (I, 108, 00, 08)

e. P0.B2: (S, 110, 00, 30), P15.B2: (S, 110, 00, 30), M[110]: (00, 30), read returns 30

f. P1.B1: (S, 128, 00, 68), P15.B1: (S, 128, 00, 68), M[128]: (00, 30), read returns 68

g. P0.B2: (I, 110, 00, 30), P15.B2: (M, 110, 00, 40)

4.2 a. P15: read 120    Read hit
   P15: read 128    Read miss, satisfied by P1's cache
   P15: read 130    Read miss, satisfied by memory
   Implementation 1: 0 + 70 + 10 + 100 = 180 stall cycles
   Implementation 2: 0 + 130 + 100 = 230 stall cycles



b. P15: read 118           Read miss, satisfied by memory
   P15: write 110 <-- 48   Write miss, satisfied by P0's cache
   P15: write 130 <-- 78   Write miss, satisfied by memory
   Implementation 1: 100 + 70 + 10 + 100 = 280 stall cycles
   Implementation 2: 100 + 130 + 10 + 100 = 340 stall cycles

c. P1: read 110    Read miss, satisfied by P0's cache
   P1: read 108    Read miss, satisfied by P0's cache, writeback 128
   P1: read 130    Read miss, satisfied by memory
   Implementation 1: 70 + 10 + 70 + 10 + 10 + 100 = 270 stall cycles
   Implementation 2: 130 + 10 + 130 + 10 + 10 + 100 = 390 stall cycles

d. P0: read 100           Read miss, satisfied by memory
   P0: write 108 <-- 48   Write miss, sends invalidate
   P0: write 128 <-- 78   Write miss, satisfied by P1's cache
   Implementation 1: 100 + 15 + 70 + 10 = 195 stall cycles
   Implementation 2: 100 + 15 + 130 + 10 = 255 stall cycles

4.3 See Figure L.25.

4.4 a. P1: read 128    Read hit
   P15: read 128   Read miss, satisfied in P1's cache
   P0: read 128    Read miss, MSI satisfies in memory, MOSI satisfies in P1's cache
   MSI: 0 + 70 + 10 + 100 = 180 stall cycles
   MOSI: 0 + 70 + 10 + 70 + 10 = 160 stall cycles

Figure L.25 Protocol diagram.

(Figure L.25 content: MOSI protocol state diagram with states Modified, Owned, Shared, and Invalid. Transitions are labeled with CPU read/write hits, CPU reads and writes that place read misses, write misses, or invalidates on the bus, and bus-observed write misses, read misses, and invalidates that force writebacks or downgrades, aborting the memory access when a cache supplies the block.)
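As a companion to the state diagram, here is a reduced sketch in C of how a single cache line reacts to its own processor's reads and writes (this covers only the M, S, and I states and only CPU-side events; the enum and function names are ours, not part of the figure):

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state;
typedef enum { CPU_READ, CPU_WRITE } cpu_event;

static line_state cpu_transition(line_state s, cpu_event e) {
    switch (s) {
    case INVALID:
        /* miss: place a read miss or write miss on the bus */
        return (e == CPU_READ) ? SHARED : MODIFIED;
    case SHARED:
        /* read hit stays Shared; a write places an invalidate on the bus */
        return (e == CPU_READ) ? SHARED : MODIFIED;
    case MODIFIED:
        return MODIFIED;               /* read and write hits stay Modified */
    }
    return s;
}

int main(void) {
    line_state s = INVALID;
    s = cpu_transition(s, CPU_READ);    /* -> SHARED   */
    s = cpu_transition(s, CPU_WRITE);   /* -> MODIFIED */
    printf("final state: %d\n", s);
    return 0;
}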


b. P0: read 118     Read miss, satisfied in memory
   P15: read 118    Read miss, satisfied in memory
   P1: read 118     Read hit
   Both protocols: 100 + 100 + 0 = 200 stall cycles

c. P0: write 118 <-- 80    Write miss, invalidates P1
   P15: read 118           Read miss, satisfied by P0's cache
   P0: read 118            Read hit
   Both protocols: 100 + 70 + 10 + 0 = 180 stall cycles

d. P15: write 110 <-- 90   Write miss, satisfied in P0's cache, invalidates P1
   P1: read 110            Read miss, satisfied in P15's cache
   P0: write 110 <-- 98    Write miss, MSI satisfies in memory, MOSI satisfies in P1's cache
   MSI: 70 + 10 + 70 + 10 + 100 = 260 stall cycles
   MOSI: 70 + 10 + 70 + 10 + 70 + 10 = 240 stall cycles

4.5 See Figure L.26.

This version of the MESI protocol does not supply data in the E state. Some versions do.

Figure L.26 Diagram for a MESI protocol.

(Figure L.26 content: MESI protocol state diagram with states Modified, Exclusive, Shared, and Invalid. A CPU read miss with other sharers goes to Shared, a read miss with no sharers goes to Exclusive, and a CPU write hit in Exclusive moves silently to Modified; writes from Shared or Invalid place an invalidate or write miss on the bus, and bus-observed write misses and invalidates force writebacks or downgrades, aborting the memory access when a cache supplies the block.)


4.6 a. p0: read 100 Read miss, satisfied in memory, no sharers MSI: S, MESI: E

p0: write 100 <-- 40 MSI: send invalidate, MESI: silent transition from E to M

MSI: 100 + 15 = 115 stall cyclesMESI: 100 + 0 = 100 stall cycles

b. p0: read 120 Read miss, satisfied in memory, sharers both to Sp0: write 120 <-- 60 Both send invalidatesBoth: 100 + 15 = 115 stall cycles

c. p0: read 100   Read miss, satisfied in memory, no sharers; MSI: S, MESI: E
p0: read 120   Read miss, memory, silently replace 120 from S or E
Both: 100 + 100 = 200 stall cycles, silent replacement from E

d. p0: read 100           Read miss, satisfied in memory, no sharers; MSI: S, MESI: E
p1: write 100 <-- 60   Write miss, satisfied in memory regardless of protocol
Both: 100 + 100 = 200 stall cycles; don't supply data in E state (some protocols do)

e. p0: read 100           Read miss, satisfied in memory, no sharers; MSI: S, MESI: E
p0: write 100 <-- 60   MSI: send invalidate, MESI: silent transition from E to M
p1: write 100 <-- 40   Write miss, P0's cache, writeback data to memory
MSI: 100 + 15 + 70 + 10 = 195 stall cycles
MESI: 100 + 0 + 70 + 10 = 180 stall cycles
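A minimal sketch of the same accounting for 4.6, under the case-study costs quoted above (100-cycle memory miss, 70 + 10 cycle cache-to-cache miss with writeback, 15-cycle invalidate/upgrade); part (e) is shown:

    MEM, CACHE, UPGRADE = 100, 70 + 10, 15
    # Part (e): read miss to memory, then a write that upgrades (MSI) or makes a
    # silent E -> M transition (MESI), then P1's write miss serviced by P0's cache.
    msi  = MEM + UPGRADE + CACHE
    mesi = MEM + 0 + CACHE
    print(msi, mesi)   # 195 180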

4.7 a. Assume the processors acquire the lock in order. P0 will acquire it first, incurring 100 stall cycles to retrieve the block from memory. P1 and P15 will stall until P0's critical section ends (pingponging the block back and forth) 500 cycles later. P0 will stall for (about) 130 cycles while it fetches the block to invalidate it, then P1 takes 130 cycles to acquire it. P1's critical section is 500 cycles, plus 130 to handle the write miss at release. Finally, P15 grabs the block for a final 130 cycles of stall. So, P0 stalls for 100 cycles to acquire, 10 to give it to P1, 130 to release the lock, and a final 10 to hand it off to P1 for a total of 250 stall cycles. P1 essentially stalls until P0 releases the lock, which will be 100 + 500 + 10 + 130 = 840 cycles, plus 130 to get the lock, 10 to give it to P15, 130 to get it back to release the lock, and a final 10 to hand it back to P15. This is a total of 1120 stall cycles. P15 stalls until P1 hands off the released lock, which will be 1120 + 130 + 10 + 500 + 130 = 1890 cycles. Finally, P15 gets the lock 130 cycles later, so it stalls a total of 2020 cycles.

b. Test-and-test-and-set will have many fewer stall cycles than test-and-set because it spends most of the critical section sitting in a spin loop (which, while useless, is not defined as a stall cycle). Using the analysis below for the bus transactions, the stall cycles will be 3 read memory misses (300),


1 upgrade (15) and 1 write miss to a cache (130 + 10) and 1 write miss to memory (100), 1 read cache miss to cache (130 + 10), one write miss to memory (100), 1 read miss to cache and 1 read miss to memory (130 + 10 + 100), followed by an upgrade (15) and a write miss to cache (130 + 10), and finally, a write miss to cache (130 + 10) followed by a read miss to cache (130 + 10) and an upgrade (15). So approximately 1485 cycles total.

c. Approximately 10 bus transactions. The first processor to win arbitration for the bus gets the block on its first try (1); the other two ping-pong the block back and forth during the critical section. Because the latency is 130 cycles, this will occur about 4 times (4). The first processor does a write to release the lock, causing another bus transaction (1), and the second processor does a transaction to perform its test and set (1). The last processor gets the block (1) and spins on it until the second processor releases it (1). Finally the last processor grabs the block (1).

d. Approximately 15 bus transactions. Assume processors acquire the lock in order. All three processors do a test, causing a read miss, then a test and set, causing the first processor to upgrade and the other two to write miss (6). The losers sit in the test loop, and one of them needs to get back a shared block first (1). When the first processor releases the lock, it takes a write miss (1), then the two losers take read misses (2). Both have their test succeed, so the new winner does an upgrade and the new loser takes a write miss (2). The loser spins on an exclusive block until the winner releases the lock (1). The loser first tests the block (1), then test-and-sets it, which requires an upgrade (1).

Case Study 2: A Snooping Protocol for a Switched Network

4.8 a. P15: read 118    Read miss, service in memory
P15: I --> ISAD --> ISD --> S

b. P15: write 100 <-- 48   Write miss, service in memory
P15: I --> IMAD --> IMD --> M

c. P15: write 118 <-- 80   Write miss, service in memory, invalidate P1
P15: I --> IMAD --> IMD --> M
P1: S --> I

d. P15: write 108 <-- 80   Write to S block, send invalidate, invalidate P0's cache
P15: S --> SMA --> M
P0: S --> I


e. P15: read 110   Read miss, service in P0's cache, downgrade P0
P15: I --> ISAD --> ISD --> S
P0: M --> S

f. P15: read 128   Replace block in S, read miss, serviced in P1's cache
P0.108: S --> I
P0.128: I --> IMAD --> IMD --> M
P1: M --> I

g. P15: write 110 <-- 40   Write miss, serviced in P0's cache
P15: I --> IMAD --> IMD --> M
P0: M --> I

4.9 a. P0: read 118            Read miss, service in memory
P1: write 118 <-- 40     Write to S block, invalidate P1
P0: I --> ISAD --> ISD --> S --> I (note that P0 stalls the address network until the data arrives, since it cannot process the transition to I when P1's Inv arrives)
P1: S --> SMA --> M

b. P0: read 128            Read miss, service in P1's cache
P1: write 128 <-- 60    Write to S block, invalidate P1
P0: I --> ISAD --> ISD --> S --> I (note that P0 stalls the address network until the data arrives, since it cannot process the transition to I when P1's GetM arrives)
P1: M --> S --> SMA --> M

c. P0: read 120   Read miss, service in memory
P1: read 100   Read miss, service in memory
P0: I --> ISAD --> ISD --> S
P1: I --> ISAD --> ISD --> S

d. P15: read 110   Read miss, service in P0's cache
P1: read 110    Read miss, service in memory
P0: I --> ISAD --> ISD --> S
P1: I --> ISAD --> ISD --> S
P15: M --> S

e. P1: read 110            Read miss, service in P0's cache
P1: write 110 <-- 68    Write to S block, invalidate P0's cache
P0: M --> S --> I
P1: I --> ISAD --> ISD --> S --> SMA --> M

f. P0: read 118            Read miss, service in memory
P0: write 118 <-- 68    Write to S block, invalidate P1's cache
P1: write 118 <-- 48    Write miss, service in P0's cache
P0: I --> ISAD --> ISD --> S --> SMA --> M --> I (note that P0 stalls the address network until the data arrives, since it cannot respond yet)
P1: M --> S --> I --> IMAD --> IMD --> M


g. P15: read 128           Read miss, service in P1's cache
P1: write 128 <-- 88    Write to S block, invalidate P15's cache
P15.108: S --> I
P15.128: I --> ISAD --> ISD --> S --> I
P1: M --> S --> SMA --> M

4.10 a. P15: read 118   Read miss, service in memory
P15.latency: Lsend_req + Lreq_msg + Lread_memory + Ldata_msg + Lrcv_data = 5 + 10 + 80 + 30 + 20 = 145
P15.occupancy: Osend_req + Orcv_data = 2 + 8 = 10
Mem.occupancy: Oread_memory = 20

b. P15: write 100 <-- 48   Write miss, service in memory
P15.latency: Lsend_req + Lreq_msg + Lread_memory + Ldata_msg + Lrcv_data = 5 + 10 + 80 + 30 + 20 = 145
P15.occupancy: Osend_req + Orcv_data = 2 + 8 = 10
Mem.occupancy: Oread_memory = 20

c. P15: write 118 <-- 80   Write miss, service in memory
P15.latency: Lsend_req + Lreq_msg + Lread_memory + Ldata_msg + Lrcv_data = 5 + 10 + 80 + 30 + 20 = 145
P15.occupancy: Osend_req + Orcv_data = 2 + 8 = 10
Mem.occupancy: Oread_memory = 20

d. P15: write 108 <-- 80   Write to S block, send invalidate
P15.latency: Lsend_req = 5 cycles
P15.occupancy: Osend_req = 2

e. P15: read 128   Read miss, service in P1's cache
P15.latency: Lsend_req + Lreq_msg + Lsend_data + Ldata_msg + Lrcv_data = 5 + 10 + 10 + 30 + 20 = 75
P15.occupancy: Osend_req + Orcv_data = 2 + 8 = 10
P1.occupancy: Osend_data = 8

f. P0: read 118   Read miss, service in memory
P1: read 110   Read miss, service in P0's cache, serialize at address network
P0.latency: Lsend_req + Lreq_msg + Lread_memory + Ldata_msg + Lrcv_data = 5 + 10 + 80 + 30 + 20 = 145
P0.occupancy: Osend_req + Orcv_data + Osend_data = 2 + 8 + 8 = 18
Mem.occupancy: Oread_memory = 20
P1.latency: Oreq_msg + Lsend_req + Lreq_msg + Lsend_data + Ldata_msg + Lrcv_data = 2 + 5 + 10 + 10 + 30 + 20 = 77
P1.occupancy: Osend_req + Orcv_data = 2 + 8 = 10


g. P15: read 118            Read miss, service in memory
P1: write 110 <-- 78     Write miss, service in P0's cache, serialize at address network
P15.latency: Lsend_req + Lreq_msg + Lread_memory + Ldata_msg + Lrcv_data = 5 + 10 + 80 + 30 + 20 = 145
P15.occupancy: Osend_req + Orcv_data = 2 + 8 = 10
Mem.occupancy: Oread_memory = 20
P1.latency: Oreq_msg + Lsend_req + Lreq_msg + Lsend_data + Ldata_msg + Lrcv_data = 2 + 5 + 10 + 10 + 30 + 20 = 77
P1.occupancy: Osend_req + Orcv_data = 2 + 8 = 10
P0.occupancy: Osend_data = 4
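The latency sums in 4.10 all reduce to two patterns. A minimal Python sketch, using the case-study constants quoted in the answers above (the function names are ours):

    L = {"send_req": 5, "req_msg": 10, "read_memory": 80,
         "send_data": 10, "data_msg": 30, "rcv_data": 20}

    def miss_from_memory():
        # request -> memory read -> data message -> receive data
        return L["send_req"] + L["req_msg"] + L["read_memory"] + L["data_msg"] + L["rcv_data"]

    def miss_from_cache():
        # request -> owning cache sends data -> data message -> receive data
        return L["send_req"] + L["req_msg"] + L["send_data"] + L["data_msg"] + L["rcv_data"]

    print(miss_from_memory(), miss_from_cache())   # 145 75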

4.11 a. See Figure L.27.

b. P1: read 110    Read miss, P0 responds with data to P1 and memory
P15: read 110   Read miss, satisfied from memory
P1's read miss will force P0 to respond with data both to P1 and to memory and then transition to state S. The memory must receive the data and respond to all future requests. Because the address network and data network are not synchronized with each other, the memory may receive P1's GetS message or P0's Data message in either order. If it receives the data first, then it transitions to MSA and waits for P1's GetS to complete the transition to S. Thus P15's GetS will find the block in state S in memory. The trickier case, which actually occurs much more frequently, occurs when P1's GetS message arrives before the Data. This causes the memory to transition to MSD, to await the data. If P15's message also arrives before the data, the memory controller cannot respond (since it doesn't have the data). Instead, it must block the request until the data arrives (denoted by the 'z' in the protocol).

State | GetS        | GetM        | Data
S     | send Data/S | send Data/M | err
M     | -/MSD       | —           | save Data/MSA
MSA   | -/S         | —           | err
MSD   | z           | z           | save Data/S

Figure L.27 Table to specify the memory controller protocol.


4.12 See Figure L.28.

State | Read           | Write          | Replacement   | OwnReq      | Other GetS    | Other GetM    | Other Inv | Other PutM | Data
I     | send GetS/ISAD | send GetM/IMAD | error         | error       | —             | —             | —         | —          | error
S     | do Read        | send Inv/SMA   | I             | error       | —             | I             | I         | —          | error
O     | do Read        | send Inv/OMA   | send PutM/MIA | error       | send Data/O   | send Data/I   | I         | —          | error
M     | do Read        | do Write       | send PutM/MIA | error       | send Data/S   | send Data/I   | —         | —          | error
ISAD  | z              | z              | z             | ISD         | —             | —             | —         | —          | save Data/ISA
IMAD  | z              | z              | z             | IMD         | —             | —             | —         | —          | save Data/IMA
ISA   | z              | z              | z             | do Read/S   | —             | —             | —         | —          | error
IMA   | z              | z              | z             | do Write/M  | —             | —             | —         | —          | error
SMA   | z              | z              | z             | M           | —             | IIA           | IIA       | —          | error
OMA   | z              | z              | z             | M           | send Data/OMA | send Data/IIA | IIA       | —          | error
MIA   | z              | z              | z             | send Data/I | send Data/MIA | send Data/IIA | —         | —          | error
IIA   | z              | z              | z             | I           | —             | —             | —         | —          | error
ISD   | z              | z              | z             | error       | —             | z             | —         | —          | save Data, do Read/S
IMD   | z              | z              | z             | error       | z             | —             | —         | —          | save Data, do Write/M

Figure L.28 Solution to Exercise 4.12.

4.13 The Exclusive state (E) combines properties of Modified (M) and Shared (S). The E state allows silent upgrades to M, allowing the processor to write the block without communicating this fact to memory. It also allows silent downgrades to I, allowing the processor to discard its copy without notifying memory. The memory must have a way of inferring either of these transitions. In a simple bus-based system, this is typically done by having two wired-OR signals, Shared and Owned, that are driven by the caches and monitored by caches and memory. If a node makes a request, e.g., GetS, the memory monitors the Owned signal to see if it should respond or not. Similarly, if no node asserts Shared, the requesting node knows that no other caches have copies, and it can set the state to E. In a switched protocol, the pipelined nature of the interconnect makes it more difficult (but not impossible) to implement the Shared and Owned signals.

4.14 a. P0: write 110 <-- 80   Hit in P0's cache, no stall cycles for either TSO or SC
P0: read 108            Hit in P0's cache, no stall cycles for either TSO or SC

b. P0: write 100 <-- 80   Miss; TSO satisfies the write in the write buffer (0 stall cycles), SC must wait until it receives the data (145 stall cycles)
P0: read 108            Hit, but must wait for the preceding operation: TSO = 0, SC = 145

c. P0: write 110 <-- 80   Hit in P0's cache, no stall cycles for either TSO or SC
P0: write 100 <-- 90   Miss; TSO satisfies the write in the write buffer (0 stall cycles), SC must wait until it receives the data (145 stall cycles)

d. P0: write 100 <-- 80   Miss; TSO satisfies the write in the write buffer (0 stall cycles), SC must wait until it receives the data (145 stall cycles)
P0: write 110 <-- 90   Hit, but must wait for the preceding operation: TSO = 0, SC = 145

4.15 a. The default protocol handles a write miss in Lsend_req + Lreq_msg + Lread_memory + Ldata_msg + Lrcv_data = 5 + 10 + 80 + 30 + 20 = 145 stall cycles. The optimization allows the write to be retired immediately after the node sees its own request message: Lsend_req + Lreq_msg = 5 + 10 = 15 stall cycles. Since all memory stalls directly stall an in-order core, this is an important optimization.

b. The same analysis holds for the out-of-order core. The difference is that the out-of-order core has better latency tolerance, so the impact will be reduced. Nonetheless, even an out-of-order core has trouble hiding long memory latencies.

Case Study 3: Simple Directory-Based Coherence

4.16 a. P1: read 110
P0.B2: (S, 110, 00 30)
P1.B2: (S, 110, 00 30)
M.110: (DS, {P0,P1}, 00 30)

b. P15: read 128
P1.B1: (S, 128, 00 68)
P15.B1: (S, 128, 00 68)
M.128: (DS, {P1,P15}, 00 68)


c. P0: write 120 <-- 80
P0.B0: (M, 120, 00 80)
P15.B0: (I, 120, 00 20)
M.120: (DM, {P0}, 00 20)

d. P15: write 120 <-- 80
P15.B0: (M, 120, 00 80)
M.120: (DM, {P15}, 00 20)

e. P0: read 130
P1: read 110
P0.B2: (S, 130, 00 30)
P1.B2: (S, 110, 00 30)
M.110: (DS, {P1}, 00 30)
M.130: (DS, {P1}, 00 30)

f. P1: write 110 <-- 80
P0: read 110
P0.B2: (S, 110, 00 80)
P1.B2: (S, 110, 00 80)
M.110: (DS, {P0,P1}, 00 80)

g. P1: write 110 <-- 90
P0: read 130
P0.B2: (S, 130, 00 30)
P1.B2: (S, 110, 00 90)
M.110: (DM, {P1}, 00 10)
M.130: (DS, {P0}, 00 30)
Memory is not updated on a write to a block that is M in another cache.

4.17 a. P1: write 120 <-- 80 Send invalidate to P15

b. P1: write 110 <-- 88 Send fetch/invalidate to P1

c. P15: write 118 <-- 90 Send invalidate to P1

d. P15: write 108 <-- 98 Send invalidate to P1


4.18 See Figures L.29 and L.30.

Figure L.29 Cache states.

Figure L.30 Directory states.

[Figure L.29 shows the cache-state diagram for the directory protocol: states Modified, Owned, Shared, and Invalid, with CPU read and write hits and misses sending read-miss, write-miss, and invalidate messages to the directory, and with fetch, invalidate, and fetch-invalidate messages from the directory forcing data write-backs and downgrades. Figure L.30 shows the corresponding directory-state diagram: states Modified, Owned, Shared, and Invalid, where read and write misses return data value replies and update the sharers set, write misses to owned or modified blocks fetch or fetch-invalidate the owner's copy, and data write-backs remove the writer from the sharers set.]


4.19 The Exclusive state (E) combines properties of Modified (M) and Shared (S). The E state allows silent upgrades to M, allowing the processor to write the block without communicating this fact to memory. It also allows silent downgrades to I, allowing the processor to discard its copy without notifying memory. The memory must have a way of inferring either of these transitions. In a directory-based system, this is typically done by having the directory assume that the node is in state M and forwarding all misses to that node. If a node has silently downgraded to I, then it sends a NACK (Negative Acknowledgement) back to the directory, which then infers that the downgrade occurred. However, this results in a race with other messages, which can cause other problems.

Case Study 4: Advanced Directory Protocol

4.20 a. P15: read 118
P0: I --> ISD --> S
Dir: DS {P1} --> DS {P1,P15}
Dir: DI --> DS {P0}

b. P15: write 100 <-- 48
P15.120: S --> I
P15.100: I --> IMAD --> IMA --> M
Dir: DI --> DM {P15}

c. P15: write 118 <-- 80
P0: I --> IMAD --> IMA --> M
P1: S --> I
Dir: DS {P1} --> DM {P15}

d. P15: write 108 <-- 80
P15: S --> IMAD --> IMA --> M
Dir: DS {P0,P15} --> DM {P15}

e. P15: read 110
P15: I --> ISD --> S
P0: M --> MSA --> S
Dir: DM {P0} --> DS {P0,P15}

f. P15: read 128
P15.108: S --> I
P15.128: I --> IMAD --> IMA --> M
Dir: DM {P1} --> DS {P1,P15}

g. P15: write 110 <-- 40
P15: I --> IMAD --> IMA --> M
P0: M --> I
Dir: DM {P0} --> DM {P15}

4.21 a. P0: read 118
P15: read 118
P0: I --> ISD --> S
P1: I --> ISD --> S
Dir: DS {P1} --> DS {P0,P1} --> DS {P0,P1,P15}


b. P0: read 128
P15: write 128 <-- 60
P0: I --> ISD --> ISID --> I
P15: I --> IMAD --> IMA --> M
P1: M --> I
Dir: DM {P1} --> DMSD {P0},{P1} --> DS {P0,P1} --> DM {P15}

c. P1: read 120
P15: write 120 <-- 80
P0: I --> ISD --> ISID --> I
P1: S --> IMAD --> IMA --> M
P15: S --> I
Dir: DS {P15} --> DM {P0} --> DMSD {P0},{P1} --> DS {P0,P1}

d. P15: write 118 <-- 100
P0: write 118 <-- 110
P15: I --> IMAD --> IMA --> M --> I
P0: I --> IMAD --> IMA --> M
P1: S --> I
Dir: DS {P1} --> DM {P15} --> DM {P0}

e. P15: write 110 <-- 58
P1: write 110 <-- 68
P15: I --> IMAD --> IMA --> M --> I
P1: I --> IMAD --> IMA --> M
Dir: DM {P0} --> DM {P15} --> DM {P1}

f. P0: read 118
P1: write 118 <-- 68
P0: I --> ISD --> ISID --> I
P1: S --> IMAD --> IMA --> M
Dir: DS {P1} --> DS {P0,P1} --> DM {P1}

g. P15: read 128
P1: replace 128
P1: M --> MIA --> MIA --> I
P15: I --> IMAD --> IMA --> M
Dir: DM {P1} --> DMSD {P15},{P1} --> DS {P1}

4.22 a. P15: read 100   Miss, satisfied in memory
P15.latency: Lsend_msg + Lreq_msg + Lread_memory + Ldata_msg + Lrcv_data = 6 + 15 + 100 + 30 + 15 = 166

b. P15: read 128   Miss, satisfied in P1's cache
P15.latency: Lsend_msg + Lreq_msg + Lsend_msg + Lreq_msg + Lsend_data + Ldata_msg + Lrcv_data = 6 + 15 + 6 + 15 + 20 + 30 + 15 = 107

c. P15: write 118 <-- 80   Miss, satisfied in memory
P15.latency: Lsend_msg + Lreq_msg + Lread_memory + Ldata_msg + Lrcv_data = 6 + 15 + 100 + 30 + 15 = 166


d. P0: replace 110
P0.latency: Lsend_data + Ldata_msg + Lwrite_memory + Lsend_msg + Lreq_msg + Lack = 20 + 30 + 20 + 6 + 15 + 4 = 95

e. P1: write 110 <-- 80   Miss, satisfied from P0's cache
P1.latency: Lsend_msg + Lreq_msg + Lsend_msg + Lreq_msg + Lsend_data + Ldata_msg + Lrcv_data = 6 + 15 + 100 + max(1 + 15 + 6 + 15, 30 + 15) + 4 = 125 + max(37, 45) = 170

f. P15: write 108 <-- 70   Write to S block, downgrade to I and send GetM
P15.latency: Lsend_msg + Lreq_msg + Lread_memory + max(Linv + Lreq_msg + Lsend_msg + Lreq_msg, Ldata_msg + Lrcv_data) + Lack = 6 + 15 + 100 + max(1 + 15 + 6 + 15, 30 + 15) + 4 = 125 + max(37, 45) = 170
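A minimal Python sketch of the directory-protocol latency sums used in parts (a) through (d), with the case-study constants quoted above:

    Lsend_msg, Lreq_msg, Lread_mem = 6, 15, 100
    Lsend_data, Ldata_msg, Lrcv_data = 20, 30, 15
    Lwrite_mem, Lack = 20, 4

    miss_in_memory = Lsend_msg + Lreq_msg + Lread_mem + Ldata_msg + Lrcv_data
    miss_in_cache  = (Lsend_msg + Lreq_msg            # requester to directory
                      + Lsend_msg + Lreq_msg          # directory forwards to owner
                      + Lsend_data + Ldata_msg + Lrcv_data)
    writeback      = Lsend_data + Ldata_msg + Lwrite_mem + Lsend_msg + Lreq_msg + Lack

    print(miss_in_memory, miss_in_cache, writeback)   # 166 107 95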

4.23 All protocols must ensure forward progress, even under worst-case memory access patterns. It is crucial that the protocol implementation guarantee (at least with a probabilistic argument) that a processor will be able to perform at least one memory operation each time it completes a cache miss. Otherwise, starvation might result. Consider the simple spin lock code:

tas:    DADDUI R2, R0, #1
lockit: EXCH   R2, 0(R1)
        BNEZ   R2, lockit

If all processors are spinning on the same loop, they will all repeatedly issue GetM messages. If a processor is not guaranteed to be able to perform at least one instruction, then each could steal the block from the other repeatedly. In the worst case, no processor could ever successfully perform the exchange.

4.24 a. The MSA state is essentially a "transient O" because it allows the processor to read the data and it will respond to GetShared and GetModified requests from other processors. It is transient, and not a real O state, because memory will send the PutM_Ack and take responsibility for future requests.


b. MOSI Protocol

State | Read         | Write        | Replacement   | INV          | Forwarded_GetS | Forwarded_GetM | PutM_Ack | Data                 | Last ACK
I     | send GetS/IS | send GetM/IM | error         | send Ack/I   | error          | error          | error    | error                | error
S     | do Read      | send GetM/IM | I             | send Ack/I   | error          | error          | error    | error                | error
O     | do Read      | send GetM/OM | send PutM/OI  | error        | send Data      | send Data/I    | error    | —                    | —
M     | do Read      | do Write     | send PutM/MI  | error        | send Data/O    | send Data/I    | error    | error                | error
IS    | z            | z            | z             | send Ack/ISI | error          | error          | error    | save Data, do Read/S | error
ISI   | z            | z            | z             | send Ack     | error          | error          | error    | save Data, do Read/I | error
IM    | z            | z            | z             | send Ack     | IMO            | IMIA           | error    | save Data            | do Write/M
IMI   | z            | z            | z             | error        | error          | error          | error    | save Data            | do Write, send Data/I
IMO   | z            | z            | z             | send Ack/IMI | —              | IMOI           | error    | save Data            | do Write, send Data/O
IMOI  | z            | z            | z             | error        | error          | error          | error    | save Data            | do Write, send Data/I
OI    | z            | z            | z             | error        | send Data      | send Data      | /I       | error                | error
MI    | z            | z            | z             | error        | send Data      | send Data      | /I       | error                | error
OM    | z            | z            | z             | error        | send Data      | send Data/IM   | error    | save Data            | do Write/M

Figure L.31 MOSI directory cache controller transitions.


4.25 a. P1: read 110
P15: write 110 <-- 90
In this problem, both P1 and P15 miss and send requests that race to the directory. Assuming that P1's GetS request arrives first, the directory will forward P1's GetS to P0, followed shortly afterwards by P15's GetM. If the network maintains point-to-point order, then P0 will see the requests in the right order and the protocol will work as expected. However, if the forwarded requests arrive out of order, then the GetX will force P0 to state I, causing it to detect an error when P1's forwarded GetS arrives.

b. P1: read 110
P0: replace 110
P1's GetS arrives at the directory and is forwarded to P0 before P0's PutM message arrives at the directory and sends the PutM_Ack. However, if the PutM_Ack arrives at P0 out of order (i.e., before the forwarded GetS), then this will cause P0 to transition to state I. In this case, the forwarded GetS will be treated as an error condition.

State | GetS | GetM | PutM (owner) | PutM (nonowner)
DI | send Data, add to sharers/DS | send Data, clear sharers, set owner/DM | error | send PutM_Ack
DS | send Data, add to sharers | send INVs to sharers, clear sharers, set owner, send Data/DM | error | send PutM_Ack
DO | forward GetS, add to sharers | forward GetM, send INVs to sharers, clear sharers, set owner/DM | save Data, send PutM_Ack/DS | send PutM_Ack
DM | forward GetS, add requester and owner to sharers/DO | forward GetM, send INVs to sharers, clear sharers, set owner | save Data, send PutM_Ack/DI | send PutM_Ack

Figure L.32 Directory controller transitions.

Chapter 5 Solutions

Case Study 1: Optimizing Cache Performance via Simple Hardware

5.1 a. The access time of the direct-mapped cache is 0.55 ns, while the 2-way and 4-way are 0.77 ns and 0.75 ns respectively. This makes the relative access times 0.77/0.55 = 1.40 or 40% more for the 2-way and 0.75/0.55 = 1.36 or 36% more for the 4-way.



b. The access time of the 16 KB cache is 0.70 ns, while the 32 KB and 64 KB are 0.72 ns and 0.75 ns respectively. This makes the relative access times 0.72/0.70 = 1.03 or 3% larger for the 32 KB and 0.75/0.70 = 1.07 or 7% larger for the 64 KB.

c. The organization with the lowest average memory access time is a 32 KB 2-way set associative cache. Caches with lower miss rates have access times larger than 0.75 ns.

5.2 a. The average memory access time of the current cache is 0.0037 × 20 + (1 – 0.0037) = 1.07 cycles. The AMAT of the way-predicted cache has three components: miss, hit with way prediction correct, and hit with way prediction mispredicted: 0.0037 × 20 + (0.80 × 1 + (1 – 0.80) × 2) × (1 – 0.0037) = 1.27.

b. The access time of the 64 KB 2-way cache is 0.77 ns, while the 16 KB direct-mapped cache can be accessed in 0.47 ns. This provides 0.77/0.47 = 1.64 or 64% faster execution.

c. With a 1 cycle way misprediction penalty, AMAT is 1.27 (as per part a), but with a 15 cycle misprediction penalty, the AMAT becomes 0.0037 × 20 + (0.80 × 1 + (1 – 0.80) × 15) × (1 – 0.0037) = 3.86 cycles, for an increase of 3.86 – 1.27 = 2.59!

d. The serial access is 2.37 ns/2.28 ns = 1.039 or 3.9% slower, but dissipates 0.085 nJ/0.128 nJ = 0.66 or 34% less energy.
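A minimal Python sketch of the AMAT arithmetic in parts (a) and (c); the rates and penalties are the case-study values used above, and the helper name is ours:

    def amat_way_predicted(miss_rate, miss_penalty, pred_rate, mispredict_penalty, hit_time=1):
        # hit time is 1 cycle when the way prediction is correct, otherwise mispredict_penalty
        hit = pred_rate * hit_time + (1 - pred_rate) * mispredict_penalty
        return miss_rate * miss_penalty + hit * (1 - miss_rate)

    print(round(0.0037 * 20 + (1 - 0.0037), 2))                 # baseline: 1.07
    print(round(amat_way_predicted(0.0037, 20, 0.80, 2), 2))    # part (a): 1.27
    print(round(amat_way_predicted(0.0037, 20, 0.80, 15), 2))   # part (c): 3.86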

5.3 a. The access time is 0.773 ns, while the cycle time is 0.296 ns, which could be potentially pipelined as finely as 0.773/0.296 = 2.61 pipestages.

b. The baseline AMAT with deeper pipelining becomes 0.00367 × 40 + (1 – 0.00367) × 1 = 1.14. Adding another cycle 15% of the time would make it 1.29.

c. The pipelined design (not including latch area and power) has an area of 0.61 mm2 and energy per access of 0.024 nJ. The banked cache has approximately twice the area and energy per access of the 32 KB 2-way cache, or an area of 0.63 mm2 and energy per access of 0.020 nJ. The banked design uses slightly more area because it has more sense amps and other circuitry to support the two banks, while the pipelined design burns slightly more power because the memory arrays that are active are larger than in the banked case.

5.4 With critical word first, the miss service would require 120 cycles. Without critical word first, it would require 120 cycles for the first 16 B and 16 cycles for each of the next three 16 B blocks, or 120 + (3 × 16) = 168 cycles.

5.5 a. Assume merging write buffer entries are 16 B wide. Since each store can write 8 B, a merging write buffer entry would fill up in 2 cycles. The level-2 cache will take 4 cycles to write each entry. A non-merging write buffer would take 4 cycles to write the 8 B result of each store. This means the merging write buffer would be 2 times faster.


b. With blocking caches, the presence of misses effectively freezes progress made by the machine, so whether there are misses or not doesn't change the required number of write buffer entries. With nonblocking caches, writes can be processed from the write buffer during misses, which may mean fewer entries are needed.

Case Study 2: Optimizing Cache Performance via Advanced Techniques

5.6 a. Each block can take up half the cache, or 32 KB. This provides 32 KB/8 B = 4K elements. So blocks should be 64 × 64 = 4K elements.

b. 2-way set associative. In a direct-mapped cache the blocks could be allocated so that they map to overlapping regions in the cache.

c. You should be able to determine the level-1 cache size by varying the block size. The ratio of the blocked and unblocked program speeds for arrays that do not fit in the cache, in comparison to blocks that do, is a function of the cache block size, whether the machine has out-of-order issue, and the bandwidth provided by the level-2 cache. You may have discrepancies if your machine has a write-through level-1 cache and the write buffer becomes a limiter of performance.

5.7 Since the unblocked version is too large to fit in the cache, processing eight 8 B elements requires fetching one 64 B row cache block and 8 column cache blocks. Since each iteration requires 2 cycles without misses, prefetches can be initiated every 2 cycles, and the number of prefetches per iteration is more than one, the memory system will be completely saturated with prefetches. Because the latency of a prefetch is 16 cycles, and one will start every 2 cycles, 16/2 = 8 will be outstanding at a time.

5.8 a. Solutions will vary.

b. Solutions will vary.

Case Study 3: Main Memory Technology and Optimizations

5.9 a. A 1 GB DRAM with parity or ECC effectively has 9-bit bytes and would require nine 1 Gb DRAMs. To create 72 output bits, each one would have to output 72/9 = 8 bits.

b. A burst length of 8 reads out 32B.

c. The DDR2-667 DIMM bandwidth is 667/533 = 1.25, or 25% higher, for reads to active pages. The DDR2-667 DIMM costs 130/100 = 1.30 or 30% more.

d. This is similar to the scenario given in the figure, but tRCD and CL are both 5. Thus it requires 5 + 5 + 4 = 14 cycles of a 333 MHz clock, or 14 × (1/333 MHz) = 42.0 ns.


e. The read to an open bank requires 5 + 4 = 9 cycles of a 333 MHz clock, or 27.0 ns. In the case of a bank activate, this is 14 cycles, or 42.0 ns. Including 20 ns for miss processing on chip, this makes the two 42 + 20 = 62 ns and 27.0 + 20 = 47 ns. Including time on chip, the bank activate takes 62/47 = 1.32 or 32% longer.
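A minimal Python sketch of the DRAM timing in parts (d) and (e), assuming tRCD = CL = 5, a 4-cycle burst, a 333 MHz clock, and 20 ns of on-chip miss handling, as above:

    clock_ns = 1 / 0.333                       # one 333 MHz cycle, in ns
    bank_activate = (5 + 5 + 4) * clock_ns     # tRCD + CL + burst
    open_page     = (5 + 4) * clock_ns         # CL + burst
    print(round(bank_activate, 1), round(open_page, 1))        # 42.0 27.0
    print(round((bank_activate + 20) / (open_page + 20), 2))   # ~1.32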

5.10 The costs of the two systems are 2 × $130 + $800 = $1060 with the DDR2-667 DIMM and 2 × $100 + $800 = $1000 with the DDR2-533 DIMM. The latency to service a level-2 miss is 14 × (1/333 MHz) = 42 ns with the DDR2-667 DIMM and 12 × (1/266 MHz) = 45 ns with the DDR2-533 DIMM. The CPI added by the level-2 misses in the case of DDR2-667 is 0.00333 × 42 = 0.140, giving a total of 1.5 + 0.140 = 1.64. Meanwhile the CPI added by the level-2 misses for DDR2-533 is 0.00333 × 45 = 0.150, giving a total of 1.5 + 0.150 = 1.65. Thus the difference is only 1.65/1.64 = 1.006, or 0.6%, while the cost is $1060/$1000 = 1.06 or 6.0% greater. The cost-performance product of the DDR2-667 system is 1.64 × 1060 = 1738, while that of the DDR2-533 system is 1.65 × 1000 = 1650, so the DDR2-533 system is a better value.
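A minimal Python sketch of this comparison; prices, miss rate, base CPI, and clocks are the case-study values used above, and a lower CPI x cost product is better:

    lat_667 = 14 * (1 / 0.333)                 # ns per level-2 miss, DDR2-667
    lat_533 = 12 * (1 / 0.266)                 # ns per level-2 miss, DDR2-533
    cpi_667 = 1.5 + 0.00333 * lat_667
    cpi_533 = 1.5 + 0.00333 * lat_533
    cost_667, cost_533 = 2 * 130 + 800, 2 * 100 + 800
    print(round(cpi_667, 2), round(cpi_533, 2))                  # ~1.64 ~1.65
    print(round(cpi_667 * cost_667), round(cpi_533 * cost_533))  # ~1738 ~1650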

5.11 The cores will be executing 8 cores × 3 GHz / 2.0 CPI = 12 billion instructions per second. This will generate 12 × 0.00667 = 80 million level-2 misses per second. With the burst length of 8, this would be 80 × 32 B = 2560 MB/sec. If the memory bandwidth is sometimes 2X this, it would be 5120 MB/sec. From Figure 5.14, this is just barely within the bandwidth provided by DDR2-667 DIMMs, so just one memory channel would suffice.
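A minimal Python sketch of the bandwidth estimate (8 cores, 3 GHz, CPI 2.0, 0.00667 level-2 misses per instruction, and 32 B per miss, all from the case study):

    instr_per_sec = 8 * 3e9 / 2.0
    miss_per_sec  = instr_per_sec * 0.00667
    mb_per_sec    = miss_per_sec * 32 / 1e6
    print(instr_per_sec, miss_per_sec, mb_per_sec, 2 * mb_per_sec)
    # ~12e9 instructions/s, ~80e6 misses/s, ~2560 MB/s, ~5120 MB/s at the 2x peak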

5.12 The system built from 1 Gbit DRAMs will have twice as many banks as the system built from 2 Gbit DRAMs. Thus the 1 Gbit-based system should provide higher performance since it can have more banks simultaneously open.

Case Study 4: Virtual Machines

5.13 a. Yes. The application and production environment can be run on a VM hosted on a development machine.

b. Yes. Applications can be redeployed on the same environment on top of VMs running on different hardware. This is commonly called business continuity.

c. No. Depending on support in the architecture, virtualizing I/O may add significant or very significant performance overheads.

d. Yes. Applications running on different virtual machines are isolated from each other.

e. Yes. See "Devirtualizable virtual machines enabling general, single-node, online maintenance," David Lowell, Yasushi Saito, and Eileen Samberg, in the Proceedings of the 11th ASPLOS, 2004, pages 211–223.


5.14 a. Programs that do a lot of computation but have small memory working sets and do little I/O or other system calls.

b. The slowdown above was 60% for 10%, so 20% system time would run 120% slower.

c. The median slowdown using pure virtualization is 10.3, while for paravirtualization the median slowdown is 3.76.

d. The null call and null I/O call have the largest slowdown. These have no real work to outweigh the virtualization overhead of changing protection levels, so they have the largest slowdowns.

5.15 The virtual machine running on top of another virtual machine would have to emulate privilege levels as if it were running on a host without VT-x technology.

5.16 a. As of the date of the Computer paper, AMD SVM adds more support for virtualizing virtual memory, so it could provide higher performance for memory-intensive applications with large memory footprints.

b. Both provide support for interrupt virtualization, but AMD's IOMMU also adds capabilities that allow secure virtual machine guest operating system access to selected devices.

Case Study 5: Putting It All Together: Highly Parallel Memory Systems

5.17 a. The second-level cache is 1 MB and has a 128B block size.

b. The miss penalty of the second-level cache is approximately 105 ns.

c. The second-level cache is 8-way set associative.

d. The main memory is 512 MB.

e. Walking through pages with a 16 B stride takes 946 ns per reference. With 250 such references per page, this works out to approximately 0.24 ms per page.

5.18 a. Hint: This is visible in the graph above as a slight increase in L2 miss service time for large data sets, and is 4 KB for the graph above.

b. Hint: Take independent strides by the page size and look for increases in latency not attributable to cache sizes. This may be hard to discern if the amount of memory mapped by the TLB is almost the same as the size of a cache level.

c. Hint: This is visible in the graph above as a slight increase in L2 miss service time for large data sets, and is 15 ns in the graph above.

d. Hint: Take independent strides that are multiples of the page size to see if the TLB is fully associative or set associative. This may be hard to discern if the amount of memory mapped by the TLB is almost the same as the size of a cache level.


5.19 a. Hint: Look at the speed of programs that easily fit in the top-level cache as a function of the number of threads.

b. Hint: Compare the performance of independent references as a function of their placement in memory.

5.20 Solutions will vary.

Chapter 6 Solutions

Case Study 1: Deconstructing a Disk

6.1 The answers to this exercise can be determined from Figure L.33.

a. 4.0 ms; determined at point 3 (P3) in the graph.

b. 8.33 ms; determined at point 1 (P1) in the graph, or (more robustly) the y-axis difference between line 1 (L1) and line 2 (L2). The simplest way to think about this is to consider the transition from P2 to P3. At P2, the algorithm incurs the time of a full rotation plus the minimum transfer time (worst case), and at P3, just the minimum transfer time (best case). The difference between the two cases is the rotation time.

c. 1.0 ms; determined by the difference between L2 and L3. When skipping over larger distances, occasionally a head switch must take place to move us onto a different track. Hence, these show up as small blips (on L3) just above the baseline (L2).

Figure L.33 Results from running Skippy on Disk Gamma.


[The figure plots time (ms), from 0 to 14, against distance (sectors), from 0 to 250, with lines L1–L4, points P1–P3, and the head-switch region H marked.]


d. 3.0 ms; determined by the difference between L2 and L4. Similar to above, when a cylinder is exhausted, a switch must occur to the next one. These show up as (larger) blips (on L4) above the baseline (L2).

e. 5 heads; determined by counting the number of head switches that occur between the first two cylinder switches (labeled H on the solutions graph).

6.2 The answers to this exercise can be determined from Figure L.34.

The graph would look something like that in Figure L.34.

The minimal transfer time determines the y-axis value of point 3 (P3). The minimal transfer time divided by the rotational latency determines the x-axis value of point 3, as it reflects the number of sectors traversed in that time. The slope of the lines L1 . . . L4 is determined by dividing the rotational time by the number of sectors per track; L2 is the base line and essentially goes through the y-axis at x = 0, L3 is just a head switch time above L2, and L4 a cylinder switch time above L2. The most difficult aspect of drawing the graph, then, is determining exactly where each point goes. A simple algorithm that counts how many sectors have been traversed by each skip is needed; when the total exceeds the number of sectors per track, one must plot a point on the head switch line (L2); when all surfaces have been used, a cylinder switch must take place (L3); otherwise, each skip simply takes a little longer rotationally and hence must stay on L1.

6.3 Solutions will vary.

Figure L.34 Results from running Skippy on Disk Delta.

[The figure plots time (ms), from 0 to 14, against distance (sectors), from 0 to 400, with lines L1–L4, points P1–P3, and the head-switch region H marked.]


Case Study 2: Deconstructing a Disk Array

6.4 a. ■ 128 KB, as indicated by the highest time.
■ These times are determined by the level of parallelism being exercised by the workload. The higher the number, the lower the parallelism.
■ 8. There are three peaks above the baseline, the highest at 128 KB. Because you know that at 128 KB, you have 1 disk, you know that at 64 KB, you have 2 disks, 4 disks at 32 KB, and thus 8 disks at the baseline.
■ 16 KB. With 8 disks and a repeating pattern at 128 KB, division yields the answer.

b. Your graph should look similar to Figure 6.26, but with a peak time of 4 s (1000 requests, each taking 4 ms). The peaks would occur at the same places, but with times of 2 s, 1 s, and 0.5 s.

6.5 a. ■ 4 KB. We see dips every 4 KB, which corresponds to the requests getting spread across two drives (neighbors) instead of landing on the same disk.
■ Once again, they reflect the difference in parallelism; the higher time implies requests directed to one disk, the lower time means two.

b. Dips would occur every 12 KB, when boundaries are detected. The height of the peaks would be 3 s, the valleys would be half that.

6.6 a. 6; divide pattern size by chunk size.

b. All distinct chunks appear to be allocated on different disks.

c. 6 disks; 6 chunks, and each is on a different disk.

d. Seems like simple striping, 0, 1, 2, . . . , 5 across disks, and continuing similarly across rows.

6.7 The graph would be 4 by 4, as the pattern size is 16 (matching the stripe size). Within the pattern, no chunk pairs collide, and hence only the diagonal would have "high" values (and hence be lighter shading): (0, 0), (1, 1), . . . , (3, 3). Hence, this diagram looks just like RAID 0.

Case Study 3: RAID Reconstruction

6.8 Solutions will vary.

6.9 Assuming independent failures for the six disks and an MTTF for each disk of 1.2 M hours (shown in Figure 6.3), the expected time until a failure is 1.2 M hours / 6 = 200,000 hours.

6.10 a. To reconstruct the failed disk, the RAID 10 array must read the data blocks from the relevant mirror and write those blocks to the reconstructed disk. Note that the read of block i + 1 can proceed in parallel with the write to block i.


b. The write bandwidth of the disk is the limiting factor in this case. The limiting bandwidth is thus 85 MB/sec. The time to read and write 37 GB at 85 MB/sec is 37 GB / 85 MB/sec = 446 seconds = 7.43 minutes = 0.124 hours.

c. With a 10 MB/sec limit, the bandwidth delivered to/from each disk is 10 MB/sec / 2 = 5 MB/sec. Reading or writing 37 GB at 5 MB/sec requires 37 GB / 5 MB/sec = 7578 seconds = 126 minutes = 2.1 hours.
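A minimal Python sketch of the rebuild-time arithmetic in parts (b) and (c), assuming 37 GB per disk, an 85 MB/sec disk write limit, and a 10 MB/sec background limit split between the read and the write, as above:

    mb = 1024                              # MB per GB
    print(round(37 * mb / 85))             # ~446 s: offline rebuild, write-bandwidth limited
    print(round(37 * mb / (10 / 2)))       # ~7578 s: online rebuild at 5 MB/sec per disk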

6.11 a. Probability, x, of a second failure during offline reconstruction: 1 / 1.2 M hours = x / 0.124 hours, so x = 1 × 10^-7.

b. Offline MTDL = 200,000 hours / (1 × 10^-7) = 2 × 10^12 hours.

c. Probability, x, of a second failure during online reconstruction: 1 / 1.2 M hours = x / 2.1 hours, so x = 1.8 × 10^-6.

d. Online MTDL = 200,000 hours / (1.8 × 10^-6) = 1.1 × 10^11 hours.

6.12 Assume that the bus does not limit the IOPS for a random workload. With offline reconstruction, the performance during reconstruction is 0.

Performability = Probability_no_failures × 285 IOPS × 3 disks
= (1 − 0.124 hours / 200,000 hours) × 855 IOPS
= 854.99947

6.13 The performability of offline reconstruction is slightly better:

Performability_online = Probability_no_failures × 285 IOPS × 3 disks + Probability_1_failure × (0.70 × 285 IOPS + 285 IOPS × 2 disks)
= (1 − 2.1 hours / 200,000 hours) × 855 IOPS + (2.1 hours / 200,000 hours) × 769.5 IOPS
= 854.99102 + 0.0080798
= 854.9991
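A minimal Python sketch of the performability arithmetic in 6.12-6.13 (855 IOPS for the healthy array, a 200,000-hour expected time to first failure, a 0.124-hour offline rebuild, and a 2.1-hour online rebuild at 769.5 IOPS, as above):

    mttf_hours = 200_000
    p_off = 0.124 / mttf_hours             # fraction of time spent in offline rebuild
    p_on  = 2.1 / mttf_hours               # fraction of time spent in online rebuild
    print((1 - p_off) * 855)                       # ~854.99947 (offline: 0 IOPS while rebuilding)
    print((1 - p_on) * 855 + p_on * 769.5)         # ~854.9991  (online)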



6.14 To construct the missing blocks, one should first read all of the blocks from the four available disks in order to obtain the best sequential read bandwidth. Then, only after all of the missing blocks have been reconstructed should those blocks be written out to the two repaired disks, to again obtain the best sequential write bandwidth. The reconstruction computations can be performed in the following order (note that other orderings are also possible):

Disk 2, Block 2: Reconstruct using diagonal parity.

Disk 3, Block 3: Reconstruct using row parity.

Disk 0, Block 3: Reconstruction using diagonal parity.

Disk 3, Block 4: Reconstruction using row parity.

Disk 0, Block 4: Reconstruction using diagonal parity.

Disk 3, Block 0: Reconstruction using row parity.

Disk 0, Block 0: Reconstruction using diagonal parity.

Disk 3, Block 1: Reconstruction using row parity (could have used diagonal parity from the beginning as well).

Case Study 4: Performance Prediction for RAIDs

6.15 a. 5 ms

b. Utilization = Arrival rate × Service time = 167 requests/sec × 5 ms = 83%

c. Wait time = Time_server × Util / (1 − Util) = 5 ms × 0.83 / (1 − 0.83) = 24.4 ms

d. Util^2 / (1 − Util) = 4.05

e. Wait time + Service time = 24.4 ms + 5 ms = 29.4 ms

6.16 a. A RAID 0 array with N disks is treated as N independent M/M/1 queues. Note that with more disks, the seek time per disk improves, since less data is allocated to it. We assume that the request rate to each independent disk is 1/N-th the original rate.

b. Average service time: 5.0 ms/N:

N = 1 : 5.0 ms

N = 2 : 2.5 ms

N = 4 : 1.25 ms

N = 8 : 0.625 ms



Utilization:

N = 1 : 167 × 5 = 83%

N = 2 : 167/2 × 2.5 = 21%

N = 4 : 167/4 × 1.25 = 5.2%

N = 8 : 167/8 × 0.625 = 1.3%

Wait time:

N = 1 : 5.0 × 0.83/(1 − 0.83) = 24.4 ms

N = 2 : 2.5 × 0.21/(1 − 0.21) = 0.665 ms

N = 4 : 1.25 × 0.052/(1 − 0.052) = 0.069 ms

N = 8 : 0.625 × 0.013/(1 − 0.013) = 0.0082 ms

Response time:

N = 1 : 24.4 ms + 5.0 ms = 29.4 ms

N = 2 : 0.665 ms + 2.5 ms = 3.165 ms

N = 4 : 0.069 + 1.25 ms = 1.319 ms

N = 8 : 0.0082 ms + 0.625 = 0.6332 ms
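A minimal Python sketch of the M/M/1 bookkeeping used throughout 6.15-6.18; the text rounds utilization before reusing it, so the last digit of its wait times can differ slightly:

    def mm1(arrival_per_sec, service_ms):
        util = arrival_per_sec * service_ms / 1000.0
        wait = service_ms * util / (1.0 - util)
        return util, wait, wait + service_ms      # utilization, wait time, response time

    for n in (1, 2, 4, 8):                        # RAID 0 with N disks, as in 6.16
        u, w, r = mm1(167 / n, 5.0 / n)
        print(n, round(u, 3), round(w, 3), round(r, 3))
    # e.g. N = 1: util ~0.835, wait ~25 ms, response ~30 ms (text: 0.83, 24.4, 29.4)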

6.17 a. For a read-only workload, each random request is destined for a distinct data disk; therefore, we can treat the system as N – 1 independent M/M/1 queues, with a request rate of 1/(N – 1) of the original rate.

b. Average service time: 5.0 ms/(N – 1)

N = 4, N – 1 = 3 : 1.667 ms

N = 8, N – 1 = 7 : 0.714 ms

Utilization:

N = 4 : 167/3 × 1.667 = 9.27%

N = 8 : 167/7 × 0.714 = 1.70%

Wait time:

N = 4 : 1.667 × 0.093/(1 − 0.093) = 0.171 ms

N = 8 : 0.714 × 0.017/(1 − 0.017) = 0.012 ms



Response time:

N = 4 : 0.171 ms + 1.667 ms = 1.838 ms

N = 8 : 0.012 ms + 0.714 ms = 0.726 ms

6.18 a. For a write-only workload, each random request must update the parity block within the corresponding stripe; as a result, the single parity disk becomes the bottleneck of the system and determines the performance of the system. Thus, we can model the system as a single M/M/1 queue with a request rate equal to the original rate. The only difference with more disks is that the service time decreases, because less data is allocated to each disk.

b. Average service time: 5.0 ms/(N – 1):

N = 4, N – 1 = 3 : 1.667 ms

N = 8, N – 1 = 7 : 0.714 ms

Utilization:

N = 4 : 167 × 1.667 = 27.8%

N = 8 : 167 × 0.714= 11.9%

Wait time:

N = 4 : 1.667 × 0.278/(1 − 0.278) = 0.642 ms

N = 8 : 0.714 × 0.119/(1 − 0.119) = 0.096 ms

Response time:

N = 4 : 0.642 ms + 1.667 ms = 2.309 ms

N = 8 : 0.096 ms + 0.714 = 0.81 ms

Case Study 5: I/O Subsystem Design

6.19 If reliability is not a concern, then RAID 0 gives the best capacity and performance; with RAID 0, we waste no space for redundancy to recover from failures, and each independent disk can be used to handle a random I/O request. Larger block sizes amortize the positioning costs, while smaller block sizes ensure that only needed data is actually transferred; therefore, the block size should roughly match the request size of 16 KB.

6.20 In our system, we want to purchase as many disks as possible. To best use our budget of $18,000, we therefore maximize the number of disks, given the following constraints:

SystemCost = CPUCost + CntrlerCost × CntrlerNum + EnclCost × EnclNum + DiskCost × DiskNum

DiskNum ≤ EnclNum × 8

DiskNum ≤ CntrlerNum × 15

18,000 = 1000 + 250 × DiskNum/15 + 2000 × DiskNum/8 + 150 × DiskNum

DiskNum = 17,000 / 417 = 40.7



Since the number of each component must be an integer, we choose DiskNum = 40, CntrlerNum = 3, and EnclNum = 5.

a. Each disk can deliver 285 IOPS. The 50,000 IOPS controller is not the bottleneck of the system (assuming 15 disks or fewer per string). If all disks are operating concurrently on the random requests, and the disks are the bottleneck of the system, then the storage system can deliver 285 × 40 = 11,400 IOPS.

b. 1000 + 250 × 3 + 2000 × 5 + 150 × 40 = 17,750
Note that even though we have $250 remaining, it cannot be used for another disk, because we do not have space in any of the enclosures.

c. 40 × 37 GB = 1.45 TB

d. Failure rate = 1/1,000,000 + 3/1,000,000 + 3/500,000 + 5/1,000,000 + 5/200,000 + 5/200,000 + 40/1,200,000 = 0.0000983

MTTF = 1 / Failure rate = 10,169 hours
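A minimal Python sketch cross-checking the 6.20 configuration (component MTTFs as used in the failure-rate sum above):

    disks, ctrls, encls = 40, 3, 5
    cost = 1000 + 250 * ctrls + 2000 * encls + 150 * disks
    iops = 285 * disks
    fail = (1/1_000_000 + ctrls/1_000_000 + ctrls/500_000 + encls/1_000_000
            + encls/200_000 + encls/200_000 + disks/1_200_000)
    print(cost, iops, round(fail, 7), round(1 / fail))   # 17750 11400 9.83e-05 ~10169 hours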

6.21 Solutions will vary.

6.22 We now need the number of controllers to be equal to the number of enclosures. To maximize reliability, we want the fewest number of components such that we meet our capacity constraints. This implies we need 1 TB / 37 GB = 27.6 = 28 disks containing usable data, but we will need more disks due to the RAID 5 parity blocks. The number of disks in each orthogonal RAID 5 array determines exactly how much space we end up wasting on parity. Given that eight disks fit in a chain, we need (at least) 4 enclosures and 4 controllers for the 28 disks; this actually gives us room for 32 disks. However, a RAID array with 4 disks only gives us usable space of 3/4 × 32 × 37 GB = 888 GB. Therefore, we need (at least) 5 enclosures and controllers, giving us room for 40 disks; we can verify that 4/5 × 40 × 37 GB = 1.16 TB.

This goes slightly over our cost constraints: 1000 + 250 × 5 + 2000 × 5 + 150 × 40 = 18,250.

We must therefore have only 38 disks; thus, six of the RAID 5 arrays will contain 5 disks and two of the RAID 5 arrays will contain 4 disks.



a. Failure rate = 1/1,000,000 + 5/1,000,000 + 5/500,000 + 5/1,000,000 + 5/200,000 + 5/200,000 + 38/1,200,000 = 0.0001027

MTTF = 1 / Failure rate = 9740 hours

b. To lose data in this system, we must lose two disks in the same RAID 5 array before the first is recovered (which requires 24 hours). Note that the failure of an enclosure or a controller will not lead to permanent data loss. The expected time before one disk fails is 1,200,000 hours / 38 = 31,578 hours.

After one disk fails, if any of the four other disks in the RAID 5 array fail, we will lose data: 4 / 1.2 M hours = x / 24 hours, so x = 8 × 10^-5.

Therefore, MTDL = 31,578 hours / (8 × 10^-5) = 3.95 × 10^8 hours.

c. 6 × 4/5 × 5 × 37 GB + 2 × 3/4 × 4 × 37 GB = 888 + 222 = 1.08 TB

d. 1000 + 250 × 5 + 2000 × 5 + 150 × 38 = 17,950

e. For a read-only workload, every disk can be working concurrently, deliveringa total of 285 IOPS × 38 = 10,830 IOPS.
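A minimal Python sketch of the data-loss arithmetic in parts (a)-(b): 38 disks with a 1.2 M hour MTTF each, and a 24-hour window in which a second failure among the four surviving disks of the same array loses data:

    disk_mttf = 1_200_000
    first_failure = disk_mttf / 38        # ~31,578 hours until some disk fails
    p_second = 4 / disk_mttf * 24         # chance of a second failure in the 24-hour rebuild
    print(round(first_failure), p_second, round(first_failure / p_second / 1e8, 2))
    # ~31578 hours, ~8e-05, MTDL ~3.95e8 hours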

6.23 Small writes (i.e., writes to an amount smaller than the stripe size) perform poorly on RAID 5 arrays, because the RAID must update the parity block; when only a partial stripe is written, the RAID must read the old value of the parity block in order to calculate the new value of the parity block. If the average size of the write requests increases to the blocksize × 4, then performance would improve.

Case Study 6: Dirty Rotten Bits

6.24 a. A half-full file system of 37 GB has 18.5 GB of data. We need only checksums to detect single errors (no correction here), and therefore need a 20-byte MD5 for every 4 KB of each file. We don't need to look at the file distribution here, because we know we have 18.5 GB of data in 4 KB blocks (roughly). Hence, we have 18.5 GB / 8 KB blocks, or 2,424,832 blocks.

11,000,000------------------------ 5

1,000,000------------------------ 5

500,000------------------- 5

1,000,000------------------------ 5

200,000------------------- 5

200,000------------------- 38

1,200,000------------------------+ + + + + +

1Failure rate---------------------------

1,200,00038

------------------------

41.2M hours ----------------------------- x

24 hours--------------------=

x 8 × 10 5–=

31,587 hours

8 × 10 5–------------------------------- 3.95 × 108= hours

18.5 GB8 KB

--------------------


In terms of size, that is 92.5 MB of checksum data (4 bytes per 8 KB block). Another way to get this: simply multiply the total data size (18.5 GB) by the additional fractional amount that the checksum will add per block (20/4096). As a percentage, this is 0.05% of the volume (4 bytes extra per data block, or 4/8192).

b. The answer here requires us to add a mirror block for every block in each file. Thus, we simply take the 18.5 GB and double it—it is that simple (by percentage this is 100% of the data size).

c. The file size distribution does not determine the checksum overhead. That said, bigger blocks mean less protection but with less overhead; if the 0.05% overhead for 4-byte checksums is too much, it is easy to increase the block size to reduce this number. If comparing to the space cost of recovery, the mirroring protection information takes up much more space (by a factor of a thousand), so perhaps it is easy to justify more checksum data per file.

6.25 a. With 10 GB of data, it takes you roughly 200 seconds to scan all of the data at the 50 MB/sec rate. With 1 bit flip per GB per month, assuming a 30-day month, you get a bit flip on average once every 3 days in your 10 GB data set. Hence, to avoid a second bit flip corrupting another block in the same file, you should scan every few days or so to repair the likely flipped bit. A more sophisticated answer could take the file size distribution into account, but we leave that as an exercise for someone else.

b. Because it takes 200 seconds to scan the 10 GB (and thus detect and potentially correct a single bit flip), if we get more than 1 bit flip every 200 seconds, we are in trouble. Translating to bit flips per GB per month, since there are 2,592,000 seconds per 30-day month, we get roughly 1,296 flips per GB per month.

6.26 a. A 40 MB file requires one 4-byte checksum per 8 KB block, and 40 MB of mirror data. Hence, the additional write traffic is dominated by the mirror. It is easy to calculate that 40 MB implies 5120 blocks, which leads to 20 KB of checksums plus 40 MB of mirrors, or roughly 40 MB of extra traffic. As a percentage of total traffic, this is substantial: 100% more write traffic.

b. If we just focus on mirror traffic (as it dominates), 10,000 random writes generate an additional 10,000 mirror writes, or roughly an additional 39.1 MB of traffic (100% more). However, if we consider that each write is to a 4 KB chunk of a file, but that checksums are calculated over 8 KB, we realize that to generate the correct new checksum, we must read in the old 8 KB block, update it with the new 4 KB write, and compute the checksum over that. Thus, to keep checksums in order, we have to read in 10,000 8 KB blocks, or 78.2 MB of traffic (another 200%). The mismatch between the checksummed block size and the update size causes a great deal of additional traffic.



c. First, we must figure out how much checksum data is on disk. With a 10,000 block file, there must be 0.019 MB of checksums (roughly). Assume we read those into memory first, and then proceed to read the file sequentially. Thus, we get a single seek to the checksum file (9 ms), 0.19 MB read at 100 MB/sec (0.4 ms), another seek to the beginning of the file data (9 ms), and a long sequential read of the file (781.3 ms). Thus, the total time to read the file and checksums is 799.7 ms, of which 9.4 ms is due to the checksums. Overall, for sequential reads, the time penalty of checksums is small.

d. Assume that for each random read, you never have the checksum cached, and hence you must read it from disk too. Hence, for each random read, you must seek to the checksum (9 ms), read it from disk (negligible, just reading 4 bytes, or more realistically a single block, at 50 MB/sec), seek to the data block (9 ms), and read it too (again, negligible). Hence, the total time per read is 18 ms, and for all 10,000 reads, 180 seconds. The time penalty due to checksums is half of that time, or 90 seconds. Random reads, when checksums aren't cached, are much more costly.
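A minimal Python sketch of the traffic totals in 6.26(a)-(b), assuming a 4-byte checksum per 8 KB block and a full mirror, as above:

    file_mb = 40
    blocks = file_mb * 1024 // 8
    print(blocks, blocks * 4 / 1024)            # 5120 blocks, 20.0 KB of checksums
    writes = 10_000
    print(writes * 4 / 1024, writes * 8 / 1024)
    # ~39.1 MB of extra mirror writes and ~78.1 MB of read-modify-write reads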

6.27 Solutions may vary.

Case Study 7: Sorting Things Out

6.28 a. ■ Fast CPU: $1000
■ Fast Memory: $1000
■ 2 × Fast I/O Interconnect: $600
■ 4 × Fast Disks: $1200
■ Total: $3800

b. ■ Reading: 3.2 sec
■ Sorting: 1.37 sec
■ Writing: 3.2 sec
■ Total: 7.77 sec

c. In this system, it would have to be said that I/O is the bottleneck, as reading and writing time dominate. The I/O is limited by the speed of the I/O busses, and the sorting is limited by the rate of the CPU.

6.29 This is a bit of a search problem. Probably the easiest way to do it is to write a little program that searches through the possibilities. The configuration we came up with is described below, although many other similarly performing configurations are possible.

a. ■ Slow CPU: $200
■ Standard memory: $400
■ 2 × Fast I/O interconnect: $800
■ 8 × Slow Disks: $800
■ Total: $2000


b. ■ Reading: 3.2 sec
■ Sorting: 4.1 sec
■ Writing: 3.2 sec
■ Total: 10.5 sec

c. Sorting is much more painful in this world, with the CPU dominating sorting time. The I/O system is fine and limited by the disks and the bus at the same time, which is well balanced. Memory bandwidth is not an issue in all cases.

6.30 a. Although it is hard to compute average seek from a full seek cost, we can estimate average as something like one-third of the full cost: 8 ms. A full rotation takes 4 ms, so average rotation is 2 ms. Hence, total positioning time is 10 ms. Reading 1 KB of data at the full transfer rate takes a fraction of a ms or so, which we can probably ignore. Hence, every 1 KB takes about 10 ms to read. Thus, the full time to read in all 100 runs is 1,048,576 seconds. The full time to write out the data is 10,240 seconds, and hence the total time for the merge is 17,646.9 minutes.

b. The time to read a 1 MB chunk is again the positioning time, 10 ms, plus the transfer time, which is 100 ms. Hence, a total of 110 ms. Thus, the time to read all 100 1-GB runs is 11,264 seconds. The write time is unchanged: 10,240 seconds. Hence, the total time is 21,504.0 seconds.

c. Disk efficiency in the first case is 0.019, and 0.952 in the second. As you can see, the tiny block size (1 KB) makes very poor utilization of the disk.
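A minimal Python sketch of the merge-phase disk time in parts (a) and (b): 100 runs of 1 GB, 10 ms positioning per access, 10 MB/sec transfer, and a sequential write at the full rate (the 1 KB case ignores its sub-millisecond transfer time, as the text does):

    total_kb   = 100 * 1024 * 1024
    read_1kb_s = total_kb * 10e-3                        # one 10 ms access per KB
    read_1mb_s = (total_kb / 1024) * (10e-3 + 100e-3)    # 10 ms seek + 100 ms transfer per MB
    write_s    = total_kb / 1024 / 10                    # sequential write at 10 MB/sec
    print(read_1kb_s + write_s, read_1mb_s + write_s)    # ~1,058,816 s vs ~21,504 s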

6.31 Solutions will vary.