
Exercises for

EIT090 Computer Architecture

HT2 2009

DRAFT

Anders Ardö
Department of Electrical and Information Technology, EIT

Lund University

October 22, 2009

Contents

1 Exercises week 8
  1.1 Home Assignment 1
  1.2 Memory systems, Cache II

2 Exercises week 9
  2.1 Memory systems, Virtual Memory

3 Exercises week 10
  3.1 Storage systems, I/O

4 Exercises week 11
  4.1 Multiprocessors I

5 Exercises week 12
  5.1 Home Assignment 2 - online quiz
  5.2 Multiprocessors II

6 Exercises week 13
  6.1 Old exam 2003-12-17

7 Exercises week 14
  7.1 Questions and answers session

8 Brief answers
  8.1 Memory systems, Cache II
  8.2 Memory systems, Virtual Memory
  8.3 Storage systems, I/O
  8.4 Multiprocessors I
  8.5 Multiprocessors II
  8.6 Old exam 2003-12-17


1 Exercises week 8

1.1 Home Assignment 1

OPTIONAL!

However - An approved home assignment will give you 2 extra points on the exam.

Select one item from the list below, and describe it in as much detail as possible. You should characterize it along all relevant subjects (like ISA, pipeline type, stages, registers, register renaming, type of instruction issue, type of instruction commit, type of scheduling, branch prediction, superscalar, VLIW, caches, cache optimizations, bandwidth, size, number of transistors, etc.) covered by this course. Your report has to include all references used to find the information included.

• AMD Barcelona

• AMD Phenom II X4 965

• Intel Core 2

• Intel Core i7

• Intel/HP Itanium 2

• IBM Power6

• IBM Power7

• ARMv7

• ARM Cortex-A8

Home assignments are individual. You are not allowed to copy from each other. You should aim at producing 2-5 A4 pages of text (more if many figures are used) including references. The text can be either in English or Swedish.

You are allowed to cite 1-3 sentences with explicit reference to the source. Larger pieces of text copied from any source will be detected by the Urkund system, to which your home assignments will be submitted, and are of course not allowed. Any detected plagiarism will automatically generate a fail on the assignment.

The home assignment must at least contain:

• The title of the home assignment

• Your name

• The date it was written

All the sources you have used must be listed in a 'References' section at the end of your home assignment. You can use any number of sources to find the information you need to complete the assignment, for example:

• the course literature


• E-huset physical library (http://www.ehuset.lth.se/english/library/)

• LUB digital library (http://elin.lub.lu.se/)

• Internet/Web based sources

Home assignments must be computer-readable documents and sent, by email, to [email protected]

Home assignments HAVE to be in my mailbox no later than November 8, 2009.

1.2 Memory systems, Cache II

Exercise 1.1 What is a replacement algorithm? Why is such an algorithm needed with cache memories? With which of the following strategies is a replacement algorithm needed:

• Direct mapping

• Set-associative mapping

• Fully associative mapping

Exercise 1.2 Draw a schematic of how the following cache memory can be implemented: total size 16 kB, 4-way set-associative, block size 16 bytes, replacement algorithm LRU, uses copy-back.

The schematic should, among other things, show central MUXes, comparators and connections. It should clearly indicate how a physical memory address is translated to a cache position.

The greater the detail shown, the better.

Exercise 1.3 What is a write-through cache? Is it faster/slower than a write-back cache with respect to the time it takes for writing?

Exercise 1.4 Hennessy/Patterson, Computer Architecture, 4th ed., exercise 5.4


Exercise 1.5 Hennessy/Patterson, Computer Architecture, 4th ed., exercise 5.5

Exercise 1.6 Three ways with hardware and/or software to decrease the time a program spends on (data) memory accesses are:

• nonblocking caches with “hit under miss”

• hardware prefetching with 4–8 stream buffers

• software prefetching with nonfaulting cache prefetch

Explain, for each of these methods, how it affects:

a) miss rate

b) memory bandwidth to the underlying memory

c) number of executed instructions

Exercise 1.7 In systems with a write-through L1 cache backed by a write-back L2 cache instead of main memory, a merging write buffer can be simplified.

a) Explain how this can be done.

b) Are there situations where having a full write buffer (instead of the simple version you have just proposed) could be helpful?

Exercise 1.8 Briefly describe the three C's model.

Exercise 1.9 Explain where replacement policy fits into the three C's model, and explain why this means that misses caused by a replacement policy are "ignored" - or, more precisely, cannot in general be definitively classified - by the three C's model.


2 Exercises week 9

2.1 Memory systems, Virtual Memory

Exercise 2.1 As caches increase in size, blocks often increase in size as well.

a) If a large instruction cache has larger blocks, is there still a need for pre-fetching? Explain the interaction between pre-fetching and increased block size in instruction caches.

b) Is there a need for data pre-fetch instructions when data blocks get larger?

Exercise 2.2 Some memory systems handle TLB misses in software (as an exception), while others use hardware for TLB misses.

a) What are the trade-offs between these methods for handling TLB misses?

b) Will TLB miss handling in software always be slower than TLB misses in hardware? Explain!

c) Are there page table structures that would be difficult to handle in hardware, but possible in software? Are there any such structures that would be difficult for software to handle but easy for hardware to manage?

Exercise 2.3 The difficulty of building a memory system to keep pace with faster CPUs is underscored by the fact that the raw material for main memory is the same as that found in the cheapest computer. The performance difference is rather based on the arrangement.

a) List the four measures - in answer to the questions on block placement, block identification, block replacement and write strategy - that rule the hierarchical construction for virtual memory.

b) What is the main purpose of the Translation-Lookaside Buffer within the memory hierarchy? Give an appropriate set of construction rules and explain why.

c) Fill the following table with characteristic (typical) entries:

                        TLB   1st-level cache   2nd-level cache   Virtual memory
Block size (in bytes)
Block placement
Overall size

Exercise 2.4 Designing caches for out-of-order (OOO) superscalar CPUs is difficult for several reasons. Clearly, the cache will need to be non-blocking and may need to cope with several outstanding misses. However, the access pattern for OOO superscalar processors differs from that generated by in-order execution.

What are the differences, and how might they affect cache design for OOO processors?

Exercise 2.5 Consider the following three hypothetical, but not atypical, processors, which we run with the SPEC gcc benchmark.

1. A simple MIPS two-issue static pipe running at a clock rate of 4 GHz and achieving a pipeline CPI of 0.8. This processor has a cache system that yields 0.005 misses per instruction.

2. A deeply pipelined version of a two-issue MIPS processor with slightly smaller caches and a 5 GHz clock rate. The pipeline CPI of the processor is 1.0, and the smaller caches yield 0.0055 misses per instruction on average.

3. A speculative MIPS, superscalar with a 64-entry window. It achieves one-half of the ideal issue rate measured for this window size (9 instruction issues per cycle). This processor has the smallest caches, which leads to 0.01 misses per instruction, but it hides 25% of the miss penalty on every miss by dynamic scheduling. This processor has a 2.5 GHz clock.

Assume that the main memory time (which sets the miss penalty) is 50 ns. Determine the relative performance of these three processors.

Exercise 2.6 a) Give three arguments for larger pages in virtual memory, and one against.

b) Describe the concepts 'page', 'page fault', 'virtual address', 'physical address', 'TLB', and 'memory mapping' and how they are related.

c) How much memory does the page table, indexed by the virtual page number, take for a system using 32 bit virtual addresses, 4 KB pages, and 4 bytes per page table entry? The system has 512 MB of physical memory.

d) In order to save memory, sometimes inverted page tables are used. Briefly describe how they are structured. How much memory would inverted page tables take for the above system?

Exercise 2.7 A) Describe two cache memory optimization techniques that may improve hit performance (latency and throughput). For each technique, specify how it affects hit time and fetch bandwidth.

B) Describe two cache memory optimization techniques that may reduce miss rate, and define the miss type (compulsory, capacity, conflict) that is primarily affected by each technique.

C) Describe two cache memory optimization techniques that may reduce miss penalty.


3 Exercises week 10

3.1 Storage systems, I/O

For Hennessy/Patterson exercises 6.8 - 6.14.

Exercise 3.1 Hennessy/Patterson, Computer Architecture, 4th ed., exercise 6.8

Exercise 3.2 Hennessy/Patterson, Computer Architecture, 4th ed., exercise 6.9

Exercise 3.3 Hennessy/Patterson, Computer Architecture, 4th ed., exercise 6.10

Exercise 3.4 Hennessy/Patterson, Computer Architecture, 4th ed., exercise 6.14


For Hennessy/Patterson exercises 6.19 - 6.22.

Exercise 3.5 Hennessy/Patterson, Computer Architecture, 4th ed., exercise 6.19

Exercise 3.6 Hennessy/Patterson, Computer Architecture, 4th ed., exercise 6.21


Exercise 3.7 Hennessy/Patterson, Computer Architecture, 4th ed., exercise 6.22


4 Exercises week 11

4.1 Multiprocessors I

Exercise 4.1 There are two main varieties (classes) of hardware-based cache coherence protocols. Which are they, and what are the main differences, strengths and weaknesses?

Exercise 4.2 Briefly describe MIMD and SIMD computers, outlining the differences. Give examples of computers (or types of computers) from each class.

Exercise 4.3 Assume a directory-based cache coherence protocol. The directory currently has information that indicates that processor P1 has the data in "exclusive" mode.

If the directory now gets a request for the same cache block from processor P1, what could this mean? What should the directory controller do?

Exercise 4.4 Although it is widely believed that buses are the ideal way to interconnect small-scale multiprocessors, this may not always be the case. For example, increases in processor performance are lowering the processor count at which a more distributed implementation becomes attractive. Because a standard bus-based implementation uses the bus both for access to memory and for inter-processor coherency traffic, it has a uniform memory access time for both. In comparison, a distributed memory implementation may sacrifice on remote memory access, but it can have a much better local memory access time.

Consider the design of a multiprocessor with 16 processors. Each CPU is driven by a 150 MHz clock. Assume that a memory access takes 150 ns from the time the address is available from either the local processor or a remote processor until the first word is delivered. The bus is driven by a 50 MHz clock. Each bus transaction takes five bus clock cycles, each 20 ns in length, to perform arbitration, resolution, address, decode and acknowledge.

The detection of the miss and the generation of the memory request by the processor consists of three steps: detecting a miss in the primary on-chip cache; initiating a secondary (off-chip) cache access and detecting a miss in the secondary cache; and driving the complete address off-chip through the bus. This process takes about 40 processor clock cycles.

For the bus and memory component, the initial read request is one bus transaction of 5 bus cycles. The latency until memory is ready to transfer is 12 bus clock cycles. The reply will then transfer all 128 bytes of a cache block in one reply transaction, taking 5 bus clock cycles. The total is 22 bus clock cycles, which equals 66 processor clocks. It takes 16 bus cycles to reload the cache line, while restarting the pipeline takes 10 processor cycles. The total is 58 cycles.

a) How fast is a local access?

b) Assume that the interconnect is a 2-D grid with links that are 16 bits wide and clocked at 100 MHz, with a start-up time of five cycles for a message. Assume one clock cycle between nodes in the network, and ignore overhead in the messages and contention (i.e. assume that the network bandwidth is not the limit). Find the average remote memory access time, assuming a uniform distribution of remote requests.

Exercise 4.5 Nearly all computer manufacturers today offer multi-core microprocessors. This assignment focuses on concepts central to how thread-level parallelism can be exploited to offer higher computational performance.


a) The performance of a superscalar processor is limited by the amount of instruction-level parallelism in the program. In particular, when a load instruction must fetch data from memory, it can be difficult to find a sufficient number of independent instructions to execute while the data is being fetched from memory. Multithreading is a technique to do useful work while waiting for the data to be returned from memory. Explain how the following concepts can keep the processor busy doing useful work:

– Fine-grain multithreading

– Coarse-grain multi-threading

– Simultaneous multithreading

b) What structures in a superscalar processor must be replicated to realize a simultaneous multithreaded processor?

c) Flynn classifies computer architectures that leverage thread-level parallelism into four categories. Which ones?

d) Shared-memory multiprocessors are an important class of architectures that form the basis for multi-core microprocessors. The memory model is such that all processors access the same memory.

– What is cache coherence?

– How does an invalidation-based cache coherence protocol work?

– How is the lock primitive in a critical section implemented using test-and-set instructions?

Exercise 4.6 a) What structures in a superscalar processor must be replicated to realize a simultaneous multithreaded processor?

Shared-memory multiprocessors are an important class of architectures that form the basis for multi-core microprocessors. The memory model is such that all processors access the same memory.

b) What is cache coherence? Give an example of what can happen if cache coherence is missing.

c) A commonly used cache coherence protocol relies on snooping and invalidations. Below you find a list of requests that arrive at the cache coherence mechanism. Connect all requests, A-N, with the correct cache action and explanation, 1-14. Hint: each request matches exactly one action/explanation. Your answer should be a table listing all connections, like A-3, B-2, C-8, etc.


Request      Source      State of addressed cache block
Read hit     Processor   shared or modified               A
Read miss    Processor   invalid                          B
Read miss    Processor   shared                           C
Read miss    Processor   modified                         D
Write hit    Processor   modified                         E
Write hit    Processor   shared                           F
Write miss   Processor   invalid                          G
Write miss   Processor   shared                           H
Write miss   Processor   modified                         I
Read miss    Bus         shared                           J
Read miss    Bus         modified                         K
Invalidate   Bus         shared                           L
Write miss   Bus         shared                           M
Write miss   Bus         modified                         N

Type of cache action and function/explanation:

 1  normal hit    Write data in cache.
 2  coherence     Place invalidate on bus.
 3  coherence     Attempt to write block that is shared; invalidate the cache block.
 4  normal hit    Read data in cache.
 5  replacement   Address conflict miss: write back block, then place write miss on bus.
 6  normal miss   Place read miss on bus.
 7  replacement   Address conflict miss: write back block, then place read miss on bus.
 8  normal miss   Place write miss on bus.
 9  coherence     Attempt to write shared block; invalidate the block.
10  coherence     Attempt to write block that is exclusive elsewhere: write back the cache block and make its state invalid.
11  coherence     Attempt to share data: place cache block on bus and change state to shared.
12  replacement   Address conflict miss: place write miss on bus.
13  replacement   Address conflict miss: place read miss on bus.
14  no action     Allow memory to service read miss.


5 Exercises week 12

5.1 Home Assignment 2 - online quiz

OPTIONAL!

However - An approved quiz will give you 2 extra points on the exam.

Take the quiz available for Computer Architecture EIT090 at http://courses.eit.lth.se/. It will be open during weeks 12 and 13 (weeks 5 and 6 in HT2, 2009-11-23 – 12-06). You have to log in to be able to see it.

Every student has a username and password based on your official mail address. Example: for [email protected] it will be:

username: et01xy9
password: ePWt01xy9

If you have a problem, contact the course coordinator, Anders Ardö. You can take the quiz any number of times during the time mentioned above.

When you have logged in, choose ’Computer Architecture EIT090’ and click on the quiz.

Then you can start answering questions. After all questions are answered you can send in your answers by clicking on 'Submit all and finish'. You will get feedback saying how many correct answers you have. Both questions and numeric values in the quiz are selected randomly each time you try the quiz. Redo the test until you have at least 90 % correct in order to be approved.

5.2 Multiprocessors II

Exercise 5.1 Assume that we have a function for an application of the form F(i, p), which gives the fraction of time that exactly i processors are usable given that a total of p processors are available. This means that

    Σ_{i=1}^{p} F(i, p) = 1

Assume that when i processors are in use, the application runs i times faster. Rewrite Amdahl's Law so that it gives the speedup as a function of p for some application.

Exercise 5.2 One proposed solution for the problem of false sharing is to add a valid bit per word (or even for each byte). This would allow the protocol to invalidate a word without removing the entire block, letting a cache keep a portion of a block while another processor writes a different portion of the block. What extra complications are introduced into the basic snooping cache coherence protocol (see figure below) if this capability is included? Remember to consider all possible protocol actions.


Exercise 5.3 Some systems do not use multiprocessing for performance. Instead they run the same program in lockstep on multiple processors. What potential benefit is possible on such multiprocessors?

Exercise 5.4 When trying to perform detailed performance evaluation of a multiprocessor system, system designers use one of three tools: analytical models, trace-driven simulation, and execution-driven simulation. Analytical models use mathematical expressions to model the behavior of programs. Trace-driven simulations run the applications on a real machine and generate a trace, typically of memory operations. These traces can then be replayed through a cache simulator or a simulator with a simple processor model to predict the performance of the system when various parameters are changed. Execution-driven simulators simulate the entire execution, including maintaining an equivalent structure for the processor state, and so on. Discuss the accuracy/speed trade-offs between these approaches.


6 Exercises week 13

6.1 Old exam 2003-12-17

Exercise 6.1
Computer Architecture, EIT090
Final Exam, Department of Information Technology
17 December 2003, 8–13

The exam consists of a number of problems with a total of 50 points.
Grading: 20 p ≤ grade 3 < 30 p ≤ grade 4 < 40 p ≤ grade 5
Instructions:

• You may use a pocket calculator and an English dictionary on this exam, but no other aids

• Please start answering each problem on a new sheet – New problem =⇒ New sheet

• Write your name on each sheet of paper that you hand in – Name on each sheet

• Answers can be given in Swedish or English

• You must motivate your answers thoroughly. If there, in your opinion, is not enough information to solve a problem, you can make reasonable assumptions that you need in order to solve the problem. State these assumptions clearly!

GOOD LUCK :-)

Problem 1

Briefly (1-2 sentences) describe the following items/concepts concerning computer architecture: (10)

a) dominance

b) basic block

c) inverted page table

d) true sharing misses

e) register renaming

f) data dependency


g) way prediction

h) unified cache

i) sequential consistency

j) General purpose register instruction set architecture (GPR ISA)

Problem 2

a) Describe the concept "memory hierarchy", and state why it is important. State the function of each part, normally used hardware components, and what problems they solve (if any). (5)

b) Describe the problems of implementing a "Fetch-and-increment" instruction (it atomically increments the value of a memory location and saves the original value in a register) in a simple MIPS-like 5-stage pipeline. What is such an instruction used for? (5)

Problem 3

a) Define an expression (Speedup = ) for pipeline speedup as a function of:

– T_unpipelined: execution time for the non-pipelined unit.

– max(T_pipestage): the execution time of the slowest pipe-stage.

– T_latch: overhead time (setup time) for the storage element between pipe-stages.

(4)

Example (reconstructed figure): an unpipelined unit (IN → T_unpipelined → OUT) compared with a three-stage pipeline (IN → T_pipestage1 → latch → T_pipestage2 → latch → T_pipestage3 → OUT), where a common clock drives the latches.

b) Define an expression for pipeline speedup as a function of (use only these):

– noofstages: number of pipe stages (assume equal pipe stage execution time).

– branch freq (bf): relative frequency of branches in the program.

– branch penalty (bp): number of clock cycles lost due to a branch.

(3)

c) Give an example of a piece of assembly code that contains WAW, RAW and WAR hazards, and identify them. (Use, for example, the assembly instruction ADDD Rx, Ry, Rz, which stores (Ry+Rz) in Rx.) (3)


Problem 4

Compare three computers with these processors and characteristics:

A: a Celeron 2.4 GHz processor, 128 KByte cache, 128 byte blocks, copy-back (write-back) with an average of 30 % dirty blocks, price 650 SEK

B: a P4 2.4 GHz processor, 512 KByte cache, 128 byte blocks, copy-back (write-back) with an average of 35 % dirty blocks, price 1495 SEK

C: a P4 3.0 GHz processor, 512 KByte cache, 128 byte blocks, copy-back (write-back) with an average of 35 % dirty blocks, price 2595 SEK

The main application is program development, so the compiler gcc is considered to be the most used program and is therefore used as the performance indicator. Assume that the processors have the same architecture, and that the base CPI (for gcc) without effects from the above-mentioned cache (but including other caches and TLB) is 1.1.

Some statistics for gcc:

Cache size   Miss rate
512 KB       0.0075
256 KB       0.0116
128 KB       0.0321
64 KB        0.09

Instruction frequencies:

load     store    uncond branch   cond branch   int      fp
25.8 %   13.4 %   4.8 %           15.5 %        40.5 %   0 %

Main memory takes 50 ns to set up, and each transfer of 128 bits from main memory to the cache takes 4 ns. Assume that the memory system can handle memory at these speeds and widths.

Which of the three computers (A, B, C) has the best price/performance ratio? Motivate your answer thoroughly. (10)

Problem 5

a) There are two main varieties (classes) of hardware-based cache coherence protocols. Which are they and what are the main differences, strengths and weaknesses? (4)

b) Briefly describe MIMD and SIMD computers, outlining the differences. Give examples of computers (or types of computers) from each class. (4)

c) Use Amdahl's law to give a quantitative argument for keeping a computer system balanced in terms of relative performance (for example processor speed versus I/O speed) as technological and methodological development improves various sub-systems of a computer. (2)


7 Exercises week 14

7.1 Questions and answers session


8 Brief answers

8.1 Memory systems, Cache II

1.1 When a cache miss occurs, the controller must select a block to be replaced with the desired data. Three primary strategies for doing this are: random, least-recently used (LRU), and first-in first-out (FIFO).

A replacement algorithm is needed with set-associative and fully associative caches. For direct-mapped caches there is no choice; the block to be replaced is uniquely determined by the address.
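To make the distinction concrete, here is a minimal, illustrative Python sketch (not part of the original answer; all names are made up) of an LRU-managed set-associative lookup. With ways = 1 it degenerates to direct mapping, where eviction involves no choice at all:

    class SetAssociativeCache:
        """Toy model: each set is a list of tags, most-recently-used first."""
        def __init__(self, num_sets, ways):
            self.ways = ways
            self.sets = [[] for _ in range(num_sets)]

        def access(self, index, tag):
            s = self.sets[index]
            if tag in s:                # hit: move the tag to the MRU position
                s.remove(tag)
                s.insert(0, tag)
                return "hit"
            if len(s) == self.ways:     # set full: evict the LRU tag (the choice
                s.pop()                 # the replacement algorithm must make)
            s.insert(0, tag)
            return "miss"

    cache = SetAssociativeCache(num_sets=256, ways=4)
    for tag in (1, 2, 3, 1, 2, 3):
        print(cache.access(0, tag))     # three tags fit in a 4-way set:
                                        # three misses, then three hits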

1.2 See figure: the 32-bit physical address is split into TAG (bits 31-12), block index (bits 11-4) and byte select (bits 3-0). The block index selects one set; each of the four ways has a TAG store and a data store (with a dirty bit). Per-way comparators match the stored TAG against the address TAG to produce hit/miss, and a 4-to-1 MUX selects the data from the hitting way; the byte select then picks bytes within the block.

1.3 A cache write is called write-through when the information is written both to the block in the cache and to the block in the lower-level memory; when the information is written only to the block in the cache, it is called write-back. Write-back is the faster of the two, as writes occur at the speed of the cache memory, and multiple writes within a block require only one write to the lower-level memory.
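As a hedged, first-order illustration (not from the original answer; all numbers are assumed), the following Python sketch compares total write time under the two policies; a real write-through cache would typically also use a write buffer to hide the lower-level latency:

    def write_through_ns(stores, t_cache_ns, t_lower_ns):
        # every store is propagated to the lower level
        return stores * (t_cache_ns + t_lower_ns)

    def write_back_ns(stores, dirty_evictions, t_cache_ns, t_block_ns):
        # stores hit in the cache; whole blocks go down only on dirty eviction
        return stores * t_cache_ns + dirty_evictions * t_block_ns

    # 1000 stores, 1 ns cache write, 50 ns lower-level write,
    # 20 dirty-block evictions of 100 ns each (assumed numbers):
    print(write_through_ns(1000, 1, 50))      # 51000 ns
    print(write_back_ns(1000, 20, 1, 100))    # 3000 ns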

1.4 See ’Case Study Solutions’ at http://www.elsevierdirect.com/companion.jsp?ISBN=9780123704900

1.5 See ’Case Study Solutions’ at http://www.elsevierdirect.com/companion.jsp?ISBN=9780123704900

1.6 a) miss rate:

– non-blocking: does not affect the miss rate; the main thing happening is that the processor does other useful work while the miss is handled

– hardware prefetching with stream buffers: a hit in the stream buffer cancels the cache request, i.e. the memory reference is not counted as a miss, which means that the miss rate will decrease

– software prefetching: if correctly done, the miss rate will decrease

b) memory bandwidth:

– non-blocking: since the processor will have fewer stall cycles it will get a lower CPI, and consequently the requirements on memory bandwidth will increase

– hardware prefetch and software prefetch: prefetching is a form of speculation, which means that some of the memory traffic is unused, which in turn might increase the need for memory bandwidth

c) number of executed instructions: will be unchanged for non-blocking and hardware prefetch. Software prefetch will add the prefetch instructions, so the number of executed instructions will increase.

1.7 a) The merging write buffer links the CPU to the write-back L2 cache. Two CPU writes cannot merge if they are to different sets in L2. So, for each new entry into the buffer, a quick check on only those address bits that determine the L2 set number needs to be performed first. If there is no match in this 'screening' test, then the new entry is not merged. If there is a set number match, then all address bits can be checked for a definitive result.

b) As the associativity of L2 increases, the rate of false positive matches from the simplified check will increase, reducing performance.

1.8 The three C's model sorts the causes for cache misses into three categories:

• Compulsory – The very first access can never be in the cache and is therefore bound to generate a miss;

• Capacity – If the cache cannot contain all the blocks needed for a program, capacity misses may occur;

• Conflict – If the block placement strategy is set-associative or direct-mapped, conflict misses will occur because a block can be discarded and later retrieved if too many blocks map to its set.

1.9 The three C's give insight into the cause of misses, but this simple model has its limits; it gives you insight into average behavior but may not explain an individual miss. For example, changing cache size changes conflict misses as well as capacity misses, since a larger cache spreads out references to more blocks. Thus, a miss might move from a capacity miss to a conflict miss as cache size changes. Note that the three C's also ignore replacement policy, since it is difficult to model and since, in general, it is less significant. In specific circumstances the replacement policy can actually lead to anomalous behavior, such as poorer miss rates for larger associativity, which is contradictory to the three C's model.

8.2 Memory systems, Virtual Memory

2.1 a) Program basic blocks are often short (less than 10 instructions). Even program run blocks, sequences of instructions executed between branches, are not very long. Pre-fetching obtains the next sequential block, but program execution does not continue to follow locations PC, PC+4, PC+8, ..., for very long. So as blocks get larger, the probability that a program will not execute all instructions in the block, but rather take a branch to another instruction address, increases. Pre-fetching instructions benefits performance when the program continues straight-line execution into the next block. So as instruction cache blocks increase in size, pre-fetching becomes less attractive.

b) Data structures often comprise lengthy sequences of memory addresses. Program access of a data structure often takes the form of a sequential sweep. Large data blocks work well with such access patterns; pre-fetching is likely still of value due to the highly sequential access patterns. The efficiency of data pre-fetch can be enhanced through a suitable grouping of the data items, taking the block limitations into account. This is especially noteworthy when the data structure exceeds the cache size. Under such circumstances it becomes critically important to limit the amount of out-of-cache block references.

2.2 a) We can expect software to be slower due to the overhead of a context switch to the handler code, but the sophistication of the replacement algorithm can be higher for software, and a wider variety of virtual memory organizations can be readily accommodated. Hardware should be faster, but less flexible.

b) Factors other than whether miss handling is done in software or hardware can quickly dominate handling time. Is the page table itself paged? Can software implement a more efficient page table search algorithm than hardware? What about hardware TLB entry pre-fetching?

c) Page table structures that change dynamically would be difficult to handle in hardware but possible in software.

2.3 a) – As the miss penalty tends to be severe, one usually decides on a complex placement strategy; usually one opts for full associativity.

– To reduce address translation time, a cache is added to remember the most likely translations: the Translation-Lookaside Buffer.

– Almost all operating systems rely on replacement of the least-recently used (LRU) block, indicated by a reference bit, which is logically set whenever a page is addressed.

– Since the cost of an unnecessary access to the next-lower level is high, one usually includes a dirty bit. It allows blocks to be written to lower memory only if they have been altered since being read.

b) The main purpose of the TLB is to accelerate the address translation for reading/writing virtual memory. A TLB entry holds a portion of the virtual address, a physical page frame number, a protection field, a valid bit, a use bit and a dirty bit. The latter two are not always used. The size of the page table is inversely proportional to the page size; choosing a large page size allows larger caches with fast cache hit times and a small TLB. A small page size conserves storage, limiting the amount of internal fragmentation. Their combined effect can be seen in process start-up time, where a large page size lengthens invocation time but shortens page renewal times. Hence, the balance goes toward large pages in large computers and vice versa.

c)

                        TLB                 1st-level cache    2nd-level cache     Virtual memory
Block size (in bytes)   4-32                16-256             1-4k                4096-65,536
Block placement         Fully associative   2/4-way set assoc. 8/16-way set assoc. Direct mapped
Overall size            32-8,192 b          1 MB               2-16 MB             32 MB – 1 TB

2.4 Out-of-order (OOO) execution will change both the timing and the sequence of cache accesses with respect to in-order execution. Some specific differences and their effect on what cache design is most desirable are explored in the following.

Because OOO reduces data hazard stalls, the pace of cache access, both to instructions and data, will be higher than if execution were in order. Thus, the pipeline demand for available cache bandwidth is higher with OOO. This affects cache design in areas such as block size, write policy, and pre-fetching.

Block size has a strong effect on the delivered bandwidth between the cache and the next lower level in the memory hierarchy. A write-through write policy generally requires more bandwidth to the next lower memory level than does write-back, and use of a dirty bit further reduces the bandwidth demand of a write-back policy. Pre-fetching increases the bandwidth demand. Each of these cache design parameters - block size, write policy, and pre-fetching - is in competition with the pipeline for cache bandwidth, and OOO increases the competition. Cache design should adapt to this shift in bandwidth demand toward the pipeline.

Cache accesses for data and, because of exceptions, for instructions occur during execution. OOO execution will change the sequence of these accesses and may also change their pacing.

A change in sequence will interact with the cache replacement policy. Thus, a particular cache and replacement policy that performs well on a chosen application when execution of the superscalar pipeline is in order may perform differently - even quite differently - when execution is OOO.

If there are multiple functional units for memory access, then OOO execution may allow bunching multiple accesses into the same clock cycle. Thus, the instantaneous or peak memory access bandwidth from the execution portion of the superscalar can be higher with OOO.

Imprecise exceptions are another cause of change in the sequence of memory accesses from that of in-order execution. With OOO, some instructions from earlier in the program order may not have made their memory accesses, if any, at the time of the exception. Such accesses may become interleaved with instruction and data accesses of the exception-handling code. This increases the opportunity for capacity and conflict misses. So a cache design with size and/or associativity to deliver lower numbers of capacity and conflict misses may be needed to meet the demands of OOO.

2.5 First, we use the miss penalty and miss rate information to compute the contribution to CPI from cache misses for each configuration. We do this with the formula:

    Cache CPI = Misses per instruction * Miss penalty

We need to compute the miss penalties for each system:

    Miss penalty = Memory access time / Clock cycle time

The clock cycle times for the processors are 250 ps, 200 ps, and 400 ps, respectively. Hence, the miss penalties are:

    1: 50 ns / 250 ps = 200 cycles
    2: 50 ns / 200 ps = 250 cycles
    3: 0.75 * 50 ns / 400 ps = 94 cycles

Applying this for each cache:

    CPI_1 = 0.005 * 200 = 1.0
    CPI_2 = 0.0055 * 250 = 1.4
    CPI_3 = 0.01 * 94 = 0.94

We know the pipeline CPI contribution for everything but processor 3; its pipeline CPI is given by

    Pipeline CPI = 1 / Issue rate = 1 / (9 * 0.5) = 1 / 4.5 = 0.22

Now we find the CPI for each processor by adding the pipeline and cache CPI contributions:

    1: 0.8 + 1.0 = 1.8
    2: 1.0 + 1.4 = 2.4
    3: 0.22 + 0.94 = 1.16

Since this is the same architecture, we can compare instruction execution rates in millions of instructions per second (MIPS), clock rate / CPI, to determine relative performance:

    1: 4000 MHz / 1.8 = 2222 MIPS
    2: 5000 MHz / 2.4 = 2083 MIPS
    3: 2500 MHz / 1.16 = 2155 MIPS

In this example, the simple two-issue static superscalar looks best. In practice, performance depends on both the CPI and the clock rate assumptions.

2.6 a) For:

– the size of the page table is inversely proportional to the page size;

– larger page sizes allow larger caches using a virtually indexed, physically tagged direct-mapped cache;

– the number of TLB entries is restricted, so a larger page size means more memory mapped efficiently.

Against: larger pages lead to more wasted storage due to internal fragmentation.

b) In a virtual memory system: a virtual address is a logical address in the address space of a process. It is translated by a combination of hardware and software into a physical address, which accesses main memory. This process is called memory mapping. The virtual address space is divided into pages (blocks of memory). A page fault is an access to a page which is not in physical memory. The TLB, Translation-Lookaside Buffer, is a cache of address translations.

c) The page table takes (2^32 / 2^12) * 4 B = 2^22 B = 4 MByte.

d) An inverted page table is like a fully associative cache where each page table entry contains the physical address and, as tag, the virtual address. It takes (2^29 / 2^12) * (4 + 4) B = 2^20 B = 1 MByte.
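The arithmetic in c) and d) as a quick Python check (all parameters from the exercise):

    va_bits = 32                      # virtual address bits
    page = 4 * 2**10                  # 4 KB pages
    pte = 4                           # bytes per page table entry
    phys = 512 * 2**20                # 512 MB physical memory

    forward = (2**va_bits // page) * pte     # one entry per virtual page
    inverted = (phys // page) * (pte + 4)    # one entry per physical page,
                                             # plus a 4-byte virtual-address tag
    print(forward // 2**20, "MB")            # 4 MB
    print(inverted // 2**20, "MB")           # 1 MB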

2.7 See lecture slides, HP Ch. 5 and App C.


8.3 Storage systems, I/O

3.1 See ’Case Study Solutions’ at http://www.elsevierdirect.com/companion.jsp?ISBN=9780123704900

3.2 See ’Case Study Solutions’ at http://www.elsevierdirect.com/companion.jsp?ISBN=9780123704900

3.3 See ’Case Study Solutions’ at http://www.elsevierdirect.com/companion.jsp?ISBN=9780123704900

3.4 See ’Case Study Solutions’ at http://www.elsevierdirect.com/companion.jsp?ISBN=9780123704900

3.5 See ’Case Study Solutions’ at http://www.elsevierdirect.com/companion.jsp?ISBN=9780123704900

3.6 See ’Case Study Solutions’ at http://www.elsevierdirect.com/companion.jsp?ISBN=9780123704900

3.7 See ’Case Study Solutions’ at http://www.elsevierdirect.com/companion.jsp?ISBN=9780123704900

8.4 Multiprocessors I

4.1 • Snooping

– Status for a block is stored in every cache that has a copy of the block.

– Sends all requests to all processors (broadcast).

– Caches monitor (snoop) the shared memory bus to update status and take actions.

– Popular with single shared memory.

• Directory based

– Status for a block is stored in one location (the directory).

– Messages used to update status.

– Scales better than snooping.

– Popular with distributed shared memory.

4.2 MIMD = Multiple Instruction stream, Multiple Data stream: multiprocessors - symmetric shared-memory multiprocessors (SMP) with uniform memory access time (UMA) and bus interconnect.

SIMD = Single Instruction stream, Multiple Data stream: vector processors.

4.3 The problem illustrates the complexity of cache coherence protocols. In this case, this could mean that processor P1 evicted that cache block from its cache and immediately requested the block in subsequent instructions. Given that the write-back message is longer than the request message, with networks that allow out-of-order requests, the new request can arrive before the write-back arrives at the directory. One solution to this problem would be to have the directory wait for the write-back and then respond to the request. Alternatively, the directory can send out a negative acknowledge (NACK). Note that these solutions need to be thought out very carefully since they have the potential to lead to deadlocks based on the particular implementation details of the system. Formal methods are often used to check for races and deadlocks.


4.4 The question is to consider a design that is based on a mesh interconnect rather than on a bus. The idea behind such a design is that local accesses will be faster than in a pure shared-memory approach, since access to local memory does not need to go across a shared bus. Additionally, the cost of a remote access will be a function of start-up time and number of hops across the network rather than the time to acquire the bus.

a) The cost of local references is easy to compute. Local references require 40 clocks to detect the L2 miss, 66 clocks to deliver the data, and 58 clocks to reload the caches, yielding a total of 164 clocks.

b) For this part of the problem, we are asked to factor in the cost of references that need to travel across the mesh network and to compute the average remote memory access time (ARMAT). Since network clocks and processor clocks take different amounts of time, we refer to network clocks as 'nclk' and to processor clocks as 'pclk'. From the above we already know that the cost in a shared-memory design is 164 pclks.

For the case of the distributed memory, we make the simple assumption that each remote reference must make on average 1.5 hops in the X-direction and 1.5 hops in the Y-direction to get to its target node (the result depends on how one measures the average number of hops that a reference must make in the network). Hence the total average distance is 3 hops.

For the complete request, the following times must be added together: the time for the L2 miss to be recognized locally, the time for the address request to go across the network (assume only 32 bits are needed for this message), the time for the remote memory to respond (150 ns), the time for the data to return over the network (a 128-byte cache line), and the time for the caches to be reloaded. The time across the network is based on the number of hops and the size of the message. We are given that 2 bytes can be sent every nclk, and so the time through a switch is

    Time to send = (number of bytes) / (2 bytes per nclk)

Putting this into an equation yields ARMAT = 40 pclks + 5 nclks + 2 nclks + (3 + 1) nclks + 150 ns + 64 nclks + (3 + 1) nclks + 58 pclks = 1574 ns.

4.5 See Chapters 3 and 4 in the book.
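For the test-and-set sub-question in 4.5 d), here is an illustrative sketch (not from the book; Python stands in for hardware, with an internal lock modelling the atomicity that a real test-and-set instruction provides in a single instruction):

    import threading

    class TestAndSetLock:
        def __init__(self):
            self._flag = 0
            self._atomic = threading.Lock()   # models hardware atomicity

        def _test_and_set(self):
            # atomically: old = flag; flag = 1; return old
            with self._atomic:
                old, self._flag = self._flag, 1
                return old

        def acquire(self):
            while self._test_and_set() == 1:  # spin while another thread holds it
                pass

        def release(self):
            self._flag = 0                    # a plain store releases the lock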

4.6 a) To realize SMT, we need to have a per-thread renaming table and separate PC registers, and to provide the capability for instructions from multiple threads to commit.

b) Informally, cache coherence means that a value read from the memory system should reflect the latest write to that same memory location. For an example of what happens when cache coherence is missing, refer to the book, Figure 4.3 (page 206).


c) The correct connections:

A-4, B-6, C-13, D-7, E-1, F-2, G-8, H-12, I-5, J-14, K-11, L-9 (or 3), M-3 (or 9), N-10

Note: 3 and 9 are equivalent.

8.5 Multiprocessors II

5.1 The general form for Amdahl's Law (as shown on the inside front cover of this text) is

    Speedup = Execution time_old / Execution time_new

All that needs to be done to compute the formula for speedup in this multiprocessor case is to derive the new execution time. The exercise states that the portion of the original execution time that can use i processors is given by F(i, p). If we let Execution time_old be 1, then the relative time for the application on p processors is given by summing the times required for each portion of the execution time that can be sped up using i processors, where i is between 1 and p. This yields

    Execution time_new = Σ_{i=1}^{p} F(i, p) / i

Substituting this value for Execution time_new into the speedup equation makes Amdahl's Law a function of the available processors, p.
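A small Python sketch of the rewritten law, with a hypothetical profile F(i, p) (the profile is made up for illustration):

    def speedup(F, p):
        # Speedup(p) = 1 / sum_{i=1..p} F(i, p) / i, with Execution time_old = 1
        return 1.0 / sum(F(i, p) / i for i in range(1, p + 1))

    # Hypothetical profile: 20% of the time only 1 processor is usable,
    # 80% of the time all p processors are usable.
    def F(i, p):
        return 0.2 if i == 1 else (0.8 if i == p else 0.0)

    print(round(speedup(F, 8), 2))    # 1 / (0.2 + 0.8/8) = 3.33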

5.2 An obvious complication introduced by providing a valid bit per word is the need to match not only the tag of the block but also the offset within the block when snooping the bus. This is easy, involving just looking at a few more bits. In addition, however, the cache must be changed to support write-back of partial cache blocks. When writing back a block, only those words that are valid should be written to memory, because the contents of invalid words are not necessarily coherent with the system. Finally, given that the state machine of Figure 6.12 is applied at each cache block, there must be a way to allow this diagram to apply when state can differ from word to word within a block. The easiest way to do this would be to provide the state information of the figure for each word in the block. Doing so would require much more than one valid bit per word, though. Without replication of state information, the only solution is to change the coherence protocol slightly.

5.3 Executing the identical program on more than one processor improves the system's ability to tolerate faults. The multiple processors can compare results and identify a faulty unit by its mismatching results. Overall system availability is increased.


5.4 Analytical models can be used to derive high-level insight into the behavior of the system in a very short time. Typically, the biggest challenge is in determining the values of the parameters. In addition, while the results from an analytical model can give a good approximation of the relative trends to expect, there may be significant errors in the absolute predictions.

Trace-driven simulations typically have better accuracy than analytical models, but need more time to produce results. The advantage is that this approach can be fairly accurate when focusing on specific components of the system (e.g., the cache system, the memory system, etc.). However, this method does not model the impact of aggressive processors (mispredicted paths) and may not model the actual order of accesses with reordering. Traces can also be very large, often taking gigabytes of storage, and determining sufficient trace length for trustworthy results is important. It is also hard to generate representative traces from one class of machines that will be valid for all the classes of simulated machines, and it is harder to model synchronization on these systems without abstracting the synchronization in the traces to high-level primitives.

Execution-driven simulation models all the system components in detail and is consequently the most accurate of the three approaches. However, its speed of simulation is much slower than that of the other models. In some cases, the extra detail may not be necessary for the particular design parameter of interest.

8.6 Old exam 2003-12-17

6.1 Solution sketches
Version 1.0, 2003-12-19

Problem 1

Describe the following items/concepts concerning computer architecture: (10)

a) Optimizing compilers generate a control-flow graph with a number of nodes, including u and v. If all paths from the start to v include u, then u dominates v.

b) A straight-line code sequence with no branches in except at the entry and no branches out except at the exit.

c) A page table that uses hashing techniques to reduce the size of the page table, so that its length is equal to the number of physical pages in memory.

d) Misses arising from communication of data through the cache coherence mechanism.

e) A set of physical registers holds both architecturally visible registers and temporary data. During instruction issue, architectural registers are mapped to physical registers. Register renaming is used to get rid of WAR and WAW hazards.

f) An instruction j is data dependent on instruction i if i produces a result that may be used by j.

g) An attempt to predict which block the next cache access will go to. It allows early setup of the multiplexor that selects the cache block.

h) A cache that holds both instructions and data.

i) Sequential consistency requires that the result of any execution be the same as if the memory accesses executed by each processor were kept in program order.

j) A GPR ISA has only explicit operands, either memory locations or registers, as opposed to implicit operands like the stack top or an accumulator.

Problem 2

a) In real life, bigger memory is slower and faster memory is more expensive. We want to simultaneously increase the speed and decrease the cost. Speed is important because of the widening performance gap between CPU and memory. Size is important since applications and data sets are growing bigger. Use several types of memory with varying speeds, arranged in a hierarchy that is optimized with respect to the use of memory. Mapping functions provide address translations between levels.

Registers: internal ultrafast memory for the CPU; static registers

Cache: speeds up memory access; SRAM

Main memory: DRAM

VM: makes memory larger, disk; safe sharing of physical memory between processes, protection, relocation

(archival storage, backup on tape)

b) The instruction is used to implement synchronization in a multiprocessor. A MIPS-like pipeline assumes that an instruction only reads or writes memory, not both, and it is not easily modified to do that.

Problem 3

a) Speedup = T_unpipelined / (max(T_pipestage) + T_latch)

b) Speedup = noofstages / (1 + bf * bp)
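Both expressions are easy to evaluate; a short Python sketch with assumed numbers:

    def speedup_a(t_unpipelined, t_stage_max, t_latch):
        return t_unpipelined / (t_stage_max + t_latch)

    def speedup_b(noofstages, bf, bp):
        return noofstages / (1 + bf * bp)

    print(speedup_a(10.0, 2.0, 0.5))  # 4.0: a 10 ns unit split into 2 ns stages
    print(speedup_b(5, 0.2, 3))       # 3.125: 5 stages, 20% branches, 3-cycle penalty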


c) 1: MOV R3, R7
   2: LD R8, (R3)
   3: ADDI R3, R3, 4
   4: LD R9, (R3)
   5: BNE R8, R9, Loop

WAW: 1,3. RAW: 1,2; 1,3; 2,5; 3,4; 4,5. WAR: 2,3.

Problem 4

Texe = IC * (CPI_base + (accesses/instruction) * Miss rate * Miss penalty) * TC

Miss penalty = (1 + Fraction dirty) * Clock cycles per block transfer

Clock cycles per block transfer = int(setup / TC) + 8 * int(transfer / TC)

                       A. Celeron   B. P4 2.4   C. P4 3.0
Clock (GHz)            2.4000       2.4000      3.0000
Accesses/instruction   1.3920       1.3920      1.3920      (1 + 0.258 + 0.134)
Miss rate              0.0321       0.0075      0.0075
Fraction dirty         0.3000       0.3500      0.3500
Transfer (ns)          4.0000       4.0000      4.0000
Setup (ns)             50.0000      50.0000     50.0000
Setup clocks           121          121         151         int(setup / TC)
Transfer clocks        80           80          104         8 * int(transfer / TC)
CPI base               1.1000       1.1000      1.1000
CPI cache              11.6757      2.8329      3.5940      (accesses/instruction) * Miss rate * Miss penalty
Texe/IC (ns)           5.3232       1.6387      1.5647
Price (SEK)            650          1495        2595
Price/performance      3460.0909    2449.8652   4060.2841   Price / (1 / (Texe/IC))

Computer B has the lowest price/performance number and thus the best price/performance ratio. The sketch below reproduces the table.
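A Python sketch reproducing the table (values from the problem; rounding the setup and transfer clocks as int(x) + 1 is an assumption chosen to match the printed 121/151 and 80/104):

    from fractions import Fraction

    machines = {  # clock (GHz), miss rate, fraction dirty, price (SEK)
        "A": (Fraction("2.4"), 0.0321, 0.30, 650),
        "B": (Fraction("2.4"), 0.0075, 0.35, 1495),
        "C": (Fraction("3.0"), 0.0075, 0.35, 2595),
    }
    accesses = 1 + 0.258 + 0.134          # instruction fetch + loads + stores
    for name, (ghz, mr, dirty, price) in machines.items():
        tc = float(1 / ghz)               # clock cycle time in ns
        clocks = (int(50 * ghz) + 1) + 8 * (int(4 * ghz) + 1)  # setup + 8 x 128 bits
        penalty = (1 + dirty) * clocks    # dirty blocks are written back first
        t_instr = (1.1 + accesses * mr * penalty) * tc         # Texe/IC in ns
        print(name, round(t_instr, 4), round(price * t_instr, 2))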

Problem 5

a) – Snooping

∗ Status for a block is stored in every cache that has a copy of the block.
∗ Sends all requests to all processors (broadcast).
∗ Caches monitor (snoop) the shared memory bus to update status and take actions.
∗ Popular with single shared memory.

– Directory based

∗ Status for a block is stored in one location (the directory).
∗ Messages are used to update status.
∗ Scales better than snooping.
∗ Popular with distributed shared memory.

b) SIMD = Single Instruction stream, Multiple Data stream: vector processors.

MIMD = Multiple Instruction stream, Multiple Data stream: multiprocessors - symmetric shared-memory multiprocessors (SMP) with uniform memory access time (UMA) and bus interconnect.

c) – CPU performance increases 50% to 100% per year.

– I/O system performance is limited by mechanical delays.

– Amdahl's law: system speedup is limited by the slowest component:

∗ Assume 10% I/O.
∗ CPU relative speedup = 10 =⇒ system speedup = 5.
∗ CPU relative speedup = 100 =⇒ system speedup = 10.

A numeric check of these figures is sketched below.
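A quick Python check of the two bullet figures (10% I/O assumed, as above; the quoted values 5 and 10 are the rounded results):

    def system_speedup(io_fraction, cpu_speedup):
        # Amdahl: only the CPU fraction (1 - io_fraction) is sped up
        return 1 / (io_fraction + (1 - io_fraction) / cpu_speedup)

    print(round(system_speedup(0.10, 10), 2))    # 5.26  -> quoted as 5
    print(round(system_speedup(0.10, 100), 2))   # 9.17  -> quoted as 10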
