BEST OF ALL QUESTION PAPERS
All question papers, assignments, and tests
Sep 2010 - Question 1
Write the neat functional structure of an SIMD array processor with concurrent scalar processing in the control unit. (10 marks)
SIMD Supercomputers
The operational model is a 5-tuple (N, C, I, M, R).
N = number of processing elements (PEs).
C = set of instructions executed directly by the control unit (including scalar and flow-control instructions).
I = set of instructions broadcast to all PEs for parallel execution.
M = set of masking schemes used to partition PEs into enabled/disabled states.
R = set of data-routing functions that enable inter-PE communication through the interconnection network.
[Figure: operational model of an SIMD computer — a control unit drives an array of PEs, each with a processor P and local memory M, connected through an interconnection network.]
Sep 2010 - Question 2
Define clock rate, CPI, MIPS rate and throughput rate. (10 marks)
Clock Rate
Clock rate: the CPU is driven by a clock with a constant cycle time. The cycle time, τ, is usually given in nanoseconds. The inverse of the cycle time is the clock rate, f = 1/τ, in megahertz.
The clock rate is the rate in cycles per second (measured in hertz) or the frequency of the clock in any synchronous circuit, such as a central processing unit (CPU).
CPI (cycles per instruction): the average number of clock cycles needed to execute one instruction of a given program; it depends on the instruction mix.
MIPS (millions of instructions per second) rate: MIPS = f / (CPI × 10^6), where f is the clock rate.
Throughput rate: the number of programs a system can execute per unit time, Wp = f / (Ic × CPI), where Ic is the instruction count of the program.
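To make the definitions concrete, the following sketch evaluates them for an invented workload (a 500 MHz clock, a 2,000,000-instruction program, average CPI of 2.5); the numbers are illustrative only.

```python
# Illustrative numbers only: clock rate, instruction count, and CPI are made up.
f = 500e6            # clock rate in Hz (f = 1 / cycle_time)
cycle_time = 1 / f   # cycle time tau in seconds
Ic = 2_000_000       # instruction count of the program
CPI = 2.5            # average clock cycles per instruction

cpu_time = Ic * CPI * cycle_time     # T = Ic * CPI * tau
mips = f / (CPI * 1e6)               # MIPS rate = f / (CPI * 10^6)
throughput = 1 / cpu_time            # programs executed per second

print(f"CPU time   = {cpu_time * 1e3:.1f} ms")   # 10.0 ms
print(f"MIPS rate  = {mips:.0f}")                # 200
print(f"Throughput = {throughput:.0f} programs/s")
```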
Sep 2010 – Question 5
What are reservation tables in pipelining? Mention their advantages.
Reservation Table
Specifies the utilization pattern of successive stages.
A static linear pipeline follows a diagonal streamline through the table.
A task needs k clock cycles to flow through a k-stage pipeline.
One result emerges at each cycle if the tasks are independent of each other.
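As an illustration of how a reservation table is used, the sketch below encodes a hypothetical 3-stage, 5-cycle table and derives its forbidden latencies (initiation intervals that would cause two tasks to collide on a stage); the table itself is made up.

```python
# A hypothetical reservation table: rows are pipeline stages, columns are clock
# cycles, and a 1 marks the cycle in which a task occupies that stage.
table = [
    [1, 0, 0, 0, 1],   # stage S1 used in cycles 1 and 5
    [0, 1, 0, 1, 0],   # stage S2 used in cycles 2 and 4
    [0, 0, 1, 0, 0],   # stage S3 used in cycle 3
]

# Forbidden latencies are the distances between two marks in the same row:
# initiating a new task at such a latency collides with an earlier task.
forbidden = set()
for row in table:
    cycles = [c for c, used in enumerate(row) if used]
    for i in range(len(cycles)):
        for j in range(i + 1, len(cycles)):
            forbidden.add(cycles[j] - cycles[i])

print("forbidden latencies:", sorted(forbidden))   # here: [2, 4]
```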
Sep 2010 – Q 6
Explain TLB, Paging and Segmentation in virtual memory
Virtual Memory
To facilitate the use of memory hierarchies, the memory addresses normally generated by modern processors executing application programs are not physical addresses, but are rather virtual addresses of data items and instructions.
Physical addresses, of course, are used to reference the available locations in the real physical memory of a system.
Virtual addresses must be mapped to physical addresses before they can be used.
Mapping Efficiency
Efficient implementations are more difficult in multiprocessor systems where additional problems such as coherence, protection, and consistency must be addressed.
Virtual Memory Models (1)
Private Virtual Memory: in this scheme, each processor has a separate virtual address space, but all processors share the same physical address space.
Virtual Memory Models (2)
Shared Virtual Memory: all processors share a single virtual address space, with each processor being given a portion of it.
Memory Allocation
Both the virtual address space and the physical address space are divided into fixed-length pieces. In the virtual address space these pieces are called pages; in the physical address space they are called page frames. The purpose of memory allocation is to allocate pages of virtual memory using the page frames of physical memory.
Address Translation Mechanisms
Virtual-to-physical address translation requires the use of a translation map. The virtual address can be used with a hash function to locate the translation map (which is stored in the cache, an associative memory, or in main memory).
The translation map comprises a translation lookaside buffer, or TLB (usually in associative memory), and a page table (or tables). The virtual address is first sought in the TLB; if that search succeeds, no further translation is necessary. Otherwise, the page table(s) must be referenced to obtain the translation result.
If the virtual address cannot be translated to a physical address because the required page is not present in primary memory, a page fault is reported.
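A minimal sketch of this lookup order (TLB first, then page table, then page fault), with an invented page size and page-table contents:

```python
# Simplified virtual-to-physical translation with a TLB in front of a page table.
# PAGE_SIZE and the table contents are made up for illustration.
PAGE_SIZE = 4096

page_table = {0: 7, 1: 3, 2: 9}     # virtual page number -> page frame number
tlb = {0: 7}                        # small cache of recent translations

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                  # TLB hit: no further translation needed
        frame = tlb[vpn]
    elif vpn in page_table:         # TLB miss: reference the page table, refill the TLB
        frame = page_table[vpn]
        tlb[vpn] = frame
    else:                           # required page not present in primary memory
        raise RuntimeError("page fault: the OS must bring the page in from disk")
    return frame * PAGE_SIZE + offset

print(hex(translate(0x1234)))       # VPN 1 -> frame 3 -> 0x3234
```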
Define effective access time of a memory hierarchy
Hierarchical Memory Technology
Memory in a system is usually characterized as appearing at various levels (0, 1, …) in a hierarchy, with level 0 being the CPU registers and level 1 being the cache closest to the CPU.
Each level is characterized by five parameters:
access time t_i (round-trip time from the CPU to the ith level)
memory size s_i (number of bytes or words in the level)
cost per byte c_i
transfer bandwidth b_i (rate of transfer between levels)
unit of transfer x_i (grain size for transfers)
Memory Generalities
It is almost always the case that memories at lower-numbered levels, when compared to those at higher-numbered levels, are faster to access, are smaller in capacity, are more expensive per byte, have a higher bandwidth, and have a smaller unit of transfer.
In general, then, t_{i-1} < t_i, s_{i-1} < s_i, c_{i-1} > c_i, b_{i-1} > b_i, and x_{i-1} < x_i.
Hit Ratios
When a needed item (instruction or data) is found in the level of the memory hierarchy being examined, it is called a hit. Otherwise (when it is not found), it is called a miss (and the item must be obtained from a lower level in the hierarchy).
The hit ratio, h_i, for M_i is the probability (between 0 and 1) that a needed data item is found when it is sought in level M_i.
The miss ratio is then just 1 − h_i. We assume h_0 = 0 and h_n = 1.
Access Frequencies
The access frequency f_i to level M_i is f_i = h_i (1 − h_1)(1 − h_2) … (1 − h_{i−1}).
Note that f_1 = h_1, and the frequencies sum to one: Σ_{i=1..n} f_i = 1.
Effective Access Times
There are different penalties associated with misses at different levels in the memory hierarchy. A cache miss is typically 2 to 4 times as expensive as a cache hit (assuming success at the next level). A page fault (miss) is 3 to 4 orders of magnitude as costly as a page hit.
The effective access time of a memory hierarchy can be expressed as
T_eff = Σ_{i=1..n} f_i t_i = h_1 t_1 + (1 − h_1) h_2 t_2 + (1 − h_1)(1 − h_2) h_3 t_3 + … + (1 − h_1)(1 − h_2) … (1 − h_{n−1}) t_n
The first few terms in this expression dominate, but the effective access time is still dependent on program behavior and memory design choices.
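A small sketch that evaluates this formula for an invented three-level hierarchy (cache, main memory, disk); the hit ratios and access times below are illustrative, not measured values.

```python
# Effective access time for a hypothetical 3-level hierarchy; values are invented.
h = [0.95, 0.9999, 1.0]        # hit ratio h_i of each level (last level always hits)
t = [10e-9, 100e-9, 5e-3]      # round-trip access time t_i of each level (seconds)

T_eff = 0.0
miss_so_far = 1.0              # running product (1 - h_1)(1 - h_2)...(1 - h_{i-1})
for h_i, t_i in zip(h, t):
    f_i = miss_so_far * h_i    # access frequency of level i
    T_eff += f_i * t_i
    miss_so_far *= (1 - h_i)

print(f"T_eff = {T_eff * 1e9:.1f} ns")   # roughly 39.5 ns for these numbers
```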
Sep 2010 – Q8 (10 marks)
With a suitable example, distinguish between hardware and software parallelism
Hardware Parallelism
Hardware parallelism is defined by machine architecture and hardware multiplicity.
It can be characterized by the number of instructions that can be issued per machine cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor. Conventional processors are one-issue machines.
Examples. Intel i960CA is a three-issue processor (arithmetic, memory access, branch). IBM RS-6000 is a four-issue processor (arithmetic, floating-point, memory access, branch).
A machine with n k-issue processors should be able to handle a maximum of nk threads simultaneously.
Software Parallelism
Software parallelism is defined by the control and data dependence of programs, and is revealed in the program’s flow graph.
It is a function of algorithm, programming style, and compiler optimization.
Mismatch between software and hardware parallelism - 1
[Figure: maximum software parallelism (L = load, X/+/− = arithmetic) — four loads in cycle 1, two multiplies in cycle 2, and the add and subtract producing A and B in cycle 3.]
Mismatch between software and hardware parallelism - 2
[Figure: the same problem scheduled on a two-issue superscalar processor — the limited hardware parallelism stretches execution to seven cycles.]
Sep 2010 – Q 9 (10 marks)
What is grain packing? With a suitable example, write a program graph before and after grain packing?
Grain Packing and Scheduling
Two questions:
How can I partition a program into parallel “pieces” to yield the shortest execution time?
What is the optimal size of parallel grains?
There is an obvious tradeoff between the time spent scheduling and synchronizing parallel grains and the speedup obtained by parallel execution.
One approach to the problem is called “grain packing.”
Program Graphs and Packing
A program graph is similar to a dependence graph.
Nodes = { (n, s) }, where n = node name and s = size (larger s = larger grain size).
Edges = { (v, d) }, where v = the variable being “communicated” and d = the communication delay.
Packing two (or more) nodes produces a node with a larger grain size and possibly more edges to other nodes.
Packing is done to eliminate unnecessary communication delays or to reduce overall scheduling overhead (a small sketch follows the scheduling notes below).
Scheduling
A schedule is a mapping of nodes to processors and start times such that communication delay requirements are observed, and no two nodes are executing on the same processor at the same time.
Some general scheduling goals:
Schedule all fine-grain activities in a node to the same processor to minimize communication delays.
Select grain sizes for packing to achieve better schedules for a particular parallel machine.
Mar 2010 Q1 (10 marks)
Explain program partitioning and scheduling with example.
Program Partitioning & Scheduling
The size of the parts or pieces of a program that can be considered for parallel execution can vary.
The sizes are roughly classified using the term “granule size,” or simply “granularity.”
The simplest measure, for example, is the number of instructions in a program part.
Grain sizes are usually described as fine, medium or coarse, depending on the level of parallelism involved.
Latency
Latency is the time required for communication between different subsystems in a computer.
Memory latency, for example, is the time required by a processor to access memory.
Synchronization latency is the time required for two processes to synchronize their execution.
Computational granularity and communication latency are closely related.
Levels of Parallelism
[Figure: levels of parallelism —
Fine grain: instructions or statements; non-recursive loops or unfolded iterations.
Medium grain: procedures, subroutines, tasks, or coroutines.
Coarse grain: subprograms, job steps or related parts of a program; jobs or programs.
Finer-grain levels offer a higher degree of parallelism but also increasing communication demand and scheduling overhead.]
Instruction Level Parallelism
This fine-grained, or smallest-granularity, level typically involves fewer than 20 instructions per grain. The number of candidates for parallel execution varies from 2 to thousands, with the average level of parallelism being about five instructions or statements.
Advantages:
There are usually many candidates for parallel execution.
Compilers can usually do a reasonable job of finding this parallelism.
Loop-level Parallelism
Typical loop has less than 500 instructions.
If a loop operation is independent between iterations, it can be handled by a pipeline, or by a SIMD machine.
Most optimized program construct to execute on a parallel or vector machine
Some loops (e.g. recursive) are difficult to handle.
Loop-level parallelism is still considered fine grain computation.
Procedure-level Parallelism
Medium-sized grain; usually less than 2000 instructions.
Detection of parallelism is more difficult than with smaller grains; interprocedural dependence analysis is difficult and history-sensitive.
Communication requirement less than instruction-level
SPMD (single procedure multiple data) is a special case
Multitasking belongs to this level.
Subprogram-level Parallelism
Job step level; a grain typically has thousands of instructions; medium- or coarse-grain level.
Job steps can overlap across different jobs.
Multiprogramming is conducted at this level.
No compilers are available to exploit medium- or coarse-grain parallelism at present.
Job or Program-Level Parallelism
Corresponds to the execution of essentially independent jobs or programs on a parallel computer.
This is practical for a machine with a small number of powerful processors, but impractical for a machine with a large number of simple processors (since each processor would take too long to process a single job).
Summary
Fine-grain exploited at instruction or loop levels, assisted by the compiler.
Medium-grain (task or job step) requires programmer and compiler support.
Coarse-grain relies heavily on effective OS support.
Shared-variable communication used at fine- and medium-grain levels.
Message passing can be used for medium- and coarse-grain communication, but fine grain really needs a better technique because of its heavier communication requirements.
Communication Latency
Balancing granularity and latency can yield better performance.
Various latencies attributed to machine architecture, technology, and communication patterns used.
Latency imposes a limiting factor on machine scalability. Ex. Memory latency increases as memory capacity increases, limiting the amount of memory that can be used with a given tolerance for communication latency.
Interprocessor Communication Latency
Needs to be minimized by system designer
Affected by signal delays and communication patterns
Ex. n communicating tasks may require n (n - 1)/2 communication links, and the complexity grows quadratically, effectively limiting the number of processors in the system.
Mar 2010 Q2 (10 Marks)
Write notes on the following:
Parallelism profile of programs
Harmonic mean performance
Degree of Parallelism
The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); this can vary over time.
DOP assumes an infinite number of processors are available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments. Other resources may impose limiting conditions.
A plot of DOP vs. time is called a parallelism profile.
Example Parallelism Profile
[Figure: DOP plotted against time over the interval (t1, t2), with the average parallelism shown as a horizontal line.]
Average Parallelism
The work performed is the area under the parallelism profile, W = ∫ from t1 to t2 of DOP(t) dt, and the average parallelism over the interval is A = W / (t2 − t1).
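For a profile given as a list of (DOP, duration) segments, the average parallelism can be computed directly from that definition; the sketch below uses invented values.

```python
# Average parallelism from a discretized parallelism profile: DOP processors
# are busy for each duration; the numbers below are invented for illustration.
profile = [(1, 2.0), (4, 3.0), (2, 1.0)]   # (DOP, duration in seconds)

W = sum(dop * d for dop, d in profile)     # total work = integral of DOP(t) dt
T = sum(d for _, d in profile)             # observation interval t2 - t1
A = W / T                                  # average parallelism

print(f"W = {W} processor-seconds over {T} s, A = {A:.2f}")   # A = 2.67
```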
Available Parallelism
Various studies have shown that the potential parallelism in scientific and engineering calculations can be very high (e.g. hundreds or thousands of instructions per clock cycle).
But in real machines, the actual parallelism is much smaller (e.g. 10 or 20).
Mar 2010 Q8 (10 Marks)
Explain cache performance issues.
Cache performance issues: cycle counts, hit ratios, effect of block size, effects of set number, and others.
Coherence Strategies
Write-through: as soon as a data item in M_i is modified, an immediate update of the corresponding data item(s) in M_{i+1}, M_{i+2}, …, M_n is required. This is the most aggressive (and expensive) strategy.
Write-back: the data item in M_{i+1} corresponding to a modified item in M_i is not updated until it (or the block/page/etc. in M_i that contains it) is replaced or removed. This is the most efficient approach, but cannot be used (without modification) when multiple processors share M_{i+1}, …, M_n.
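A toy sketch contrasting the two update policies on a two-level hierarchy; this models only the timing of updates between levels, not a real cache implementation.

```python
# Write-through updates the lower level immediately; write-back defers the
# update until the modified item is replaced or removed.
class Level:
    def __init__(self, lower=None, write_back=False):
        self.data, self.dirty = {}, set()
        self.lower, self.write_back = lower, write_back

    def write(self, addr, value):
        self.data[addr] = value
        if self.write_back:
            self.dirty.add(addr)                     # defer the update to the next level
        elif self.lower:
            self.lower.write(addr, value)            # write-through: update at once

    def evict(self, addr):
        if addr in self.dirty and self.lower:
            self.lower.write(addr, self.data[addr])  # write back only on replacement
            self.dirty.discard(addr)
        self.data.pop(addr, None)

memory = Level()
cache = Level(lower=memory, write_back=True)
cache.write(0x10, 42)        # under write-back, main memory is still stale here
cache.evict(0x10)            # replacement pushes the modified value down
print(memory.data)           # {16: 42}
```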
Cycle count: the number of machine cycles needed for cache access, update, and coherence.
Hit ratio: how effectively the cache can reduce the overall memory access time.
Program-trace-driven simulation: presents snapshots of program behavior and cache responses.
Analytical modeling: provides insight into the underlying processes.
Cycle Counts
Cache speed affected by underlying static or dynamic RAM technology, organization, and hit ratios
Write-thru/write-back policies affect count
Cache size, block size, set number, and associativity affect count
Directly related to hit ratio
Hit Ratio
Affected by cache size and block size.
Increases with increasing cache size.
Limited cache size, initial loading, and changes in locality prevent a 100% hit ratio.
Effect of Block Size
With fixed cache size, block size has impact
As block size increases, hit ratio improves due to spatial locality
Peaks at optimum block size, then decreases
If too large, many words in cache not used
Cache performance
Mar 2010 Q6
Superpipelined Design
Degree n: pipeline cycle time = 1/n of the base cycle.
A fixed-point addition takes one cycle in a base scalar processor, but n short cycles in a superpipelined processor.
Issue rate = 1, issue latency = 1/n, ILP = n.
Requires high-speed clocking.
Superpipelined Performance
For N instructions on a degree-n superpipelined machine with k stages:
T(1, n) = k + (N − 1)/n
S(1, n) = n(k + N − 1) / (nk + N − 1)
Superpipelined Superscalar
Degree (m, n): executes m instructions every cycle, with pipeline cycle time = 1/n of the base cycle.
Instruction issue latency = 1/n; ILP = mn instructions.
Superscalar Superpipelined Performance
For N independent instructions on a degree-(m, n) machine:
T(m, n) = k + (N − m)/(mn)
S(m, n) = mn(k + N − 1) / (mnk + N − m)
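Evaluating these formulas for some illustrative values of k, N, m, and n shows the speedups approaching n and mn, respectively, as N grows; the numbers below are arbitrary.

```python
# Timing and speedup formulas for a degree-(m, n) superscalar superpipelined
# machine, evaluated with illustrative values (k stages, N instructions).
def T(m, n, k, N):
    return k + (N - m) / (m * n)              # execution time in base cycles

def S(m, n, k, N):
    return (m * n * (k + N - 1)) / (m * n * k + N - m)

k, N = 4, 1000
print(f"T(1,3) = {T(1, 3, k, N):.1f} cycles, S(1,3) = {S(1, 3, k, N):.2f}")   # -> n = 3 for large N
print(f"T(2,3) = {T(2, 3, k, N):.1f} cycles, S(2,3) = {S(2, 3, k, N):.2f}")   # -> mn = 6 for large N
```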
Design Approaches
Superpipelined: emphasizes temporal parallelism; needs faster transistors; the design must minimize the effects of clock skew.
Superscalar: depends on spatial parallelism; needs more transistors; a better match for CMOS technology.
Mar 2010 Q4
Explain buses and interfaces in multiprocessor system.
Generalized Multiprocessor System
[Figure: generalized multiprocessor system — each processor with local memory and private cache, connected to shared-memory modules through the IPMN, to I/O through the PION, and optionally to other processors through the IPCN.]
Each processor Pi is attached to its own local memory and private cache.
Multiple processors connected to shared memory through interprocessor memory network (IPMN).
Processors share access to I/O and peripherals through processor-I/O network (PION).
Both IPMN and PION are necessary in a shared-resource multiprocessor.
An optional interprocessor communication network (IPCN) can permit processor communication without using shared memory.
Interconnection Network Choices
Timing:
Synchronous – controlled by a global clock.
Asynchronous – uses handshaking or interlock mechanisms.
Switching method:
Circuit switching – a pair of communicating devices controls the path for the entire duration of the data transfer.
Packet switching – large data transfers are broken into smaller pieces, each of which can compete for use of the path.
Network control:
Centralized – a global controller receives and acts on requests.
Distributed – requests are handled by local devices independently.
Digital Buses
Digital buses are the fundamental interconnects adopted in most commercial multiprocessor systems with less than 100 processors.
The principal limitation to the bus approach is packaging technology.
Complete bus specifications include logical, electrical and mechanical properties, application profiles, and interface requirements.
Bus Systems
A bus system is a hierarchy of buses connecting various system and subsystem components.
Each bus has a complement of control, signal, and power lines.
There is usually a variety of buses in a system:
Local bus – (usually integral to a system board) connects various major system components (chips).
Memory bus – used within a memory board to connect the interface, the controller, and the memory cells.
Data bus – might be used on an I/O board or VLSI chip to connect various components.
Backplane – like a local bus, but with connectors to which other boards can be attached.
Hierarchical Bus Systems
There are numerous ways in which buses, processors, memories, and I/O devices can be organized.
One organization has processors (and their caches) as leaf nodes in a tree, with the buses (and caches) to which these processors connect forming the interior nodes.
This generic organization, with appropriate protocols to ensure cache coherency, can model most hierarchical bus organizations.
Bridges
The term bridge is used to denote a device that is used to connect two (or possibly more) buses.
The interconnected buses may use the same standards, or they may be different (e.g. PCI and ISA buses in a modern PC).
Bridge functions include:
Communication protocol conversion
Interrupt handling
Serving as cache and memory agents
Cache addressing model
Cache Addressing Models
Most systems use private caches for each processor.
There is an interconnection network between the caches and main memory.
Caches are addressed using either a physical address or a virtual address.
Physical Address Caches
Cache is indexed and tagged with the physical address
Cache lookup occurs after address translation in TLB or MMU (no aliasing)
After cache miss, load a block from main memory
Use either write-back or write-through policy
Physical Address Caches
Advantages:
No cache flushing
No aliasing problems
Simple design
Requires little intervention from the OS kernel
Disadvantage:
Slowdown in accessing the cache until the MMU/TLB finishes translation
Physical Address Models
Virtual Address Caches
The cache is indexed or tagged with the virtual address.
Cache access and MMU translation/validation are performed in parallel.
The physical address is saved in the tags for write-back.
More efficient access to the cache.
Virtual Address Model
Aliasing Problem
Different logically addressed data have the same index/tag in the cache
Confusion if two or more processors access the same physical cache location
Flush cache when aliasing occurs, but leads to slowdown
Apply special tagging with a process key or with a physical address
Block Placement Schemes
Performance depends upon cache access patterns, organization, and management policy
Blocks in the cache are called block frames B_i (i ≤ m, m = 2^r); blocks in main memory are B_j (j ≤ n, n = 2^s), with m << n.
Each block has b = 2^w words, so the cache holds a total of m·b = 2^(r+w) words and main memory holds n·b = 2^(s+w) words.
Direct Mapping Cache
Direct mapping of n/m memory blocks to one block frame in the cache.
Placement uses a modulo-m function: B_j → B_i if i = j mod m.
There is a unique block frame B_i that each B_j can load into.
This is the simplest organization to implement.
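A small sketch of the address split implied by direct mapping; the cache and block sizes are invented for the example.

```python
# Direct mapping: memory block j can only go to cache frame i = j mod m.
m = 2 ** 4          # 16 cache block frames  (m = 2^r)
b = 2 ** 2          # 4 words per block      (b = 2^w)

def direct_map(word_addr):
    block = word_addr // b           # memory block number j
    frame = block % m                # the unique frame i = j mod m
    tag = block // m                 # identifies which of the n/m blocks is resident
    offset = word_addr % b
    return tag, frame, offset

print(direct_map(0x1A7))             # (6, 9, 3): tag 6, frame 9, word offset 3
```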
Direct Mapping Cache
Direct Mapping Cache
Advantages:
Simple hardware
No associative search
No page replacement policy needed
Lower cost
Higher speed
Disadvantages:
Rigid mapping
Poorer hit ratio
Prohibits parallel virtual address translation
Requires a larger cache size with more block frames to avoid contention
Fully Associative Cache
Each block in main memory can be placed in any of the available block frames.
An s-bit tag is needed in each cache block (s > r).
An m-way associative search requires the tag to be compared with all cache block tags.
An associative memory is used to achieve a parallel comparison with all tags concurrently.
Fully Associative Cache
Fully Associative Caches
Advantages:
Offers the most flexibility in mapping cache blocks
Higher hit ratio
Allows a better block replacement policy with reduced block contention
Disadvantages:
Higher hardware cost
Only moderate-size caches are practical
Expensive search process
Set Associative Caches
In a k-way set-associative cache, the m cache block frames are divided into v = m/k sets, with k blocks per set.
Each set is identified by a d-bit set number.
The tag is compared with the k tags within the identified set only.
B_j → B_f ∈ S_i if j mod v = i.
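The same address split under k-way set-associative mapping, again with invented sizes; the lookup only searches the k tags of the selected set.

```python
# k-way set-associative mapping: block j maps to set i = j mod v (v = m/k sets)
# and may occupy any of the k frames in that set. Sizes are illustrative.
m, k, b = 16, 4, 4                  # frames, ways per set, words per block
v = m // k                          # number of sets

def lookup(cache_sets, word_addr):
    block = word_addr // b
    set_idx = block % v             # d-bit set number
    tag = block // v
    # compare the tag with the k tags within the identified set only
    return tag in cache_sets[set_idx]

cache_sets = [set() for _ in range(v)]
cache_sets[(0x1A7 // b) % v].add((0x1A7 // b) // v)   # pretend the block was loaded
print(lookup(cache_sets, 0x1A7))    # True
```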
Sector Mapping Cache
Partition the cache and main memory into fixed-size sectors, then use a fully associative search.
Use sector tags for the search and a block field within the sector to find the block.
Only the missing block is loaded on a miss.
The ith block in a sector is placed into the ith block frame of the destined sector frame.
A valid/invalid bit is attached to each block frame.
Backplane bus system
Backplane Bus Systems
The system bus operates on a contention basis: only one device is granted access to the bus at a time.
The effective bandwidth available to each processor is inversely proportional to the number of contending processors.
Simple and low cost; suitable for small systems (4 – 16 processors).
Backplane Bus Specification
Interconnects processors, data storage, and I/O devices.
Must allow communication between devices: timing protocols for arbitration and operational rules for orderly data transfers.
Signal lines are grouped into several buses.
Backplane Multiprocessor System
Data Transfer Bus
Composed of data, address, and control lines.
Address lines broadcast the data and device address; their number is proportional to the log of the address space size.
The number of data lines is proportional to the memory word length.
Control lines specify read/write, timing, and bus error conditions.
Bus Arbitration and Control
Arbitration: the process of assigning control of the DTB; the requester is the master and the receiving end is the slave.
Interrupt lines support prioritized interrupts.
Dedicated lines synchronize parallel activities among processor modules.
Utility lines provide periodic timing and coordinate the power-up/down sequences.
A bus controller board houses the control logic.
Functional Modules
Arbiter: the functional module that performs arbitration.
Bus timer: measures the time for data transfers.
Interrupter: generates interrupt requests and provides status/ID to the interrupt handler.
Location monitor: monitors data transfers.
Power monitor: monitors the power source.
System clock driver: provides the clock timing signal on the utility bus.
Board interface logic: matches signal line impedance, propagation time, and termination values.
Physical Limitations
Electrical, mechanical, and packaging limitations restrict # of boards
Can mount multiple backplane buses on the same backplane chassis
Difficult to scale a bus system due to packaging constraints
Aug 2009 Q2
How do you classify pipeline processors? Give examples.
Classification of Pipeline Processors
Arithmetic pipelining: the arithmetic logic units of a computer can be segmented for pipeline operations in various data formats. Well-known arithmetic pipeline examples are the four-stage pipes used in the Star-100.
Instruction pipelining: the execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. This technique is also known as instruction lookahead.
Processor pipelining: this refers to the pipeline processing of the same data stream by a cascade of processors, each of which processes a specific task. The data stream passes through the first processor, with results stored in a memory block that is also accessible by the second processor. The second processor then passes the refined results to the third, and so on.
Next question part:
Show that a linear pipeline of k stages can be at most k times faster than an equivalent non-pipelined processor.
Speedup and Efficiency
A k-stage pipeline processes n tasks in k + (n − 1) clock cycles: k cycles for the first task and n − 1 cycles for the remaining n − 1 tasks.
Total time to process n tasks (with clock period τ):
Tk = [k + (n − 1)] τ
For the non-pipelined processor:
T1 = n k τ
Speedup factor:
Sk = T1 / Tk = n k τ / ([k + (n − 1)] τ) = n k / [k + (n − 1)]
As n → ∞, Sk → k, so a linear pipeline is at most k times faster than the non-pipelined processor.
Efficiency and Throughput
Efficiency of the k-stage pipeline:
Ek = Sk / k = n / [k + (n − 1)]
Pipeline throughput (the number of tasks completed per unit time; note the equivalence to IPC):
Hk = n / ([k + (n − 1)] τ) = n f / [k + (n − 1)]
Pipeline Performance: Example
A task has 4 subtasks with stage times t1 = 60, t2 = 50, t3 = 90, and t4 = 80 ns; latch delay = 10 ns.
Pipeline cycle time = 90 + 10 = 100 ns (set by the slowest stage plus the latch delay).
Non-pipelined execution time per task = 60 + 50 + 90 + 80 = 280 ns.
The speedup for this case approaches 280/100 = 2.8.
Pipeline time for 1000 tasks = [1000 + 4 − 1] × 100 ns = 1003 × 100 ns.
Sequential time = 1000 × 280 ns.
Throughput = 1000 / (1003 × 100 ns) tasks per second.
What is the problem here? How could performance be improved?
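The arithmetic of this example, reproduced as a short sketch with the same numbers as above:

```python
# A 4-stage pipeline with a 100 ns cycle (90 ns slowest stage + 10 ns latch)
# versus 280 ns per task without pipelining, for 1000 tasks.
k, tau = 4, 100e-9          # stages, pipeline cycle time
t_seq = 280e-9              # non-pipelined time per task
n = 1000                    # tasks

T_pipe = (k + n - 1) * tau              # 1003 * 100 ns
T_nonpipe = n * t_seq                   # 1000 * 280 ns
speedup = T_nonpipe / T_pipe            # ~2.79, approaching 2.8 as n grows
throughput = n / T_pipe                 # tasks per second

print(f"speedup = {speedup:.2f}, throughput = {throughput / 1e6:.2f} M tasks/s")
```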
Aug 2009 Q4
What do you understand by loosely and tightly coupling of processors in a multi-processor system? Explain the salient features of each type. Discuss the processor characteristics for multiprocessor system.
SHARED MEMORY MULTIPROCESSORS
Characteristics
All processors have equally direct access to one large memory address space
Example systems
- Bus and cache-based systems: Sequent Balance, Encore Multimax
- Multistage IN-based systems: Ultracomputer, Butterfly, RP3, HEP
- Crossbar switch-based systems: C.mmp, Alliant FX/8
Limitations
Memory access latency; Hot spot problem
[Figure: shared-memory multiprocessor — processors P connected to memory modules M through an interconnection network (buses, multistage IN, or crossbar switch).]
Characteristics of Multiprocessors
MESSAGE-PASSING MULTIPROCESSORS
Characteristics
- Interconnected computers
- Each processor has its own memory; processors communicate via message passing
Example systems
- Tree structure: Teradata, DADO - Mesh-connected: Rediflow, Series 2010, J-Machine - Hypercube: Cosmic Cube, iPSC, NCUBE, FPS T Series, Mark III
Limitations
- Communication overhead; hard to program
[Figure: message-passing multiprocessor — each processor P with its own memory M, connected by point-to-point links in a message-passing network.]
Aug 2009 / Q5
Write short notes on:
Flynn’s architectural classification
Applications of parallel processing
Fault-tolerant computing
Flynn (1966) classified architectures based on instruction and data streams:
SISD – Single Instruction stream, Single Data stream: the conventional uniprocessor system; still offers a lot of intra-CPU parallelism options.
SIMD – Single Instruction stream, Multiple Data streams: vector and array style computers; the first accepted multiple-PE style systems; has now fallen behind the MIMD option.
MISD – Multiple Instruction streams, Single Data stream: no commercial products.
MIMD – Multiple Instruction streams, Multiple Data streams: intrinsic parallel computers; lots of options – today’s winner.
Taxonomy of Parallel Architectures
Legends in Flynn’s classification: CU = Control Unit, PU = Processor Unit, MM = Memory Module, SM = Shared Memory, IS = Instruction Stream, DS = Data Stream.
Flynn’s Classification
Architecture Categories
SISD SIMD MISD MIMD
Classification based on notions of instruction and data stream
SISD
Uniprocessors.
[Figure: a single control unit C, processor P, and memory M, with one instruction stream and one data stream.]
SIMD
Processors that execute the same instruction on multiple pieces of data:
i. The same instruction is executed by multiple processors using different data streams.
ii. Each processor has its own data memory (hence multiple data).
iii. There is a single instruction memory and control processor.
[Figure: one control unit C broadcasting the instruction stream to multiple processors P, each with its own data stream.]
MISD
[Figure: multiple control units C issuing separate instruction streams to processors P that operate on a single data stream from memory M.]
MIMD
Each processor fetches its own instructions and operates on its own data.
[Figure: multiple control units C, each driving its own processor P with its own instruction and data streams.]
MIMD is the current winner: most design emphasis is on machines with ≤ 128 processors.
Uses off-the-shelf microprocessors: cost-performance advantages.
Flexible: high performance for one application, or running many tasks simultaneously.
Examples: Sun Enterprise 5000, Cray T3D, SGI Origin.
Applications of parallel processing - 1
Predictive modeling and simulations: numerical weather forecasting; oceanography and astrophysics; socioeconomics and government use.
Engineering design and automation: finite element analysis; computational aerodynamics; remote sensing applications; artificial intelligence and automation.
Applications of parallel processing - 2
Energy resources exploration: seismic exploration; reservoir modeling; plasma fusion power; nuclear reactor safety.
Medical, military, and basic research: computer-assisted tomography; genetic engineering; weapons research and defense; basic research problems.
Fault Tolerant Computing – 1/3
Fault-tolerant computing is the art and science of building computing systems that continue to operate satisfactorily in the presence of faults.
A fault-tolerant system may be able to tolerate one or more fault types, including:
transient, intermittent, or permanent hardware faults
software and hardware design errors
operator errors
externally induced upsets or physical damage
Fault Tolerant Computing – Basic Concept 2/3
Hardware fault tolerance: each module is backed up with protective redundancy.
Fault masking: a number of identical modules execute the same functions, and their outputs are voted on to remove errors created by a faulty module.
Dynamic recovery: involves automated self-repair.
Software fault tolerance: efforts to attain software that can tolerate software design faults (programming errors) have made use of static and dynamic redundancy approaches similar to those used for hardware faults.
Aug 2009 / Q6
Describe the architectural features of any two of the following:
Pentium III
SPARC architecture
PowerPC
Pentium III
The Pentium III brand refers to Intel's 32-bit x86 desktop and mobile microprocessors based on the sixth-generation P6 microarchitecture, introduced on February 26, 1999.
The most notable difference was the addition of the SSE instruction set (to accelerate floating point and parallel calculations), and the introduction of a controversial serial number embedded in the chip during the manufacturing process.
RISC Scalar Processors
Designed to issue one instruction per cycle.
RISC and CISC scalar processors should have the same performance if clock rate and program lengths are equal.
RISC moves less frequent operations into software, thus dedicating hardware resources to the most frequently used operations.
Representative systems: Sun SPARC, Intel i860, Motorola M88100, AMD 29000.
Power PC
PowerPC (short for Performance Optimization With Enhanced RISC – Performance Computing, sometimes abbreviated as PPC) is a RISC architecture created by the 1991 Apple–IBM–Motorola alliance, known as AIM.
Originally intended for personal computers, PowerPC CPUs have since become popular as embedded and high-performance processors.
PowerPC is largely based on IBM's earlier POWER architecture, and retains a high level of compatibility with it
Aug 2009 / Q7
Explain two bus arbitration schemes for a multiprocessor. What is cache coherence? Explain the static coherence check mechanism.
Arbitration
The process of selecting the next bus master.
Bus tenure is the duration of control.
Arbitration is done on a fairness or priority basis.
Arbitration competition and bus transactions can take place concurrently on a parallel bus over separate lines.
Central Arbitration
Potential masters are daisy-chained.
A signal line propagates the bus-grant from the first master to the last master.
There is only one bus-request line.
The bus-grant line activates the bus-busy line.
[Figure: central arbitration — daisy-chained bus masters, with one bus-request line and the bus-grant signal propagating from the first master to the last.]
Central Arbitration
Simple scheme.
Easy to add devices.
Fixed-priority sequence – not fair.
Propagation of the bus-grant signal is slow.
Not fault tolerant.
Independent Requests and Grants
Provide independent bus-request and bus-grant signals for each master.
Require a central arbiter, but can use a priority- or fairness-based policy.
More flexible and faster than a daisy-chained policy.
Larger number of lines – costly.
Distributed Arbitration
Each master has its own arbiter and a unique arbitration number.
The arbitration number is used to resolve competition: each competitor places its number on the SBRG lines and compares its own number with the SBRG value.
It is a priority-based scheme.
Aug 2009 / Q8
How can data hazards be overcome by dynamic scheduling?
Dynamic Scheduling
3.1 Instruction Level Parallelism: Concepts and Challenges
3.2 Overcoming Data Hazards with Dynamic Scheduling
3.3 Dynamic Scheduling: Examples & The Algorithm
3.4 Reducing Branch Penalties with Dynamic Hardware Prediction
3.5 High Performance Instruction Delivery
3.6 Taking Advantage of More ILP with Multiple Issue
3.7 Hardware-based Speculation
3.8 Studies of The Limitations of ILP
3.10 The Pentium 4
Advantages of Dynamic Scheduling
• Handles cases where dependences are unknown at compile time (e.g., because they may involve a memory reference).
• It simplifies the compiler.
• Allows code compiled for one pipeline to run efficiently on a different pipeline.
• Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling.
Dynamic Scheduling
Why do this in hardware at run time?
It works when real dependences cannot be known at compile time.
It makes the compiler simpler.
Code compiled for one machine runs well on another.
Key idea: allow instructions behind a stall to proceed.
Key idea: instructions execute in parallel; there are multiple execution units, so use them.
The idea (HW schemes for instruction parallelism):
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
Even though ADDD stalls (waiting for the DIVD result in F0), the SUBD has no dependences and can run.
This enables out-of-order execution, and hence out-of-order completion.
Dynamic Scheduling Using a Scoreboard
Out-of-order execution divides the ID stage into:
1. Issue — decode instructions, check for structural hazards.
2. Read operands — wait until no data hazards, then read operands.
Scoreboarding allows an instruction to execute whenever 1 and 2 hold, without waiting for prior instructions.
A scoreboard is a “data structure” that provides the information necessary for all pieces of the processor to work together.
We will use in-order issue, out-of-order execution, out-of-order commit (also called completion).
First used in the CDC 6600; our example is modified here for MIPS. The CDC 6600 had 4 FP units, 5 memory reference units, and 7 integer units; MIPS has 2 FP multipliers, 1 FP adder, 1 FP divider, and 1 integer unit.
Scoreboard Implications
Out-of-order completion introduces WAR and WAW hazards. Solutions for WAR:
Queue both the operation and copies of its operands.
Read registers only during the Read Operands stage.
For WAW, the hazard must be detected: stall issue until the other instruction completes.
Multiple instructions must be in the execution phase at once, so there must be multiple execution units or pipelined execution units.
The scoreboard keeps track of dependences and the state of operations.
The scoreboard replaces ID, EX, WB with four stages.
Four Stages of Scoreboard Control
1. Issue — decode instructions and check for structural hazards (ID1).
If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure.
If a structural or WAW hazard exists, the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
2. Read operands —wait until no data hazards, then read operands (ID2)
A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit.
When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
3. Execution — operate on operands (EX).
The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result — finish execution (WB).
Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none exist, it writes the result. If a WAR hazard exists, it stalls the instruction.
Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
The scoreboard would stall SUBD until ADDD reads its operands (avoiding the WAR hazard on F8).
Three Parts of the Scoreboard
1. Instruction status—which of 4 steps the instruction is in
2. Functional unit status—Indicates the state of the functional unit (FU). 9 fields for each functional unit
Busy—Indicates whether the unit is busy or not
Op—Operation to perform in the unit (e.g., + or –)
Fi—Destination register
Fj, Fk—Source-register numbers
Qj, Qk—Functional units producing source registers Fj, Fk
Rj, Rk—Flags indicating when Fj, Fk are ready
3. Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register
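A minimal Python sketch of these three tables and of the issue check; the functional unit names and field values are illustrative, not a complete scoreboard implementation.

```python
# Minimal sketch of the three scoreboard data structures; field names follow
# the text above, but the contents are invented for illustration.
instruction_status = {}      # instruction -> one of: issue, read, exec, write

functional_unit_status = {
    "FPadd": {"Busy": False, "Op": None, "Fi": None, "Fj": None, "Fk": None,
              "Qj": None, "Qk": None, "Rj": False, "Rk": False},
    "FPmul": {"Busy": False, "Op": None, "Fi": None, "Fj": None, "Fk": None,
              "Qj": None, "Qk": None, "Rj": False, "Rk": False},
}

register_result_status = {}  # register -> functional unit that will write it (if any)

def can_issue(fu, dest):
    """Issue stalls on a structural hazard (busy FU) or a WAW hazard (pending writer)."""
    return (not functional_unit_status[fu]["Busy"]
            and dest not in register_result_status)

print(can_issue("FPadd", "F10"))   # True: the unit is free and no write to F10 is pending
```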
Detailed Scoreboard Pipeline Control (instruction status, wait condition, bookkeeping):
Issue — wait until: not Busy(FU) and not Result(D). Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU.
Read operands — wait until: Rj and Rk. Bookkeeping: Rj ← No; Rk ← No.
Execution complete — wait until: the functional unit is done. Bookkeeping: none.
Write result — wait until: ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) and (Fk(f) ≠ Fi(FU) or Rk(f) = No)). Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No.
Aug 2009 / Q9
With the help of the block diagram, explain the crossbar switch system organization for multiprocessors
Derive expressions for the speedup, efficiency, and throughput of a pipelined processor. What are their ideal values?
May 2009 / Q1