BEST OF ALL QUESTION PAPERS
All question papers, assignments, and tests
Sep 2010 - Question 1
Write the neat functional structure of an SIMD array processor with concurrent scalar processing in the control unit. (10 marks)
SIMD Supercomputers
The operational model is a 5-tuple (N, C, I, M, R).
N = number of processing elements (PEs).
C = set of instructions executed directly by the control unit (including scalar and flow-control instructions).
I = set of instructions broadcast to all PEs for parallel execution.
M = set of masking schemes used to partition PEs into enabled/disabled states.
R = set of data-routing functions that enable inter-PE communication through the interconnection network.
[Figure: operational model of an SIMD computer — a control unit drives an array of PEs, each with a processor P and local memory M, connected through an interconnection network.]
Sep 2010 - Question 2
Define clock rate, CPI, MIPS rate and throughput rate. (10 marks)
Clock Rate
Clock rate: the CPU is driven by a clock with a constant cycle time. The cycle time, τ, is usually given in nanoseconds. The inverse of the cycle time is the clock rate, f = 1/τ, in megahertz.
The clock rate is the rate in cycles per second (measured in hertz) or the frequency of the clock in any synchronous circuit, such as a central processing unit (CPU).
CPI (cycles per instruction): the average number of clock cycles needed to execute one instruction of a given program; it depends on the instruction mix.
MIPS (millions of instructions per second) rate: MIPS = f / (CPI × 10^6), where f is the clock rate.
Throughput rate: the number of programs a system can execute per unit time, Wp = f / (Ic × CPI), where Ic is the instruction count of the program.
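To make the definitions concrete, the following sketch evaluates them for an invented workload (a 500 MHz clock, a 2,000,000-instruction program, average CPI of 2.5); the numbers are illustrative only.

```python
# Illustrative numbers only: clock rate, instruction count, and CPI are made up.
f = 500e6            # clock rate in Hz (f = 1 / cycle_time)
cycle_time = 1 / f   # cycle time tau in seconds
Ic = 2_000_000       # instruction count of the program
CPI = 2.5            # average clock cycles per instruction

cpu_time = Ic * CPI * cycle_time     # T = Ic * CPI * tau
mips = f / (CPI * 1e6)               # MIPS rate = f / (CPI * 10^6)
throughput = 1 / cpu_time            # programs executed per second

print(f"CPU time   = {cpu_time * 1e3:.1f} ms")   # 10.0 ms
print(f"MIPS rate  = {mips:.0f}")                # 200
print(f"Throughput = {throughput:.0f} programs/s")
```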
Sep 2010 – Question 5
What are reservation tables in pipelining? Mention their advantages.
Reservation Table
Specifies the utilization pattern of successive stages.
A static linear pipeline follows a diagonal streamline through the table.
A task needs k clock cycles to flow through a k-stage pipeline.
One result emerges at each cycle if the tasks are independent of each other.
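As an illustration of how a reservation table is used, the sketch below encodes a hypothetical 3-stage, 5-cycle table and derives its forbidden latencies (initiation intervals that would cause two tasks to collide on a stage); the table itself is made up.

```python
# A hypothetical reservation table: rows are pipeline stages, columns are clock
# cycles, and a 1 marks the cycle in which a task occupies that stage.
table = [
    [1, 0, 0, 0, 1],   # stage S1 used in cycles 1 and 5
    [0, 1, 0, 1, 0],   # stage S2 used in cycles 2 and 4
    [0, 0, 1, 0, 0],   # stage S3 used in cycle 3
]

# Forbidden latencies are the distances between two marks in the same row:
# initiating a new task at such a latency collides with an earlier task.
forbidden = set()
for row in table:
    cycles = [c for c, used in enumerate(row) if used]
    for i in range(len(cycles)):
        for j in range(i + 1, len(cycles)):
            forbidden.add(cycles[j] - cycles[i])

print("forbidden latencies:", sorted(forbidden))   # here: [2, 4]
```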
Sep 2010 – Q 6
Explain TLB, Paging and Segmentation in virtual memory
Virtual Memory
To facilitate the use of memory hierarchies, the memory addresses normally generated by modern processors executing application programs are not physical addresses, but are rather virtual addresses of data items and instructions.
Physical addresses, of course, are used to reference the available locations in the real physical memory of a system.
Virtual addresses must be mapped to physical addresses before they can be used.
Mapping Efficiency
Efficient implementations are more difficult in multiprocessor systems where additional problems such as coherence, protection, and consistency must be addressed.
Virtual Memory Models (1)
Private Virtual Memory: in this scheme, each processor has a separate virtual address space, but all processors share the same physical address space.
Virtual Memory Models (2)
Shared Virtual Memory: all processors share a single virtual address space, with each processor being given a portion of it.
Memory Allocation
Both the virtual address space and the physical address space are divided into fixed-length pieces. In the virtual address space these pieces are called pages; in the physical address space they are called page frames. The purpose of memory allocation is to allocate pages of virtual memory using the page frames of physical memory.
Address Translation Mechanisms
Virtual-to-physical address translation requires the use of a translation map. The virtual address can be used with a hash function to locate the translation map (which is stored in the cache, an associative memory, or in main memory).
The translation map comprises a translation lookaside buffer, or TLB (usually in associative memory), and a page table (or tables). The virtual address is first sought in the TLB; if that search succeeds, no further translation is necessary. Otherwise, the page table(s) must be referenced to obtain the translation result.
If the virtual address cannot be translated to a physical address because the required page is not present in primary memory, a page fault is reported.
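A minimal sketch of this lookup order (TLB first, then page table, then page fault), with an invented page size and page-table contents:

```python
# Simplified virtual-to-physical translation with a TLB in front of a page table.
# PAGE_SIZE and the table contents are made up for illustration.
PAGE_SIZE = 4096

page_table = {0: 7, 1: 3, 2: 9}     # virtual page number -> page frame number
tlb = {0: 7}                        # small cache of recent translations

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                  # TLB hit: no further translation needed
        frame = tlb[vpn]
    elif vpn in page_table:         # TLB miss: reference the page table, refill the TLB
        frame = page_table[vpn]
        tlb[vpn] = frame
    else:                           # required page not present in primary memory
        raise RuntimeError("page fault: the OS must bring the page in from disk")
    return frame * PAGE_SIZE + offset

print(hex(translate(0x1234)))       # VPN 1 -> frame 3 -> 0x3234
```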
Define effective access time of a memory hierarchy
Hierarchical Memory Technology
Memory in a system is usually characterized as appearing at various levels (0, 1, …) in a hierarchy, with level 0 being the CPU registers and level 1 being the cache closest to the CPU.
Each level is characterized by five parameters:
access time t_i (round-trip time from the CPU to the ith level)
memory size s_i (number of bytes or words in the level)
cost per byte c_i
transfer bandwidth b_i (rate of transfer between levels)
unit of transfer x_i (grain size for transfers)
Memory Generalities
It is almost always the case that memories at lower-numbered levels, when compared to those at higher-numbered levels, are faster to access, are smaller in capacity, are more expensive per byte, have a higher bandwidth, and have a smaller unit of transfer.
In general, then, t_{i-1} < t_i, s_{i-1} < s_i, c_{i-1} > c_i, b_{i-1} > b_i, and x_{i-1} < x_i.
Hit Ratios
When a needed item (instruction or data) is found in the level of the memory hierarchy being examined, it is called a hit. Otherwise (when it is not found), it is called a miss (and the item must be obtained from a lower level in the hierarchy).
The hit ratio, h_i, for M_i is the probability (between 0 and 1) that a needed data item is found when it is sought in level M_i.
The miss ratio is then just 1 − h_i. We assume h_0 = 0 and h_n = 1.
Access Frequencies
The access frequency f_i to level M_i is f_i = h_i (1 − h_1)(1 − h_2) … (1 − h_{i−1}).
Note that f_1 = h_1, and the frequencies sum to one: Σ_{i=1..n} f_i = 1.
Effective Access Times
There are different penalties associated with misses at different levels in the memory hierarchy. A cache miss is typically 2 to 4 times as expensive as a cache hit (assuming success at the next level). A page fault (miss) is 3 to 4 orders of magnitude as costly as a page hit.
The effective access time of a memory hierarchy can be expressed as
T_eff = Σ_{i=1..n} f_i t_i = h_1 t_1 + (1 − h_1) h_2 t_2 + (1 − h_1)(1 − h_2) h_3 t_3 + … + (1 − h_1)(1 − h_2) … (1 − h_{n−1}) t_n
The first few terms in this expression dominate, but the effective access time is still dependent on program behavior and memory design choices.
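A small sketch that evaluates this formula for an invented three-level hierarchy (cache, main memory, disk); the hit ratios and access times below are illustrative, not measured values.

```python
# Effective access time for a hypothetical 3-level hierarchy; values are invented.
h = [0.95, 0.9999, 1.0]        # hit ratio h_i of each level (last level always hits)
t = [10e-9, 100e-9, 5e-3]      # round-trip access time t_i of each level (seconds)

T_eff = 0.0
miss_so_far = 1.0              # running product (1 - h_1)(1 - h_2)...(1 - h_{i-1})
for h_i, t_i in zip(h, t):
    f_i = miss_so_far * h_i    # access frequency of level i
    T_eff += f_i * t_i
    miss_so_far *= (1 - h_i)

print(f"T_eff = {T_eff * 1e9:.1f} ns")   # roughly 39.5 ns for these numbers
```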
Sep 2010 – Q8 (10 marks)
With a suitable example, distinguish between hardware and software parallelism
Hardware Parallelism
Hardware parallelism is defined by machine architecture and hardware multiplicity.
It can be characterized by the number of instructions that can be issued per machine cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor. Conventional processors are one-issue machines.
Examples. Intel i960CA is a three-issue processor (arithmetic, memory access, branch). IBM RS-6000 is a four-issue processor (arithmetic, floating-point, memory access, branch).
A machine with n k-issue processors should be able to handle a maximum of nk threads simultaneously.
Software Parallelism
Software parallelism is defined by the control and data dependence of programs, and is revealed in the program’s flow graph.
It is a function of algorithm, programming style, and compiler optimization.
Mismatch between software and hardware parallelism - 1
[Figure: maximum software parallelism (L = load, X/+/− = arithmetic) — four loads in cycle 1, two multiplies in cycle 2, and the add and subtract producing A and B in cycle 3.]
Mismatch between software and hardware parallelism - 2
[Figure: the same problem scheduled on a two-issue superscalar processor — the limited hardware parallelism stretches execution to seven cycles.]
Sep 2010 – Q 9 (10 marks)
What is grain packing? With a suitable example, write a program graph before and after grain packing?
Grain Packing and Scheduling
Two questions:
How can I partition a program into parallel “pieces” to yield the shortest execution time?
What is the optimal size of parallel grains?
There is an obvious tradeoff between the time spent scheduling and synchronizing parallel grains and the speedup obtained by parallel execution.
One approach to the problem is called “grain packing.”
Program Graphs and Packing
A program graph is similar to a dependence graph.
Nodes = { (n, s) }, where n = node name and s = size (larger s = larger grain size).
Edges = { (v, d) }, where v = the variable being “communicated” and d = the communication delay.
Packing two (or more) nodes produces a node with a larger grain size and possibly more edges to other nodes.
Packing is done to eliminate unnecessary communication delays or to reduce overall scheduling overhead (a small sketch follows the scheduling notes below).
Scheduling
A schedule is a mapping of nodes to processors and start times such that communication delay requirements are observed, and no two nodes are executing on the same processor at the same time.
Some general scheduling goals:
Schedule all fine-grain activities in a node to the same processor to minimize communication delays.
Select grain sizes for packing to achieve better schedules for a particular parallel machine.
Mar 2010 Q1 (10 marks)
Explain program partitioning and scheduling with example.
Program Partitioning & Scheduling
The size of the parts or pieces of a program that can be considered for parallel execution can vary.
The sizes are roughly classified using the term “granule size,” or simply “granularity.”
The simplest measure, for example, is the number of instructions in a program part.
Grain sizes are usually described as fine, medium or coarse, depending on the level of parallelism involved.
Latency
Latency is the time required for communication between different subsystems in a computer.
Memory latency, for example, is the time required by a processor to access memory.
Synchronization latency is the time required for two processes to synchronize their execution.
Computational granularity and communication latency are closely related.
Levels of Parallelism
[Figure: levels of parallelism —
Fine grain: instructions or statements; non-recursive loops or unfolded iterations.
Medium grain: procedures, subroutines, tasks, or coroutines.
Coarse grain: subprograms, job steps or related parts of a program; jobs or programs.
Finer-grain levels offer a higher degree of parallelism but also increasing communication demand and scheduling overhead.]
Instruction Level Parallelism
This fine-grained, or smallest-granularity, level typically involves fewer than 20 instructions per grain. The number of candidates for parallel execution varies from 2 to thousands, with the average level of parallelism being about five instructions or statements.
Advantages:
There are usually many candidates for parallel execution.
Compilers can usually do a reasonable job of finding this parallelism.
Loop-level Parallelism
Typical loop has less than 500 instructions.
If a loop operation is independent between iterations, it can be handled by a pipeline, or by a SIMD machine.
Most optimized program construct to execute on a parallel or vector machine
Some loops (e.g. recursive) are difficult to handle.
Loop-level parallelism is still considered fine grain computation.
Procedure-level Parallelism
Medium-sized grain; usually less than 2000 instructions.
Detection of parallelism is more difficult than with smaller grains; interprocedural dependence analysis is difficult and history-sensitive.
Communication requirement less than instruction-level
SPMD (single procedure multiple data) is a special case
Multitasking belongs to this level.
Subprogram-level Parallelism
Job step level; a grain typically has thousands of instructions; medium- or coarse-grain level.
Job steps can overlap across different jobs.
Multiprogramming is conducted at this level.
No compilers are available to exploit medium- or coarse-grain parallelism at present.
Job or Program-Level Parallelism
Corresponds to the execution of essentially independent jobs or programs on a parallel computer.
This is practical for a machine with a small number of powerful processors, but impractical for a machine with a large number of simple processors (since each processor would take too long to process a single job).
Summary
Fine-grain exploited at instruction or loop levels, assisted by the compiler.
Medium-grain (task or job step) requires programmer and compiler support.
Coarse-grain relies heavily on effective OS support.
Shared-variable communication used at fine- and medium-grain levels.
Message passing can be used for medium- and coarse-grain communication, but fine grain really needs a better technique because of its heavier communication requirements.
Communication Latency
Balancing granularity and latency can yield better performance.
Various latencies attributed to machine architecture, technology, and communication patterns used.
Latency imposes a limiting factor on machine scalability. Ex. Memory latency increases as memory capacity increases, limiting the amount of memory that can be used with a given tolerance for communication latency.
Interprocessor Communication Latency
Needs to be minimized by system designer
Affected by signal delays and communication patterns
Ex. n communicating tasks may require n (n - 1)/2 communication links, and the complexity grows quadratically, effectively limiting the number of processors in the system.
Mar 2010 Q2 (10 Marks)
Write notes on the following:
Parallelism profile of programs
Harmonic mean performance
Degree of Parallelism
The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); this can vary over time.
DOP assumes an infinite number of processors are available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments. Other resources may impose limiting conditions.
A plot of DOP vs. time is called a parallelism profile.
Example Parallelism Profile
[Figure: DOP plotted against time over the interval (t1, t2), with the average parallelism shown as a horizontal line.]
Average Parallelism
The work performed is the area under the parallelism profile, W = ∫ from t1 to t2 of DOP(t) dt, and the average parallelism over the interval is A = W / (t2 − t1).
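For a profile given as a list of (DOP, duration) segments, the average parallelism can be computed directly from that definition; the sketch below uses invented values.

```python
# Average parallelism from a discretized parallelism profile: DOP processors
# are busy for each duration; the numbers below are invented for illustration.
profile = [(1, 2.0), (4, 3.0), (2, 1.0)]   # (DOP, duration in seconds)

W = sum(dop * d for dop, d in profile)     # total work = integral of DOP(t) dt
T = sum(d for _, d in profile)             # observation interval t2 - t1
A = W / T                                  # average parallelism

print(f"W = {W} processor-seconds over {T} s, A = {A:.2f}")   # A = 2.67
```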
Available Parallelism
Various studies have shown that the potential parallelism in scientific and engineering calculations can be very high (e.g. hundreds or thousands of instructions per clock cycle).
But in real machines, the actual parallelism is much smaller (e.g. 10 or 20).
Mar 2010 Q8 (10 Marks)
Explain cache performance issues.
Cache performance issues: cycle counts, hit ratios, effect of block size, effects of set number, and others.
Coherence Strategies
Write-through: as soon as a data item in M_i is modified, an immediate update of the corresponding data item(s) in M_{i+1}, M_{i+2}, …, M_n is required. This is the most aggressive (and expensive) strategy.
Write-back: the data item in M_{i+1} corresponding to a modified item in M_i is not updated until it (or the block/page/etc. in M_i that contains it) is replaced or removed. This is the most efficient approach, but cannot be used (without modification) when multiple processors share M_{i+1}, …, M_n.
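A toy sketch contrasting the two update policies on a two-level hierarchy; this models only the timing of updates between levels, not a real cache implementation.

```python
# Write-through updates the lower level immediately; write-back defers the
# update until the modified item is replaced or removed.
class Level:
    def __init__(self, lower=None, write_back=False):
        self.data, self.dirty = {}, set()
        self.lower, self.write_back = lower, write_back

    def write(self, addr, value):
        self.data[addr] = value
        if self.write_back:
            self.dirty.add(addr)                     # defer the update to the next level
        elif self.lower:
            self.lower.write(addr, value)            # write-through: update at once

    def evict(self, addr):
        if addr in self.dirty and self.lower:
            self.lower.write(addr, self.data[addr])  # write back only on replacement
            self.dirty.discard(addr)
        self.data.pop(addr, None)

memory = Level()
cache = Level(lower=memory, write_back=True)
cache.write(0x10, 42)        # under write-back, main memory is still stale here
cache.evict(0x10)            # replacement pushes the modified value down
print(memory.data)           # {16: 42}
```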
Cycle count: the number of machine cycles needed for cache access, update, and coherence.
Hit ratio: how effectively the cache can reduce the overall memory access time.
Program-trace-driven simulation: presents snapshots of program behavior and cache responses.
Analytical modeling: provides insight into the underlying processes.
Cycle Counts
Cache speed affected by underlying static or dynamic RAM technology, organization, and hit ratios
Write-thru/write-back policies affect count
Cache size, block size, set number, and associativity affect count
Directly related to hit ratio
Hit Ratio
Affected by cache size and block size.
Increases with increasing cache size.
Limited cache size, initial loading, and changes in locality prevent a 100% hit ratio.
Effect of Block Size
With fixed cache size, block size has impact
As block size increases, hit ratio improves due to spatial locality
Peaks at optimum block size, then decreases
If too large, many words in cache not used
Cache performance
Mar 2010 Q6
Superpipelined Design
Degree n: pipeline cycle time = 1/n of the base cycle.
A fixed-point addition takes one cycle in a base scalar processor, but n short cycles in a superpipelined processor.
Issue rate = 1, issue latency = 1/n, ILP = n.
Requires high-speed clocking.
Superpipelined Performance
For N instructions on a degree-n superpipelined machine with k stages:
T(1, n) = k + (N − 1)/n
S(1, n) = n(k + N − 1) / (nk + N − 1)
Superpipelined Superscalar
Degree (m, n): executes m instructions every cycle, with pipeline cycle time = 1/n of the base cycle.
Instruction issue latency = 1/n; ILP = mn instructions.
Superscalar Superpipelined Performance
For N independent instructions on a degree-(m, n) machine:
T(m, n) = k + (N − m)/(mn)
S(m, n) = mn(k + N − 1) / (mnk + N − m)
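Evaluating these formulas for some illustrative values of k, N, m, and n shows the speedups approaching n and mn, respectively, as N grows; the numbers below are arbitrary.

```python
# Timing and speedup formulas for a degree-(m, n) superscalar superpipelined
# machine, evaluated with illustrative values (k stages, N instructions).
def T(m, n, k, N):
    return k + (N - m) / (m * n)              # execution time in base cycles

def S(m, n, k, N):
    return (m * n * (k + N - 1)) / (m * n * k + N - m)

k, N = 4, 1000
print(f"T(1,3) = {T(1, 3, k, N):.1f} cycles, S(1,3) = {S(1, 3, k, N):.2f}")   # -> n = 3 for large N
print(f"T(2,3) = {T(2, 3, k, N):.1f} cycles, S(2,3) = {S(2, 3, k, N):.2f}")   # -> mn = 6 for large N
```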
Design Approaches
Superpipelined: emphasizes temporal parallelism; needs faster transistors; the design must minimize the effects of clock skew.
Superscalar: depends on spatial parallelism; needs more transistors; a better match for CMOS technology.
Mar 2010 Q4
Explain buses and interfaces in multiprocessor system.
Generalized Multiprocessor System
[Figure: generalized multiprocessor system — each processor with local memory and private cache, connected to shared-memory modules through the IPMN, to I/O through the PION, and optionally to other processors through the IPCN.]
Each processor Pi is attached to its own local memory and private cache.
Multiple processors connected to shared memory through interprocessor memory network (IPMN).
Processors share access to I/O and peripherals through processor-I/O network (PION).
Both IPMN and PION are necessary in a shared-resource multiprocessor.
An optional interprocessor communication network (IPCN) can permit processor communication without using shared memory.
Interconnection Network Choices
Timing:
Synchronous – controlled by a global clock.
Asynchronous – uses handshaking or interlock mechanisms.
Switching method:
Circuit switching – a pair of communicating devices controls the path for the entire duration of the data transfer.
Packet switching – large data transfers are broken into smaller pieces, each of which can compete for use of the path.
Network control:
Centralized – a global controller receives and acts on requests.
Distributed – requests are handled by local devices independently.
Digital Buses
Digital buses are the fundamental interconnects adopted in most commercial multiprocessor systems with less than 100 processors.
The principal limitation to the bus approach is packaging technology.
Complete bus specifications include logical, electrical and mechanical properties, application profiles, and interface requirements.
Bus Systems
A bus system is a hierarchy of buses connecting various system and subsystem components.
Each bus has a complement of control, signal, and power lines.
There is usually a variety of buses in a system:
Local bus – (usually integral to a system board) connects various major system components (chips).
Memory bus – used within a memory board to connect the interface, the controller, and the memory cells.
Data bus – might be used on an I/O board or VLSI chip to connect various components.
Backplane – like a local bus, but with connectors to which other boards can be attached.
Hierarchical Bus Systems
There are numerous ways in which buses, processors, memories, and I/O devices can be organized.
One organization has processors (and their caches) as leaf nodes in a tree, with the buses (and caches) to which these processors connect forming the interior nodes.
This generic organization, with appropriate protocols to ensure cache coherency, can model most hierarchical bus organizations.
Bridges
The term bridge is used to denote a device that is used to connect two (or possibly more) buses.
The interconnected buses may use the same standards, or they may be different (e.g. PCI and ISA buses in a modern PC).
Bridge functions include:
Communication protocol conversion
Interrupt handling
Serving as cache and memory agents
Cache addressing model
Cache Addressing Models
Most systems use private caches for each processor.
There is an interconnection network between the caches and main memory.
Caches are addressed using either a physical address or a virtual address.
Physical Address Caches
Cache is indexed and tagged with the physical address
Cache lookup occurs after address translation in TLB or MMU (no aliasing)
After cache miss, load a block from main memory
Use either write-back or write-through policy
Physical Address Caches
Advantages:
No cache flushing
No aliasing problems
Simple design
Requires little intervention from the OS kernel
Disadvantage:
Slowdown in accessing the cache until the MMU/TLB finishes translation
Physical Address Models
Virtual Address Caches
The cache is indexed or tagged with the virtual address.
Cache access and MMU translation/validation are performed in parallel.
The physical address is saved in the tags for write-back.
More efficient access to the cache.
Virtual Address Model
Aliasing Problem
Different logically addressed data have the same index/tag in the cache
Confusion if two or more processors access the same physical cache location
Flush cache when aliasing occurs, but leads to slowdown
Apply special tagging with a process key or with a physical address
Block Placement Schemes
Performance depends upon cache access patterns, organization, and management policy
Blocks in the cache are called block frames B_i (i ≤ m, m = 2^r); blocks in main memory are B_j (j ≤ n, n = 2^s), with m << n.
Each block has b = 2^w words, so the cache holds a total of m·b = 2^(r+w) words and main memory holds n·b = 2^(s+w) words.
Direct Mapping Cache
Direct mapping of n/m memory blocks to one block frame in the cache.
Placement uses a modulo-m function: B_j → B_i if i = j mod m.
There is a unique block frame B_i that each B_j can load into.
This is the simplest organization to implement.
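A small sketch of the address split implied by direct mapping; the cache and block sizes are invented for the example.

```python
# Direct mapping: memory block j can only go to cache frame i = j mod m.
m = 2 ** 4          # 16 cache block frames  (m = 2^r)
b = 2 ** 2          # 4 words per block      (b = 2^w)

def direct_map(word_addr):
    block = word_addr // b           # memory block number j
    frame = block % m                # the unique frame i = j mod m
    tag = block // m                 # identifies which of the n/m blocks is resident
    offset = word_addr % b
    return tag, frame, offset

print(direct_map(0x1A7))             # (6, 9, 3): tag 6, frame 9, word offset 3
```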
Direct Mapping Cache
Direct Mapping Cache
Advantages:
Simple hardware
No associative search
No page replacement policy needed
Lower cost
Higher speed
Disadvantages:
Rigid mapping
Poorer hit ratio
Prohibits parallel virtual address translation
Requires a larger cache size with more block frames to avoid contention
Fully Associative Cache
Each block in main memory can be placed in any of the available block frames.
An s-bit tag is needed in each cache block (s > r).
An m-way associative search requires the tag to be compared with all cache block tags.
An associative memory is used to achieve a parallel comparison with all tags concurrently.
Fully Associative Cache
Fully Associative Caches
Advantages:
Offers the most flexibility in mapping cache blocks
Higher hit ratio
Allows a better block replacement policy with reduced block contention
Disadvantages:
Higher hardware cost
Only moderate-size caches are practical
Expensive search process
Set Associative Caches
In a k-way set-associative cache, the m cache block frames are divided into v = m/k sets, with k blocks per set.
Each set is identified by a d-bit set number.
The tag is compared with the k tags within the identified set only.
B_j → B_f ∈ S_i if j mod v = i.
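The same address split under k-way set-associative mapping, again with invented sizes; the lookup only searches the k tags of the selected set.

```python
# k-way set-associative mapping: block j maps to set i = j mod v (v = m/k sets)
# and may occupy any of the k frames in that set. Sizes are illustrative.
m, k, b = 16, 4, 4                  # frames, ways per set, words per block
v = m // k                          # number of sets

def lookup(cache_sets, word_addr):
    block = word_addr // b
    set_idx = block % v             # d-bit set number
    tag = block // v
    # compare the tag with the k tags within the identified set only
    return tag in cache_sets[set_idx]

cache_sets = [set() for _ in range(v)]
cache_sets[(0x1A7 // b) % v].add((0x1A7 // b) // v)   # pretend the block was loaded
print(lookup(cache_sets, 0x1A7))    # True
```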
Sector Mapping Cache
Partition the cache and main memory into fixed-size sectors, then use a fully associative search.
Use sector tags for the search and a block field within the sector to find the block.
Only the missing block is loaded on a miss.
The ith block in a sector is placed into the ith block frame of the destined sector frame.
A valid/invalid bit is attached to each block frame.
Backplane bus system
Backplane Bus Systems
The system bus operates on a contention basis: only one device is granted access to the bus at a time.
The effective bandwidth available to each processor is inversely proportional to the number of contending processors.
Simple and low cost; suitable for small systems (4 – 16 processors).
Backplane Bus Specification
Interconnects processors, data storage, and I/O devices.
Must allow communication between devices: timing protocols for arbitration and operational rules for orderly data transfers.
Signal lines are grouped into several buses.
Backplane Multiprocessor System
Data Transfer Bus
Composed of data, address, and control lines.
Address lines broadcast the data and device address; their number is proportional to the log of the address space size.
The number of data lines is proportional to the memory word length.
Control lines specify read/write, timing, and bus error conditions.
Bus Arbitration and Control
Arbitration: the process of assigning control of the DTB; the requester is the master and the receiving end is the slave.
Interrupt lines support prioritized interrupts.
Dedicated lines synchronize parallel activities among processor modules.
Utility lines provide periodic timing and coordinate the power-up/down sequences.
A bus controller board houses the control logic.
Functional Modules
Arbiter: the functional module that performs arbitration.
Bus timer: measures the time for data transfers.
Interrupter: generates interrupt requests and provides status/ID to the interrupt handler.
Location monitor: monitors data transfers.
Power monitor: monitors the power source.
System clock driver: provides the clock timing signal on the utility bus.
Board interface logic: matches signal line impedance, propagation time, and termination values.
Physical Limitations
Electrical, mechanical, and packaging limitations restrict # of boards
Can mount multiple backplane buses on the same backplane chassis
Difficult to scale a bus system due to packaging constraints
Aug 2009 Q2
How do you classify pipeline processors? Give examples.
Classification of Pipeline Processors
Arithmetic pipelining: the arithmetic logic units of a computer can be segmented for pipeline operations in various data formats. Well-known arithmetic pipeline examples are the four-stage pipes used in the Star-100.
Instruction pipelining: the execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. This technique is also known as instruction lookahead.
Processor pipelining: this refers to the pipeline processing of the same data stream by a cascade of processors, each of which processes a specific task. The data stream passes through the first processor, with results stored in a memory block that is also accessible by the second processor. The second processor then passes the refined results to the third, and so on.
Next question part:
Show that a linear pipeline of k stages can be at most k times faster than an equivalent non-pipelined processor.
Speedup and Efficiency
A k-stage pipeline processes n tasks in k + (n − 1) clock cycles: k cycles for the first task and n − 1 cycles for the remaining n − 1 tasks.
Total time to process n tasks (with clock period τ):
Tk = [k + (n − 1)] τ
For the non-pipelined processor:
T1 = n k τ
Speedup factor:
Sk = T1 / Tk = n k τ / ([k + (n − 1)] τ) = n k / [k + (n − 1)]
As n → ∞, Sk → k, so a linear pipeline is at most k times faster than the non-pipelined processor.
Efficiency and Throughput
Efficiency of the k-stage pipeline:
Ek = Sk / k = n / [k + (n − 1)]
Pipeline throughput (the number of tasks completed per unit time; note the equivalence to IPC):
Hk = n / ([k + (n − 1)] τ) = n f / [k + (n − 1)]
Pipeline Performance: Example
A task has 4 subtasks with stage times t1 = 60, t2 = 50, t3 = 90, and t4 = 80 ns; latch delay = 10 ns.
Pipeline cycle time = 90 + 10 = 100 ns (set by the slowest stage plus the latch delay).
Non-pipelined execution time per task = 60 + 50 + 90 + 80 = 280 ns.
The speedup for this case approaches 280/100 = 2.8.
Pipeline time for 1000 tasks = [1000 + 4 − 1] × 100 ns = 1003 × 100 ns.
Sequential time = 1000 × 280 ns.
Throughput = 1000 / (1003 × 100 ns) tasks per second.
What is the problem here? How could performance be improved?
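The arithmetic of this example, reproduced as a short sketch with the same numbers as above:

```python
# A 4-stage pipeline with a 100 ns cycle (90 ns slowest stage + 10 ns latch)
# versus 280 ns per task without pipelining, for 1000 tasks.
k, tau = 4, 100e-9          # stages, pipeline cycle time
t_seq = 280e-9              # non-pipelined time per task
n = 1000                    # tasks

T_pipe = (k + n - 1) * tau              # 1003 * 100 ns
T_nonpipe = n * t_seq                   # 1000 * 280 ns
speedup = T_nonpipe / T_pipe            # ~2.79, approaching 2.8 as n grows
throughput = n / T_pipe                 # tasks per second

print(f"speedup = {speedup:.2f}, throughput = {throughput / 1e6:.2f} M tasks/s")
```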
Aug 2009 Q4
What do you understand by loosely and tightly coupling of processors in a multi-processor system? Explain the salient features of each type. Discuss the processor characteristics for multiprocessor system.
SHARED MEMORY MULTIPROCESSORS
Characteristics
All processors have equally direct access to one large memory address space
Example systems
- Bus and cache-based systems: Sequent Balance, Encore Multimax
- Multistage IN-based systems: Ultracomputer, Butterfly, RP3, HEP
- Crossbar switch-based systems: C.mmp, Alliant FX/8
Limitations
Memory access latency; Hot spot problem
[Figure: shared-memory multiprocessor — processors P connected to memory modules M through an interconnection network (buses, multistage IN, or crossbar switch).]
Characteristics of Multiprocessors
MESSAGE-PASSING MULTIPROCESSORS
Characteristics
- Interconnected computers
- Each processor has its own memory; processors communicate via message passing
Example systems
- Tree structure: Teradata, DADO - Mesh-connected: Rediflow, Series 2010, J-Machine - Hypercube: Cosmic Cube, iPSC, NCUBE, FPS T Series, Mark III
Limitations
- Communication overhead; hard to program
[Figure: message-passing multiprocessor — each processor P with its own memory M, connected by point-to-point links in a message-passing network.]
Aug 2009 / Q5
Write short notes on:
Flynn’s architectural classification
Applications of parallel processing
Fault-tolerant computing
Flynn (1966) classified architectures based on instruction and data streams:
SISD – Single Instruction stream, Single Data stream: the conventional uniprocessor system; still offers a lot of intra-CPU parallelism options.
SIMD – Single Instruction stream, Multiple Data streams: vector and array style computers; the first accepted multiple-PE style systems; has now fallen behind the MIMD option.
MISD – Multiple Instruction streams, Single Data stream: no commercial products.
MIMD – Multiple Instruction streams, Multiple Data streams: intrinsic parallel computers; lots of options – today’s winner.
Taxonomy of Parallel Architectures
Legends in Flynn’s classification: CU = Control Unit, PU = Processor Unit, MM = Memory Module, SM = Shared Memory, IS = Instruction Stream, DS = Data Stream.
Flynn’s Classification
Architecture Categories
SISD SIMD MISD MIMD
Classification based on notions of instruction and data stream
SISD
Uniprocessors.
[Figure: a single control unit C, processor P, and memory M, with one instruction stream and one data stream.]
SIMD
Processors that execute the same instruction on multiple pieces of data:
i. The same instruction is executed by multiple processors using different data streams.
ii. Each processor has its own data memory (hence multiple data).
iii. There is a single instruction memory and control processor.
[Figure: one control unit C broadcasting the instruction stream to multiple processors P, each with its own data stream.]
MISD
[Figure: multiple control units C issuing separate instruction streams to processors P that operate on a single data stream from memory M.]
MIMD
Each processor fetches its own instructions and operates on its own data.
[Figure: multiple control units C, each driving its own processor P with its own instruction and data streams.]
MIMD is the current winner: most design emphasis is on machines with ≤ 128 processors.
Uses off-the-shelf microprocessors: cost-performance advantages.
Flexible: high performance for one application, or running many tasks simultaneously.
Examples: Sun Enterprise 5000, Cray T3D, SGI Origin.
Applications of parallel processing - 1
Predictive modeling and simulations: numerical weather forecasting; oceanography and astrophysics; socioeconomics and government use.
Engineering design and automation: finite element analysis; computational aerodynamics; remote sensing applications; artificial intelligence and automation.
Applications of parallel processing - 2
Energy resources exploration: seismic exploration; reservoir modeling; plasma fusion power; nuclear reactor safety.
Medical, military, and basic research: computer-assisted tomography; genetic engineering; weapons research and defense; basic research problems.
Fault Tolerant Computing – 1/3
Fault-tolerant computing is the art and science of building computing systems that continue to operate satisfactorily in the presence of faults.
A fault-tolerant system may be able to tolerate one or more fault types, including:
transient, intermittent, or permanent hardware faults
software and hardware design errors
operator errors
externally induced upsets or physical damage
Fault Tolerant Computing – Basic Concept 2/3
Hardware fault tolerance: each module is backed up with protective redundancy.
Fault masking: a number of identical modules execute the same functions, and their outputs are voted on to remove errors created by a faulty module.
Dynamic recovery: involves automated self-repair.
Software fault tolerance: efforts to attain software that can tolerate software design faults (programming errors) have made use of static and dynamic redundancy approaches similar to those used for hardware faults.
Aug 2009 / Q6
Describe the architectural features of any two of the following:
Pentium III
SPARC architecture
PowerPC
Pentium III
The Pentium III brand refers to Intel's 32-bit x86 desktop and mobile microprocessors based on the sixth-generation P6 microarchitecture, introduced on February 26, 1999.
The most notable difference was the addition of the SSE instruction set (to accelerate floating point and parallel calculations), and the introduction of a controversial serial number embedded in the chip during the manufacturing process.
RISC Scalar Processors
Designed to issue one instruction per cycle.
RISC and CISC scalar processors should have the same performance if clock rate and program lengths are equal.
RISC moves less frequent operations into software, thus dedicating hardware resources to the most frequently used operations.
Representative systems: Sun SPARC, Intel i860, Motorola M88100, AMD 29000.
Power PC
PowerPC (short for Performance Optimization With Enhanced RISC – Performance Computing, sometimes abbreviated as PPC) is a RISC architecture created by the 1991 Apple–IBM–Motorola alliance, known as AIM.
Originally intended for personal computers, PowerPC CPUs have since become popular as embedded and high-performance processors.
PowerPC is largely based on IBM's earlier POWER architecture, and retains a high level of compatibility with it
Aug 2009 / Q7
Explain two bus arbitration schemes for a multiprocessor. What is cache coherence? Explain the static coherence check mechanism.
Arbitration
The process of selecting the next bus master.
Bus tenure is the duration of control.
Arbitration is done on a fairness or priority basis.
Arbitration competition and bus transactions can take place concurrently on a parallel bus over separate lines.
Central Arbitration
Potential masters are daisy-chained.
A signal line propagates the bus-grant from the first master to the last master.
There is only one bus-request line.
The bus-grant line activates the bus-busy line.
[Figure: central arbitration — daisy-chained bus masters, with one bus-request line and the bus-grant signal propagating from the first master to the last.]
Central Arbitration
Simple scheme.
Easy to add devices.
Fixed-priority sequence – not fair.
Propagation of the bus-grant signal is slow.
Not fault tolerant.
Independent Requests and Grants
Provide independent bus-request and bus-grant signals for each master.
Require a central arbiter, but can use a priority- or fairness-based policy.
More flexible and faster than a daisy-chained policy.
Larger number of lines – costly.
Distributed Arbitration
Each master has its own arbiter and a unique arbitration number.
The arbitration number is used to resolve competition: each competitor places its number on the SBRG lines and compares its own number with the SBRG value.
It is a priority-based scheme.
Aug 2009 / Q8
How can data hazards be overcome by dynamic scheduling?
Dynamic Scheduling
3.1 Instruction Level Parallelism: Concepts and Challenges
3.2 Overcoming Data Hazards with Dynamic Scheduling
3.3 Dynamic Scheduling: Examples & The Algorithm
3.4 Reducing Branch Penalties with Dynamic Hardware Prediction
3.5 High Performance Instruction Delivery
3.6 Taking Advantage of More ILP with Multiple Issue
3.7 Hardware-based Speculation
3.8 Studies of The Limitations of ILP
3.10 The Pentium 4
Advantages of Dynamic Scheduling
• Handles cases where dependences are unknown at compile time (e.g., because they may involve a memory reference).
• It simplifies the compiler.
• Allows code compiled for one pipeline to run efficiently on a different pipeline.
• Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling.
Dynamic Scheduling
Why do this in hardware at run time?
It works when real dependences cannot be known at compile time.
It makes the compiler simpler.
Code compiled for one machine runs well on another.
Key idea: allow instructions behind a stall to proceed.
Key idea: instructions execute in parallel; there are multiple execution units, so use them.
The idea (HW schemes for instruction parallelism):
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
Even though ADDD stalls (waiting for the DIVD result in F0), the SUBD has no dependences and can run.
This enables out-of-order execution, and hence out-of-order completion.
Dynamic Scheduling Using a Scoreboard
Out-of-order execution divides the ID stage into:
1. Issue — decode instructions, check for structural hazards.
2. Read operands — wait until no data hazards, then read operands.
Scoreboarding allows an instruction to execute whenever 1 and 2 hold, without waiting for prior instructions.
A scoreboard is a “data structure” that provides the information necessary for all pieces of the processor to work together.
We will use in-order issue, out-of-order execution, out-of-order commit (also called completion).
First used in the CDC 6600; our example is modified here for MIPS. The CDC 6600 had 4 FP units, 5 memory reference units, and 7 integer units; MIPS has 2 FP multipliers, 1 FP adder, 1 FP divider, and 1 integer unit.
Scoreboard Implications
Out-of-order completion introduces WAR and WAW hazards. Solutions for WAR:
Queue both the operation and copies of its operands.
Read registers only during the Read Operands stage.
For WAW, the hazard must be detected: stall issue until the other instruction completes.
Multiple instructions must be in the execution phase at once, so there must be multiple execution units or pipelined execution units.
The scoreboard keeps track of dependences and the state of operations.
The scoreboard replaces ID, EX, WB with four stages.
Four Stages of Scoreboard Control
1. Issue — decode instructions and check for structural hazards (ID1).
If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure.
If a structural or WAW hazard exists, the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
2. Read operands —wait until no data hazards, then read operands (ID2)
A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit.
When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
3. Execution — operate on operands (EX).
The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result — finish execution (WB).
Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none exist, it writes the result. If a WAR hazard exists, it stalls the instruction.
Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
The scoreboard would stall SUBD until ADDD reads its operands (avoiding the WAR hazard on F8).
Three Parts of the Scoreboard
1. Instruction status—which of 4 steps the instruction is in
2. Functional unit status—Indicates the state of the functional unit (FU). 9 fields for each functional unit
Busy—Indicates whether the unit is busy or not
Op—Operation to perform in the unit (e.g., + or –)
Fi—Destination register
Fj, Fk—Source-register numbers
Qj, Qk—Functional units producing source registers Fj, Fk
Rj, Rk—Flags indicating when Fj, Fk are ready
3. Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register
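A minimal Python sketch of these three tables and of the issue check; the functional unit names and field values are illustrative, not a complete scoreboard implementation.

```python
# Minimal sketch of the three scoreboard data structures; field names follow
# the text above, but the contents are invented for illustration.
instruction_status = {}      # instruction -> one of: issue, read, exec, write

functional_unit_status = {
    "FPadd": {"Busy": False, "Op": None, "Fi": None, "Fj": None, "Fk": None,
              "Qj": None, "Qk": None, "Rj": False, "Rk": False},
    "FPmul": {"Busy": False, "Op": None, "Fi": None, "Fj": None, "Fk": None,
              "Qj": None, "Qk": None, "Rj": False, "Rk": False},
}

register_result_status = {}  # register -> functional unit that will write it (if any)

def can_issue(fu, dest):
    """Issue stalls on a structural hazard (busy FU) or a WAW hazard (pending writer)."""
    return (not functional_unit_status[fu]["Busy"]
            and dest not in register_result_status)

print(can_issue("FPadd", "F10"))   # True: the unit is free and no write to F10 is pending
```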
Detailed Scoreboard Pipeline Control (instruction status, wait condition, bookkeeping):
Issue — wait until: not Busy(FU) and not Result(D). Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU.
Read operands — wait until: Rj and Rk. Bookkeeping: Rj ← No; Rk ← No.
Execution complete — wait until: the functional unit is done. Bookkeeping: none.
Write result — wait until: ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) and (Fk(f) ≠ Fi(FU) or Rk(f) = No)). Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No.
Aug 2009 / Q9
With the help of the block diagram, explain the crossbar switch system organization for multiprocessors
Derive expressions for the speedup, efficiency, and throughput of a pipelined processor. What are their ideal values?
May 2009 / Q1