Page 1

Unit-4 Parallelism

Chapter - 9

Page 2

• Executing two or more operations at the same time is known as Parallelism.

Goals of Parallelism:

• The purpose of parallel processing is to speed up the computer's processing capability, in other words, to increase its computational speed.

• Increases throughput, i.e. amount of processing that can be accomplished during a given interval of time.

• Improves the performance of the computer for a given clock speed.

• Two or more ALUs in CPU can work concurrently to increase throughput.

• The system may have two or more processors operating concurrently.

Page 3

Exploitation of Concurrency:

Techniques of Concurrency:

• Overlap: execution of multiple operations by heterogeneous functional units.

• Parallelism: execution of multiple operations by homogeneous functional units.

Throughput Enhancement:

A computer's performance is measured by the time taken to execute a program. Program execution involves performing instruction cycles, which include two types of operations:

• Internal micro-operations: performed inside the hardware functional units such as the processor, memory, I/O, etc.

• Transfer of information: between different functional hardware units, for instruction fetch, operand fetch, I/O operations, etc.

Page 4

Types of Parallelism:

• Instruction Level Parallelism (ILP): Pipelining, Superscalar

• Processor Level Parallelism: Array Computer, Multiprocessor

Page 5

Instruction Pipeline (sec-9.4)

• An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments.

• The computer needs to process each instruction with the following sequence of steps:

1. Fetch the instruction from memory
2. Decode the instruction
3. Calculate the effective address
4. Fetch the operands from memory
5. Execute the instruction
6. Store the result in the proper place

Page 6

Four-Segment CPU Pipeline (flowchart):

• Segment 1: Fetch instruction from memory.

• Segment 2: Decode the instruction and calculate the effective address. If the instruction is a branch, update the PC, empty the pipe, and resume fetching.

• Segment 3: Fetch the operand from memory.

• Segment 4: Execute the instruction. If an interrupt is pending, empty the pipe, perform interrupt handling, and update the PC; otherwise update the PC and continue with the next fetch.

Page 7

Timing of Instruction Pipeline

Step            1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction 1   FI  DA  FO  EX
Instruction 2       FI  DA  FO  EX
Instruction 3           FI  DA  FO  EX
Instruction 4               FI  --  --  FI  DA  FO  EX
Instruction 5                   --  --  --  FI  DA  FO  EX
Instruction 6                               FI  DA  FO  EX
Instruction 7                                   FI  DA  FO  EX

(FI = fetch instruction, DA = decode and calculate effective address, FO = fetch operand, EX = execute. Instruction 3 is a branch, so instructions 4 and 5 are delayed until the branch completes in step 6.)
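
Not from the slides: a minimal Python sketch that reproduces this kind of timing chart for an ideal four-stage pipeline with no branch or data stalls. With 7 instructions, the last one completes in cycle 4 + 7 - 1 = 10.

    STAGES = ["FI", "DA", "FO", "EX"]

    def pipeline_timing(num_instructions):
        """For each instruction, map clock cycle -> pipeline stage (ideal, no stalls)."""
        rows = []
        for i in range(num_instructions):
            # instruction i enters FI in cycle i+1 and advances one stage per cycle
            rows.append({i + s + 1: STAGES[s] for s in range(len(STAGES))})
        return rows

    rows = pipeline_timing(7)
    last = max(max(r) for r in rows)
    print("Instr " + "".join(f"{c:>4}" for c in range(1, last + 1)))
    for n, row in enumerate(rows, start=1):
        print(f"{n:>5} " + "".join(f"{row.get(c, ''):>4}" for c in range(1, last + 1)))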

Page 8

Pipeline Conflicts

• Resource conflicts caused by access to memory by two segments at the same time. These may be resolved by using separate instruction and data memories.

• Data Dependency conflicts arise when an instruction depends on the result of a previous instruction, but this result is not yet available.

• Branch Difficulties arise from branch and other instructions that change the value of PC.
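
Not part of the slides: a small Python sketch of how a read-after-write (RAW) data dependency between two instructions, the kind of conflict described above, can be detected from a toy instruction representation.

    from collections import namedtuple

    # Toy instruction format: an opcode, a destination register, and source registers.
    Instr = namedtuple("Instr", ["op", "dest", "srcs"])

    def has_raw_dependency(earlier, later):
        """True if `later` reads a register that `earlier` writes (read-after-write)."""
        return earlier.dest in later.srcs

    prog = [
        Instr("ADD", "R1", ("R2", "R3")),   # R1 <- R2 + R3
        Instr("SUB", "R4", ("R1", "R5")),   # reads R1 before the ADD has written it back
    ]
    print(has_raw_dependency(prog[0], prog[1]))   # True: the pipeline must stall or forward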

Page 9

Instruction-level parallelism (ILP)

• Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously.

• Micro-architectural techniques that are used to exploit ILP include:

  Instruction pipelining, where the execution of multiple instructions can be partially overlapped.

  Superscalar execution, in which multiple execution units are used to execute multiple instructions in parallel. In typical superscalar processors, the instructions executing simultaneously are adjacent in the original program order.

Page 10

• A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor.

• It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate.

• A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor.

• Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.

• While a superscalar CPU is typically also pipelined, pipelining and superscalar architecture are considered different performance-enhancement techniques.

Page 11

The superscalar technique is associated with several identifying characteristics (within a given CPU core):

  Instructions are issued from a sequential instruction stream.

  CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time).

  The CPU accepts multiple instructions per clock cycle.

• Available performance improvement from superscalar techniques is limited by three key areas:

  1. The degree of intrinsic parallelism in the instruction stream, i.e. the limited amount of instruction-level parallelism,

  2. The complexity and time cost of the dispatcher and associated dependency checking logic, and

  3. The branch instruction processing.

Page 12

Processor Level Parallelism

• Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system.

• The term also refers to the ability of a system to support more than one processor and/or the ability to allocate tasks between them.

• Multiprocessing sometimes refers to the execution of multiple concurrent software processes in a system as opposed to a single process at any one instant.

• The terms multitasking or multiprogramming are more appropriate to describe this concept, which is implemented mostly in software, whereas multiprocessing is more appropriate to describe the use of multiple hardware CPUs.

• A system can be both multiprocessing and multiprogramming, only one of the two, or neither of the two.

• In a multiprocessing system, all CPUs may be equal, or some may be reserved for special purposes.

Page 13

• In multiprocessing, the processors can be used to execute a single sequence of instructions in multiple contexts.

• In a single instruction stream, single data stream or SISD machine, one processor sequentially processes instructions; each instruction processes one data item.

• Single-instruction, multiple-data or SIMD is often used in vector processing.

• Multiple sequences of instructions in a single context: multiple-instruction, single-data or MISD, used to describe pipelined processors.

• Multiple sequences of instructions in multiple contexts: multiple-instruction, multiple-data or MIMD.

Page 14

Amdahl's law

• Amdahl's law, also known as Amdahl's argument, is named after computer architect Gene Amdahl, and is used to find the maximum expected improvement to an overall system when only part of the system is improved.

• It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.

• The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.

• Amdahl's law is a model for the relationship between the expected speedup of parallelized implementations of an algorithm relative to the serial algorithm, under the assumption that the problem size remains the same when parallelized.

• The law is concerned with the speedup achievable from an improvement to a computation that affects a proportion P of that computation where the improvement has a speedup of S. (For example, if an improvement can speed up 30% of the computation, P will be 0.3; if the improvement makes the portion affected twice as fast, S will be 2.)

Page 15

• Amdahl's law states that the overall speedup of applying the improvement will be:

Old Running Time = 1

New Running Time (1-P)+P/S

• To see how this formula was derived, assume that the running time of the old computation was 1, for some unit of time. The running time of the new computation will be the length of time the unimproved fraction takes, (1 − P), plus the length of time the improved fraction takes.

• The length of time for the improved part of the computation is the length of the improved part's former running time divided by the speedup, making the length of time of the improved part P/S. The final speedup is computed by dividing the old running time by the new running time, which is what the above formula does.

• In the case of parallelization, Amdahl's law states that if P is the proportion of a program that can be made parallel (i.e. benefit from parallelization), and (1 − P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is:

  Speedup(N) = 1 / ((1 - P) + P/N)

• P can be estimated by using the measured speedup SU on a specific number of processors NP:

  P_estimated = (1/SU - 1) / (1/NP - 1)

• P estimated in this way can then be used in Amdahl's law to predict the speedup for a different number of processors.
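
Not from the slides: a small worked example in Python of the two formulas above. The numbers (a measured 3x speedup on 4 processors) are chosen only for illustration.

    def amdahl_speedup(p, s):
        """Overall speedup when a fraction p of the work is made s times faster."""
        return 1.0 / ((1.0 - p) + p / s)

    def estimate_p(measured_speedup, num_processors):
        """Estimate the parallel fraction P from a measured speedup SU on NP processors."""
        return (1.0 / measured_speedup - 1.0) / (1.0 / num_processors - 1.0)

    # 30% of the computation is improved and that part becomes twice as fast:
    print(amdahl_speedup(0.3, 2))     # ~1.176

    # A program that runs 3x faster on 4 processors implies a parallel fraction of:
    p = estimate_p(3.0, 4)            # ~0.889
    # ...which Amdahl's law then uses to predict the speedup on, say, 16 processors:
    print(amdahl_speedup(p, 16))      # ~6.0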

Page 17

Basic Page Replacement

1. Find the location of the desired page on disk

2. Find a free frame:
   - If there is a free frame, use it.
   - If there is no free frame, use a page replacement algorithm to select a victim frame.

3. Bring the desired page into the (newly) free frame; update the page and frame tables

4. Restart the process

Page 18

Page Replacement : OPT (Optimal Policy)

• Replace page that will not be used for longest period of time. It requires future knowledge.

• 3 frames example

• Advantage: Reduces page faults.

• Disadvantage: It is difficult to implement, as future knowledge of the reference string is required.
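
Not from the slides: a minimal Python sketch of the OPT policy. On each fault the victim is the resident page whose next use lies farthest in the future (or that is never used again). The reference string is the common 3-frame textbook example, used only for illustration.

    def opt_page_faults(reference, num_frames):
        """Count page faults under the optimal (OPT) replacement policy."""
        frames = []
        faults = 0
        for i, page in enumerate(reference):
            if page in frames:
                continue                      # hit: no replacement needed
            faults += 1
            if len(frames) < num_frames:
                frames.append(page)           # a free frame is still available
                continue
            future = reference[i + 1:]
            # victim = resident page not needed for the longest time in the future
            def next_use(p):
                return future.index(p) if p in future else float("inf")
            victim = max(frames, key=next_use)
            frames[frames.index(victim)] = page
        return faults

    ref = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
    print(opt_page_faults(ref, 3))    # 9 page faults with 3 frames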

Page 19

CAO Model Question Paper – Unit 2

Q:1 Discuss hardwired control design method.

Q:2 Explain the concept of Micro programmed Sequencer.

Q:3 Explain the five-stage instruction cycle with the help of a flowchart.

Q:4 What is Content Addressable Memory? Explain the match logic in CAM.

Q:5 What do you mean by “Next Address Generator” present in a Microprogrammed Control Unit? Explain briefly along with its block diagram.

Q:6 Explain 2D RAM organization with a suitable diagram. (J. P. Hayes, Sec. 6.1.2, Fig. 6.8, Fig. 6.13)

Page 20

CAO Model Question Paper – Unit 3

Q: 1 Discuss the principle of Locality of Reference associated with the Memory Hierarchy.

Q: 2 What are the reasons for using Virtual memory? Distinguish between Paging and Segmentation.

Q: 3 What are LRU and OPT Policies of page replacement? Compare them.

Q: 4 Explain working of a cache memory. What are the relative advantages and disadvantages of direct and Associative mapping of cache memory?

Page 21

Modes of transfer (11.4)

• Data transfer between the central computer and the I/O devices may be handled in a variety of modes.

• The modes of transfer are:

1. Programmed I/O.

2. Interrupt-initiated I/O

3. Direct memory access (DMA)

Page 22

Programmed I/O

• Programmed I/O operations are the result of I/O instructions written in the computer program.

• Each data item transfer is initiated by an instruction in the program.

• Usually, the transfer is to and from a CPU register and the peripheral.

• Other instructions are needed to transfer the data between the CPU and memory.

• Transferring data under program control requires constant monitoring of the peripheral by the CPU.

• Once a data transfer is initiated, the CPU is required to monitor the interface to see when a transfer can again be made.

• It is up to the programmed instructions executed in the CPU to keep close tabs on everything that is taking place in the interface unit and the I/O device.

• In this method, the CPU stays in a program loop until the I/O unit indicates that it is ready for data transfer.

• This is a time-consuming process since it keeps the processor busy needlessly.
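
Not from the slides: a minimal Python sketch of the busy-wait polling idea, using a simulated device object in place of real hardware status and data registers (SimulatedDevice and its delay are purely illustrative).

    class SimulatedDevice:
        """Stand-in for an I/O interface: a status flag plus a data register."""
        def __init__(self, data, delay=3):
            self._data = list(data)
            self._delay = delay
            self._polls = 0

        def ready(self):
            # the device has a new item only every `delay` status polls,
            # mimicking a peripheral that is much slower than the CPU
            self._polls += 1
            return bool(self._data) and self._polls % self._delay == 0

        def read(self):
            return self._data.pop(0)

    def programmed_io_read(device, count):
        """CPU-driven transfer: busy-wait on the status flag, then move one item."""
        buffer = []
        for _ in range(count):
            while not device.ready():     # the CPU stays in this loop, polling the status
                pass
            buffer.append(device.read())  # transfer one data item under program control
        return buffer

    dev = SimulatedDevice([10, 20, 30])
    print(programmed_io_read(dev, 3))     # [10, 20, 30]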

Page 23

Interrupt Initiated IO

• In the programmed I/O method, the CPU stays in a program loop until the I/O unit indicates that it is ready for data transfer.

• This is a time-consuming process since it keeps the processor busy needlessly.

• It can be avoided by using the interrupt facility and special commands to inform the interface to issue an interrupt request signal when the data are available from the device.

• In the meantime, the CPU can proceed to execute another program.

• The interface meanwhile keeps monitoring the device.

• When the interface determines that the device is ready for the data transfer, it generates an interrupt request to the computer.

• Upon detecting the external interrupt signal, the CPU momentarily stops the task it is processing, branches to a service program to process the I/O transfer, and then returns to the task it was originally performing.

Page 24

Types of Interrupt

1. External interrupts
2. Internal interrupts
3. Software interrupts

• External interrupts come from I/O devices, from a timing device, from a circuit monitoring the power supply, or from any other external source. For example: Timeout interrupt

• Internal interrupts arise from illegal or erroneous use of an instruction or data. Internal interrupts are also called traps. For example, attempt to divide by zero.

Page 25

The difference between internal and external interrupts:

• An internal interrupt is initiated by some exceptional condition caused by the program itself rather than by an external event.

• External interrupts depend on external conditions that are independent of the program.

• Software Interrupt: A software interrupt is initiated by executing an instruction. A software interrupt is a special call instruction that behaves like an interrupt rather than a subroutine call. The most common use of a software interrupt is associated with a supervisor call instruction. This instruction provides a means for switching the CPU from user mode to supervisor mode.

Page 26

DMA (Direct Memory Access – 11.6)

• Direct memory access is an I/O technique used for high speed data transfer.

• In DMA, the interface transfers data into and out of the memory unit through the memory bus.

• In DMA, the CPU releases the control of the buses to a device called a DMA controller.

• Removing the CPU from the path and letting the peripheral device manage the memory buses directly would improve the speed of transfer.

• The CPU initiates the transfer by supplying the interface with the starting address and the number of words needed to be transferred and then proceeds to execute other tasks.

• When the transfer is made, the DMA requests memory cycles through the memory bus.

• When the request is granted by the memory controller, the DMA transfers the data directly into memory.

• The CPU merely delays its memory access operation to allow the direct memory I/O transfer.

Page 27

• Many computers combine the interface logic with the requirements for direct memory access into one unit and call it an I/O processor ( IOP).

• A DMA controller takes over the memory buses to manage the transfer directly between the I/O device and memory using 2 special control signals BR And BG.

• The BR (bus request) signal is sent by the DMA controller to the CPU to request control of the memory buses.

• The CPU then activates the BG (bus grant) signal to inform the external DMA controller that the buses are in a high-impedance state.

• The DMA controller then takes control of the memory buses.

• DMA controller: It needs the usual circuits of an interface to communicate with the CPU and the I/O device.
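
Not from the slides: a toy Python model of the sequence described above and on page 26: the CPU loads the DMA controller's address register and word count register, starts the transfer, and the controller then moves the data into memory without further CPU involvement. SimulatedDMAController and its attributes are illustrative names, not real hardware registers.

    class SimulatedDMAController:
        """Toy DMA controller holding an address register and a word count register."""
        def __init__(self, memory, device_data):
            self.memory = memory             # the 'memory' the controller writes into
            self.device_data = device_data   # data waiting in the I/O device
            self.address_register = 0        # starting memory address for the transfer
            self.word_count_register = 0     # number of words to transfer

        def start(self):
            # transfer word_count words from the device into consecutive memory
            # locations, one memory cycle per word, with no CPU involvement
            for i in range(self.word_count_register):
                self.memory[self.address_register + i] = self.device_data[i]

    memory = [0] * 16
    dma = SimulatedDMAController(memory, device_data=[11, 22, 33, 44])
    dma.address_register = 4        # CPU supplies the starting address...
    dma.word_count_register = 4     # ...and the number of words to transfer
    dma.start()                     # then proceeds with other tasks while DMA transfers
    print(memory)                   # words 11, 22, 33, 44 now occupy addresses 4..7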

Page 28

CPU bus signals for DMA Transfer

[Figure: the CPU with its bus request (BR) input, bus grant (BG) output, address bus (ABUS), data bus (DBUS), and read (RD) / write (WR) control lines.]

Page 29

Block diagram of DMA Controller

[Figure: the DMA controller contains data bus buffers, address bus buffers, an address register, a word count register, a control register, and control logic. Its signals include DS (DMA select), RS (register select), RD, WR, BR (bus request), BG (bus grant), and Interrupt on the CPU side, and DMA Request / DMA Acknowledge lines to the I/O device; it connects to the system data bus and address bus.]

Page 30

CAO Model Question Paper – Unit 4

Q:1 State Amdahl’s law. Discuss its significance.

Q:2 Explain 3 major types of instructions.

Q:3 Explain the concept of Memory Hierarchy.

Q:4 Write a note on Computer Registers

Q:5 Short Note on:

1. DMA

2. Priority Interrupt

3. Superscalar Processing