CO_unit3

7/29/2019 CO_unit3


[ II - IT- II semester Computer Organization -- Unit-3 ]

R.Veeranjaneyulu M.Tech PACE Institute of Technology & Sciences, Ongole Page 1

COMPUTER ORGANIZATION

UNIT -3

Instruction Pipelining, Pipelining Hazards, Dealing with Branches, 8086 Processor Family, Reduced Instruction Set Computers: Instruction Execution Characteristics, Large Register Files, RISC Architecture


Instruction pipelining:

Pipelining is a technique of decomposing a sequential process into sub-operations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.

Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.

An instruction pipeline operates on a stream of instructions by overlapping the fetch, decode and execute phases of the instruction cycle.

The pipeline organization of a CPU is similar to an assembly line: the work to be done in an instruction is broken into smaller steps (pieces), each of which takes a fraction of the time needed to complete the entire instruction. Each of these steps is a pipe stage (or a pipe segment).

Pipe stages are connected to form a pipe:

The time required to move an instruction from one stage to the next is a machine cycle (often this is one clock cycle).

The execution of one instruction takes several machine cycles as it passes through the pipeline.

Two stage pipeline:

FI: fetch instruction
EI: execute instruction

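We consider that each instruction takes execution time Tex, so each of the two stages takes Tex/2.

Execution time for the 7 instructions, without pipelining: 7*Tex

Execution time for the 7 instructions, with pipelining: the first instruction needs 2 stage times and each of the remaining 6 instructions completes one stage time later, giving 8 stage times in total:

(Tex/2)*8 = 4*Tex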
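The arithmetic above can be checked with a small sketch. It assumes an ideal pipeline with equal stage times and no stalls; the function names are illustrative and are not part of the original notes.

```python
def pipelined_time(num_stages, num_instructions, stage_time):
    """Total time for an ideal pipeline with equal stage times and no stalls."""
    # The first instruction needs num_stages stage times; every later
    # instruction completes one stage time after its predecessor.
    return (num_stages + num_instructions - 1) * stage_time


def sequential_time(num_stages, num_instructions, stage_time):
    """Total time when each instruction finishes before the next one starts."""
    return num_stages * num_instructions * stage_time


T_EX = 1.0  # execution time of one instruction, in arbitrary units

print(pipelined_time(2, 7, T_EX / 2))   # 4.0, i.e. 4*Tex
print(sequential_time(2, 7, T_EX / 2))  # 7.0, i.e. 7*Tex
```

With these numbers the two-stage pipeline finishes the 7 instructions in 4*Tex instead of 7*Tex.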


Working of Instruction Pipelining:

An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments.

This causes the instruction fetch and execute phases to overlap and perform simultaneous operations.

When a branch instruction is encountered, the pipeline must be emptied and all the instructions that have been read from memory after the branch instruction must be discarded.

A greater number of stages can provide better performance, although the gain is limited by the overhead and hazards that additional stages introduce.

Six stage pipeline:

FI: fetch instruction
DI: decode instruction
CO: calculate operand address
FO: fetch operand
EI: execute instruction
WO: write operand

Branch in a Pipeline:


Pipeline performance:

Pipeline performance is measured in terms of the time taken to execute a program.

Consider a non-pipelined unit that performs a given task in a time equal to tn. For n tasks it needs n*tn, while a pipeline of k segments needs (k + n - 1)*tp.

The speedup of pipelined processing over equivalent non-pipelined processing is defined by the ratio:

S = (n * tn) / ((k + n - 1) * tp)

where
k = number of segments in the pipeline
tp = time taken by each segment to process a sub-operation
n = number of tasks
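A short sketch of the speedup ratio, using the symbols k, tp, n and tn defined above. The default tn = k*tp is an assumption made here for illustration (the non-pipelined unit is taken to need as long as all k segments in series); the function name is hypothetical.

```python
def speedup(k, n, tp, tn=None):
    """Speedup of a k-segment pipeline over a non-pipelined unit for n tasks."""
    if tn is None:
        # Assumption: one task on the non-pipelined unit takes k segment times.
        tn = k * tp
    return (n * tn) / ((k + n - 1) * tp)


# With k = 4 and tp = 20, the speedup grows with n and stays below tn/tp = 4.
for n in (1, 10, 100, 10_000):
    print(n, round(speedup(4, n, 20), 3))
```

As n becomes large, (k + n - 1) approaches n, so the speedup approaches tn/tp, which equals k under the assumption above.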

Problems with Pipeline:

A greater number of stages increases the overhead in moving information between stages and in synchronization between stages.

The complexity of the CPU grows with the number of stages. It is difficult to keep a large pipeline at maximum rate because of pipeline hazards.

Pipelining Hazards:

Pipeline hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. The instruction is said to be stalled. When an instruction is stalled, all instructions later in the pipeline than the stalled instruction are also stalled. Instructions earlier than the stalled one can continue. No new instructions are fetched during the stall.

Types of hazards:

1. Structural hazards

2. Data hazards

3. Control hazards

Structural Hazards:

Structural hazards occur when a certain resource (memory, functional unit) is requested by more than one instruction at the same time.

Example: Instruction ADD R4,X fetches operand X from memory in its FO stage. The memory does not accept another access during that cycle.

Penalty: 1 cycle

Solutions: Certain resources are duplicated in order to avoid structural hazards.

Functional units (ALU, FP unit) can be pipelined themselves in order to support several instructions at a time.

A classical way to avoid hazards at memory access is by providing separate data and instruction caches.


Data Hazards:

This conflict arises when an instruction depends on the result of a previous instruction, but this result is not yet available.

We have two instructions, I1 and I2. In a pipeline the execution of I2 can start before I1 has terminated. If in a certain stage of the pipeline, I2 needs the result produced by I1, but this result has not yet been generated, we have a data hazard.

Example:

Before executing its FO stage, the ADD instruction is stalled until the MUL instruction has written the result into R2.

Penalty: 2 cycles

Solutions:

The problem of data dependency can be solved in the following ways.

1. Operand forwarding: The hardware avoids the conflict by routing the data through special paths between pipeline segments.

2. Through the compiler: The compiler inserts no-operation (NOP) instructions into the program to delay the dependent instruction.

After the EI stage of the MUL instruction the result is available by forwarding. The penalty is reduced to one cycle.
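The compiler-based solution can be sketched as follows. The instruction operands used here (MUL writing R2, ADD reading R2) are hypothetical, chosen only to match the description above, and the `Instr` tuple and both helper functions are illustrative names. The stall counts of 2 (no forwarding) and 1 (with forwarding) are the penalties quoted in these notes.

```python
from collections import namedtuple

# op: mnemonic, dest: register written (or None), srcs: registers read
Instr = namedtuple("Instr", "op dest srcs")


def raw_hazards(program):
    """Return (producer, consumer, register) for adjacent read-after-write pairs."""
    hazards = []
    for i in range(len(program) - 1):
        producer, consumer = program[i], program[i + 1]
        if producer.dest is not None and producer.dest in consumer.srcs:
            hazards.append((i, i + 1, producer.dest))
    return hazards


def insert_nops(program, stall_cycles):
    """Software fix: place stall_cycles NOPs in front of each dependent instruction."""
    out = []
    for idx, instr in enumerate(program):
        prev = program[idx - 1] if idx > 0 else None
        if prev is not None and prev.dest is not None and prev.dest in instr.srcs:
            out.extend([Instr("NOP", None, ())] * stall_cycles)
        out.append(instr)
    return out


# Hypothetical two-instruction program: MUL produces R2, ADD consumes it.
program = [Instr("MUL", "R2", ("R2", "R3")), Instr("ADD", "R1", ("R1", "R2"))]

print(raw_hazards(program))                       # [(0, 1, 'R2')]
print([i.op for i in insert_nops(program, 2)])    # ['MUL', 'NOP', 'NOP', 'ADD']
print([i.op for i in insert_nops(program, 1)])    # ['MUL', 'NOP', 'ADD']
```

The sketch only looks at adjacent instructions; a real compiler or hazard unit would check every instruction still in flight.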

Control Hazards:

Control hazards are produced as a consequence of branch instructions.

Unconditional branch:

BR TARGET

TARGET _______

After the FO stage of the branch instruction the address of the target is known and it can be fetched.


Conditional branch:

Handling branch difficulties: The methods used are

(i) Prefetch target instructions
(ii) Use of branch target buffer
(iii) Use of loop buffer
(iv) Branch prediction
(v) Delayed branch


Dealing with Branches:

A number of techniques can be used to minimize the impact of the branch instruction, i.e. the branch penalty. These are:

Multiple streams
Prefetch branch target
Loop buffer
Branch prediction
Delayed branching

Multiple Streams:

Replicate the initial portions of the pipeline and fetch both possible next instructions
Have two pipelines
Prefetch each branch into a separate pipeline
Use appropriate pipeline
Increases chance of memory contention
Must support multiple streams for each instruction in the pipeline

Prefetch Branch Target:

Target of branch is prefetched in addition to instructions following branch
Keep target until branch is executed
Used by IBM 360/91

Loop buffer:

A loop buffer is a small, very high speed memory maintained by the instruction fetch stage of the pipeline and containing the n most recently fetched instructions in sequence.
It acts as a look-ahead, look-behind buffer.
If the branch is to be taken, the hardware first checks whether the branch target is within the buffer. If so, the next instruction is fetched from the buffer.

Benefits of Loop Buffer:

With the use of prefetching, instructions fetched in sequence are available without the usual memory access time.
If the branch occurs to a target just a few locations ahead of the address of the branch instruction, the target is already in the buffer.
Very good for small loops or jumps.
If the buffer is big enough, an entire loop can be held in it, reducing the branch penalty.
c.f. cache
Used by CRAY-1
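The effect of the buffer size on a small loop can be illustrated with a simplified model. This is only a sketch: it tracks instruction addresses in a fixed-size queue of the most recent fetches, which is an assumption and not how any particular machine organizes its buffer. The class and function names and the addresses are made up for the example.

```python
from collections import deque


class LoopBuffer:
    """Simplified loop buffer holding the addresses of the most recent fetches."""

    def __init__(self, size):
        # deque with maxlen drops the oldest entry automatically when full
        self.addresses = deque(maxlen=size)

    def fetch(self, address):
        """Return True if the instruction is served from the buffer."""
        hit = address in self.addresses
        if not hit:
            self.addresses.append(address)  # fetched from memory, then buffered
        return hit


def count_hits(buffer_size, loop_start, loop_length, iterations):
    """Run a loop of loop_length instructions and count fetches served by the buffer."""
    buf = LoopBuffer(buffer_size)
    hits = 0
    for _ in range(iterations):
        for address in range(loop_start, loop_start + loop_length):
            hits += buf.fetch(address)
    return hits


print(count_hits(8, 100, 4, 3))  # 8: the whole loop fits, only the first pass misses
print(count_hits(2, 100, 4, 3))  # 0: the loop is larger than the buffer
```

When the buffer is at least as large as the loop body, every iteration after the first is served without a memory access, which is the benefit described above.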


Branch Prediction:

Make a good guess as to which instruction will be executed next and start that one down the pipeline.

If the guess turns out to be right, there is no loss of performance in the pipeline.
If the guess was wrong, empty the pipeline and restart with the correct instruction, suffering the full branch penalty.

Static guesses: make the guess without considering the runtime history of the program
  Predict never taken
  Predict always taken
  Predict based on the opcode

Dynamic guesses: track the history of conditional branches in the program
  Taken / not taken switch
  History table

Predict never taken:

Assume that jump will not happen
Always fetch next instruction
68020 & VAX 11/780
VAX will not prefetch after branch if a page fault would result (O/S v CPU design)

Predict always taken:

Assume that jump will happen
Always fetch target instruction

Predict by Opcode:

Some instructions are more likely to result in a jump than others
Can get up to 75% success

Taken/Not taken switch:

Based on previous history
Good for loops
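A taken/not taken switch can be sketched as a small saturating counter per branch. This is one common way to build such a switch and is an assumption here; the state diagram used in these notes may differ in detail. The function name, the starting state and the branch pattern are illustrative.

```python
def correct_predictions(outcomes, bits):
    """Count correct predictions made by a saturating counter with the given width."""
    max_state = (1 << bits) - 1
    threshold = (max_state + 1) // 2   # states at or above this predict "taken"
    state = max_state                  # assumption: start in the strongest "taken" state
    correct = 0
    for taken in outcomes:
        predict_taken = state >= threshold
        correct += (predict_taken == taken)
        # Move one step towards the actual outcome, saturating at both ends.
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return correct


# A loop-closing branch: taken 9 times, then not taken once; the loop runs twice.
outcomes = ([True] * 9 + [False]) * 2

print(correct_predictions(outcomes, 1))  # 17 of 20
print(correct_predictions(outcomes, 2))  # 18 of 20
```

The one-bit switch mispredicts twice per loop execution after the first (on exit and on re-entry), while the two-bit version mispredicts only on exit, which is why history-based schemes suit loops.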

Branch Prediction Flowchart:


Branch Prediction State Diagram

Dealing With Branches:


Delayed branch:

Minimize the branch penalty by finding valid instructions to execute in the pipeline while the branch address is being resolved.

The compiler is tasked with reordering the instruction sequence to find enough independent instructions (with respect to the conditional branch) to feed into the pipeline after the branch so that the branch penalty is reduced to zero.

Consider the sequence:

Instruction x
Instruction x+1
Instruction x+2
Conditional branch

Do not take jump until you have to
Rearrange instructions
Implemented on many RISC architectures


8086 Processor Family:

8086 Register Organization:

Intel 8086 was the first 16-bit microprocessor introduced by Intel in 1978.

The register organization described here is that of the later 32-bit members of the family. It includes the following types of registers.

1. General Purpose:

There are eight 32-bit general purpose registers.
Used for all types of x86 instructions.
Can also hold operands for address calculations.
String instructions use the contents of the ECX, ESI and EDI registers.
In 64-bit mode there are sixteen 64-bit general purpose registers.

2. Segment:

The six 16-bit segment registers contain segment selectors, which index into segment tables.
The Code Segment (CS) register references the segment containing the instruction being executed.
The Stack Segment (SS) register references the segment containing a user-visible stack.
The remaining segment registers (DS, ES, FS, GS) enable the user to reference up to four separate data segments at a time.

3. FLAGS: The 32-bit EFLAGS register contains the condition codes and various mode bits.

4. Instruction Pointer: Contains the address of the current instruction.

5. Numeric:

Each register holds an extended-precision 80-bit floating point number.
There are 8 registers that function as a stack, with push and pop operations available in the instruction set.

6. Control:

The 16-bit control register contains bits that control the operation of the floating point unit.
It includes rounding, exception and precision controls.

7. Status:

The 16-bit status register contains bits that reflect the current state of the floating point unit.
It includes a 3-bit pointer to the top of the stack.
Condition codes are also reported here.

8. Tag word:

This 16-bit register contains a 2-bit tag for each floating point numeric register, which indicates the nature of the contents of the corresponding register.
The four possible values are valid, zero, special and empty.
The tags enable programs to check the contents of a numeric register without performing complex decoding of the actual data in the register.
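The layout of the tag word (eight 2-bit fields in 16 bits) can be illustrated with a small decoder. The bit encoding used below (00 valid, 01 zero, 10 special, 11 empty, with register 0 in the lowest two bits) is the usual x87 convention and is stated here as an assumption, since the notes do not give it. The function name is illustrative.

```python
# Assumed encoding of each 2-bit tag (usual x87 convention).
TAG_MEANINGS = {0b00: "valid", 0b01: "zero", 0b10: "special", 0b11: "empty"}


def decode_tag_word(tag_word):
    """Return the tag of each of the 8 numeric registers, register 0 first."""
    return [TAG_MEANINGS[(tag_word >> (2 * reg)) & 0b11] for reg in range(8)]


print(decode_tag_word(0xFFFF))  # every register tagged 'empty'
print(decode_tag_word(0xFFFC))  # register 0 'valid', the other seven 'empty'
```

A program can therefore test whether a register holds a usable number by inspecting two bits instead of decoding the full 80-bit value.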


EFLAGS Register:

There is a special register in the processor called EFLAGS. This register is 32 bits wide and most of those bits are used to track a variety of conditions in the processor. It includes the six condition codes (carry, parity, auxiliary, zero, sign, overflow), which report the results of integer operations.


Trap Flag (TF): When set, causes an interrupt after the execution of each instruction. Used for debugging.

Interrupt Enable Flag (IF): When set, the processor will recognize external interrupts.

Direction Flag (DF): Determines whether string processing instructions increment or decrement the index registers.

I/O Privilege Level (IOPL): Used in protected mode. This 2-bit field selects one of four privilege levels.

Resume Flag (RF): Enables you to turn off certain exceptions while debugging code.

Identification Flag (ID): If this bit can be set and cleared, then the processor supports the CPUID instruction, which provides information about vendor, family and model.

Nested Task Flag (NT): Indicates that the current task is nested within another task in protected mode.

Virtual Mode (VM): Allows the programmer to enable or disable virtual 8086 mode.

Virtual Interrupt Flag (VIF) and Virtual Interrupt Pending (VIP) are used in a multitasking environment.
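As a sketch of how software reads these flags, the following decoder extracts the named bits from a 32-bit EFLAGS value. The bit positions are the standard x86 ones and are not given in these notes; the function name and the sample value are illustrative.

```python
# Standard x86 bit positions of the single-bit flags in EFLAGS.
EFLAGS_BITS = {
    "CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7,
    "TF": 8, "IF": 9, "DF": 10, "OF": 11,
    "NT": 14, "RF": 16, "VM": 17, "VIF": 19, "VIP": 20, "ID": 21,
}


def decode_eflags(value):
    """Return (set of flag names that are set, 2-bit IOPL field)."""
    flags = {name for name, bit in EFLAGS_BITS.items() if (value >> bit) & 1}
    iopl = (value >> 12) & 0b11  # IOPL occupies bits 12 and 13
    return flags, iopl


flags, iopl = decode_eflags(0x00000246)
print(sorted(flags), iopl)  # ['IF', 'PF', 'ZF'] 0
```

The sample value has bits 1, 2, 6 and 9 set; bit 1 is a reserved bit, so only PF, ZF and IF are reported.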

Control Registers:

MMX Registers:

MMX uses several 64 bit data types
Use 3 bit register address fields
  8 registers
No MMX specific registers
  Aliasing to lower 64 bits of existing floating point registers


Interrupt Processing:

Interrupt processing within a processor is a facility provided to support the operating system. It allows an application program to be suspended so that a variety of interrupt conditions can be serviced, and later resumed.

Interrupts & Exceptions:

An interrupt is generated by a signal from hardware, and it may occur at random times during the execution of a program.

An exception is generated from software and it is provoked by the execution of an instruction.

There are two sources of interrupts and two sources of exceptions.

Interrupts:

Maskable: Received on the processor's INTR pin. The processor does not recognize a maskable interrupt unless the Interrupt Enable Flag (IF) is set.

Nonmaskable: Received on the processor's NMI pin. Recognition of such interrupts cannot be prevented.

Exceptions:

Processor detected: Results when the processor encounters an error while attempting to execute an instruction.

Programmed: These are instructions that generate an exception.

Interrupt vector table:

Each interrupt type is assigned a number
The number is used as an index into the vector table
256 * 32 bit interrupt vectors
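Indexing the vector table by interrupt type can be sketched as below. The sketch assumes the real-mode layout, in which the table starts at address 0 and each 32-bit vector holds a 16-bit offset followed by a 16-bit segment; that layout, the function names and the sample values are assumptions for illustration and are not taken from these notes.

```python
import struct

VECTOR_SIZE = 4     # bytes per vector (32 bits)
NUM_VECTORS = 256


def vector_address(interrupt_type):
    """Byte address of the vector for the given interrupt type."""
    if not 0 <= interrupt_type < NUM_VECTORS:
        raise ValueError("interrupt type must be in the range 0..255")
    return interrupt_type * VECTOR_SIZE


def read_vector(table, interrupt_type):
    """Return (segment, offset), assuming little-endian offset then segment."""
    offset, segment = struct.unpack_from("<HH", table, vector_address(interrupt_type))
    return segment, offset


table = bytearray(NUM_VECTORS * VECTOR_SIZE)
# Install a made-up handler address F000:1234 for interrupt type 0x21.
struct.pack_into("<HH", table, vector_address(0x21), 0x1234, 0xF000)

print(len(table))                       # 1024 bytes for the whole table
print(hex(vector_address(0x21)))        # 0x84
segment, offset = read_vector(table, 0x21)
print(f"{segment:04X}:{offset:04X}")    # F000:1234
```

The whole table occupies 256 * 4 = 1024 bytes, and the interrupt number alone is enough to locate the handler address.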

5 priority classes:

Class 1: Traps on the previous instruction
Class 2: External interrupts
Class 3: Faults from fetching the next instruction
Class 4: Faults from decoding the next instruction
Class 5: Faults on executing an instruction


RISC (Reduced Instruction Set Computers):

Major Advances in Computers:

The family concept
  IBM System/360 1964
  DEC PDP-8
  Separates architecture from implementation
Microprogrammed control unit
  Idea by Wilkes 1951
  Produced by IBM S/360 1964
Cache memory
  IBM S/360 model 85 1969
Solid State RAM
  (See memory notes)
Microprocessors
  Intel 4004 1971
Pipelining
  Introduces parallelism into fetch execute cycle
Multiple processors

Reduced Instruction Set Computer

Key features

Large number of general purpose registers, or use of compiler technology to optimize register use
Limited and simple instruction set
Emphasis on optimising the instruction pipeline

Instruction Execution Characteristics:

Driving force for CISC:

Software costs far exceed hardware costs
Increasingly complex high level languages
Semantic gap
Leads to:
  Large instruction sets
  More addressing modes
  Hardware implementations of HLL statements, e.g. CASE (switch) on VAX

Intention of CISC:

Ease compiler writing
Improve execution efficiency
  Complex operations in microcode
Support more complex HLLs

Execution Characteristics:

Operations performed
Operands used
Execution sequencing


Studies have been done based on programs written in HLLs
Dynamic studies are measured during the execution of the program

Operations:

Assignments
  Movement of data
Conditional statements (IF, LOOP)
  Sequence control
Procedure call-return is very time consuming
Some HLL instructions lead to many machine code operations

Operands:

Mainly local scalar variables
Optimisation should concentrate on accessing local variables

Procedure Calls:

Very time consuming
Depends on number of parameters passed
Depends on level of nesting
Most programs do not do a lot of calls followed by lots of returns
Most variables are local (c.f. locality of reference)

Implications:

Best support is given by optimising most used and most time consuming features
Large number of registers
  Operand referencing
Careful design of pipelines
  Branch prediction etc.
Simplified (reduced) instruction set

Large Register File:

Software solution
  Require compiler to allocate registers
  Allocate based on most used variables in a given time
  Requires sophisticated program analysis
Hardware solution
  Have more registers
  Thus more variables will be in registers

Registers for Local Variables:

Store local scalar variables in registers
Reduces memory access
Every procedure (function) call changes locality
Parameters must be passed
Results must be returned
Variables from calling programs must be restored

Register Windows:

Only few parameters
Limited range of depth of call
Use multiple small sets of registers
Calls switch to a different set of registers


Returns switch back to a previously used set of registers
Three areas within a register set
  Parameter registers
  Local registers
  Temporary registers
Temporary registers from one set overlap parameter registers from the next
This allows parameter passing without moving data

Circular Buffer Organization of overlapped windows:

Operation of Circular Buffer :

When a call is made, a current window pointer is moved to show the currently active register window.

If all windows are in use, an interrupt is generated and the oldest window (the one furthest back in the call nesting) is saved to memory.

A saved window pointer indicates where the next saved windows should restore to.

Global Variables:

Allocated by the compiler to memory
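Returning to the circular buffer of register windows, its call and return behaviour can be sketched with a small model. This is a simplification under stated assumptions: every window can hold one procedure activation (real designs typically keep one window free because of the overlap), saved windows are only counted rather than stored, and the class and attribute names are illustrative.

```python
class RegisterWindows:
    """Simplified model of a circular buffer of register windows."""

    def __init__(self, num_windows):
        self.num_windows = num_windows
        self.cwp = 0            # current window pointer (index of the active window)
        self.in_registers = 1   # activations currently held in the register file
        self.saved = 0          # windows that have been saved to memory
        self.overflows = 0      # interrupts taken on call
        self.underflows = 0     # interrupts taken on return

    def call(self):
        if self.in_registers == self.num_windows:
            # All windows in use: the oldest window is saved to memory.
            self.saved += 1
            self.overflows += 1
        else:
            self.in_registers += 1
        self.cwp = (self.cwp + 1) % self.num_windows

    def ret(self):
        if self.in_registers > 1:
            self.in_registers -= 1
        elif self.saved > 0:
            # The caller's window was saved earlier and must be restored.
            self.saved -= 1
            self.underflows += 1
        else:
            raise RuntimeError("return without a matching call")
        self.cwp = (self.cwp - 1) % self.num_windows


windows = RegisterWindows(4)
for _ in range(5):
    windows.call()
for _ in range(5):
    windows.ret()
print(windows.overflows, windows.underflows, windows.cwp)  # 2 2 0
```

With four windows, a call chain five levels deep causes two saves and two restores; shallower call chains cause no memory traffic at all, which is the case the register window scheme is designed for.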


Why CISC:

Compiler simplification?
  Disputed
  Complex machine instructions harder to exploit
  Optimization more difficult
Smaller programs?
  Program takes up less memory, but memory is now cheap
  May not occupy fewer bits, just look shorter in symbolic form
    More instructions require longer op-codes
    Register references require fewer bits
Faster programs?
  Bias towards use of simpler instructions
  More complex control unit
  Microprogram control store larger, thus simple instructions take longer to execute
It is far from clear that CISC is the appropriate solution

RISC Characteristics:

One instruction per cycle
Register to register operations
Few, simple addressing modes
Few, simple instruction formats
Hardwired design (no microcode)
Fixed instruction format
More compile time/effort

RISC VS CISC


RISC Pipelining:

Most instructions are register to register
Two phases of execution
  I: Instruction fetch
  E: Execute
    ALU operation with register input and output
For load and store
  I: Instruction fetch
  E: Execute
    Calculate memory address
  D: Memory
    Register to memory or memory to register operation

Effects of Pipelining:

Optimization of Pipelining:

Delayed branch
  Does not take effect until after execution of following instruction
  This following instruction is the delay slot
Delayed load
  Register to be target is locked by processor
  Continue execution of instruction stream until register required
  Idle until load complete
  Re-arranging instructions can allow useful work whilst loading

Loop Unrolling


Replicate body of loop a number of times
Iterate loop fewer times
Reduces loop overhead
Increases instruction parallelism
Improved register, data cache or TLB locality

Example:

do i=2, n-1
  a[i] = a[i] + a[i-1] * a[i+1]
end do

Becomes

do i=2, n-2, 2
  a[i] = a[i] + a[i-1] * a[i+1]
  a[i+1] = a[i+1] + a[i] * a[i+2]
end do
if (mod(n-2,2) = 1) then
  a[n-1] = a[n-1] + a[n-2] * a[n]
end if
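The equivalence of the two loops can be checked with a direct Python transcription. This is a sketch: the list is padded so that indices 1..n match the pseudocode, the function names are illustrative, and the extra `n >= 3` guard is added only because Python's `%` operator handles negative values differently from `mod`.

```python
def original(values):
    """Rolled loop: i runs from 2 to n-1."""
    a = list(values)
    n = len(a) - 1                 # a[1..n] is the array; a[0] is unused padding
    for i in range(2, n):
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


def unrolled(values):
    """Loop unrolled by a factor of 2, with a clean-up step for an odd trip count."""
    a = list(values)
    n = len(a) - 1
    for i in range(2, n - 1, 2):   # i = 2, 4, ... up to n-2
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]
    if n >= 3 and (n - 2) % 2 == 1:
        # One iteration is left over when the number of iterations is odd.
        a[n - 1] = a[n - 1] + a[n - 2] * a[n]
    return a


for n in range(2, 12):
    data = list(range(n + 1))
    assert original(data) == unrolled(data)
print("unrolled loop matches the original for n = 2..11")
```

The unrolled version executes the loop test and increment about half as often, which is the reduction in loop overhead listed above.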

Use of Delayed Branch:

Assignment Questions

1. What is a pipeline register? What is the use of it? Explain in detail.
2. (a) Differentiate RISC and CISC computers.
   (b) Explain RISC pipelining.
3. Explain vector processing.
4. (a) What is a pipeline? Explain.
   (b) Explain arithmetic pipeline.
