Instruction pipelining


What is pipelining?

• Instruction pipelining improves CPU performance by overlapping the fetch and execution of successive instructions.

• The 8086 microprocessor has two blocks: the BIU (Bus Interface Unit) and the EU (Execution Unit).

• The BIU performs all bus operations, such as fetching instructions, reading and writing memory operands, and calculating the addresses of memory operands. The instruction bytes it fetches are transferred to the instruction queue.

• The EU executes instructions from the instruction byte queue.

• Both units operate asynchronously, giving the 8086 an overlapping instruction fetch and execution mechanism called pipelining.

INSTRUCTION PIPELINING

First stage fetches the instruction and buffers it.

When the second stage is free, the first stage passes it the buffered instruction.

While the second stage is executing the instruction, the first stage takes advantage of any unused memory cycles to fetch and buffer the next instruction.

This is called instruction prefetch or fetch overlap.

Inefficiency in two stage instruction pipelining

There are two reasons:

• The execution time will generally be longer than the fetch time. Thus the fetch stage may have to wait for some time before it can empty its buffer.

• When a conditional branch occurs, the address of the next instruction to be fetched is unknown. The execution stage then has to wait while the next instruction is fetched.

Two stage instruction pipelining

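[Figure: two-stage instruction pipeline (Fetch, Execute). Simplified view: an instruction passes through Fetch and Execute to produce a result. Expanded view: adds the wait, new address, and discard paths.]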

Decomposition of instruction processing

To gain further speedup, the pipeline must have more stages. Instruction processing can be decomposed into six stages:

• Fetch instruction (FI)
• Decode instruction (DI)
• Calculate operands, i.e. effective addresses (CO)
• Fetch operands (FO)
• Execute instruction (EI)
• Write operand (WO)

SIX STAGES OF INSTRUCTION PIPELINING

Fetch Instruction (FI)

Read the next expected instruction into a buffer.

Decode Instruction (DI)

Determine the opcode and the operand specifiers.

Calculate Operands (CO)

Calculate the effective address of each source operand.

Fetch Operands (FO)

Fetch each operand from memory. Operands in registers need not be fetched.

Execute Instruction (EI)

Perform the indicated operation and store the result.

Write Operand (WO)

Store the result in memory.

Timing diagram for instruction pipeline operation

High efficiency of instruction pipelining

The timing diagram assumes the following:

• All stages are of equal duration.
• Each instruction goes through all six stages of the pipeline.
• All the stages can be performed in parallel.
• There are no memory conflicts.
• All the accesses can occur simultaneously.

Under these assumptions, the instruction pipeline works very efficiently and gives high performance.
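The ideal timing diagram can be regenerated with a short sketch. It assumes the idealized conditions listed above (equal stage durations, no conflicts, no branches); the instruction count of 9 and the function names are illustrative choices, not taken from the slides. Under these assumptions, n instructions on a k-stage pipeline finish in k + (n - 1) time units instead of n * k.

```python
# Six-stage pipeline from the slides; names of the functions are illustrative.
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]


def total_time_units(num_instructions, num_stages=len(STAGES)):
    """Time units needed by an ideal pipeline: k + (n - 1)."""
    return num_stages + num_instructions - 1


def timing_diagram(num_instructions, stages=STAGES):
    """Return one row per instruction; row[t] is the stage active in time unit t + 1."""
    width = total_time_units(num_instructions, len(stages))
    rows = []
    for i in range(num_instructions):
        row = [""] * width
        for s, name in enumerate(stages):
            # Instruction i (0-based) occupies stage s during time unit i + s + 1.
            row[i + s] = name
        rows.append(row)
    return rows


if __name__ == "__main__":
    n = 9  # assumed instruction count for the illustration
    for i, row in enumerate(timing_diagram(n), start=1):
        print(f"I{i:<2} " + " ".join(f"{cell:>2}" for cell in row))
    print("pipelined:", total_time_units(n), "time units; unpipelined:", n * len(STAGES))
```

Each row is shifted one time unit to the right of the previous one, which is the staircase pattern of the timing diagram.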

Limits to performance enhancement

The factors that limit performance are:

1. If the six stages are not of equal duration, there will be some waiting time at various stages.

2. A conditional branch instruction can invalidate several instruction fetches.

3. An interrupt is an unpredictable event.

4. Register and memory conflicts.

5. The CO stage may depend on the contents of a register that could be altered by a previous instruction that is still in the pipeline.

Effect of conditional branch on instruction pipeline operation

Conditional branch instructions

Assume that instruction 3 is a conditional branch to instruction 15.

Until the instruction is executed, there is no way of knowing which instruction will come next.

The pipeline simply loads the next instruction in sequence and proceeds.

The branch is not determined until the end of time unit 7.

During time unit 8, instruction 15 enters the pipeline.

No instructions complete during time units 9 through 12.

This is the performance penalty incurred because we could not anticipate the branch.
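The numbers in this example follow from the stage positions, as the sketch below shows. It assumes instruction i enters FI in time unit i, that the branch outcome is known at the end of the branch's EI stage, and that the target is fetched in the following time unit; the function name and the dictionary keys are hypothetical.

```python
def branch_timeline(branch_index, num_stages=6, execute_stage=5):
    """Timeline of a taken branch in an ideal pipeline (1-based time units).

    Assumes instruction i enters the first stage in time unit i and that the
    branch is resolved at the end of its execute stage (EI, the 5th stage).
    """
    resolved_at = branch_index + execute_stage - 1        # branch is in EI
    target_enters = resolved_at + 1                        # target is fetched next
    branch_completes = branch_index + num_stages - 1       # branch leaves WO
    target_completes = target_enters + num_stages - 1      # target leaves WO
    # Instructions fetched after the branch are discarded, so nothing
    # completes between the branch itself and the branch target.
    idle_units = list(range(branch_completes + 1, target_completes))
    return {
        "resolved_at": resolved_at,
        "target_enters": target_enters,
        "target_completes": target_completes,
        "idle_units": idle_units,
    }


if __name__ == "__main__":
    print(branch_timeline(3))
```

For a branch at instruction 3 this gives resolution at time unit 7, the target entering at time unit 8, and idle completion slots in time units 9 through 12.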

Simple pattern for high performance

• It might appear that the more stages a pipeline has, the faster the execution rate. Two factors frustrate this simple pattern for high performance:

1. At each stage of the pipeline, there is some overhead involved in moving data from buffer to buffer and in performing various preparation and delivery functions. This overhead can lengthen the execution time of a single instruction. It is significant when sequential instructions are logically dependent, either through heavy use of branching or through memory access dependencies.

2. The amount of control logic required to handle memory and register dependencies and to optimize the use of the pipeline increases enormously with the number of stages.

Six-stage CPU instruction pipeline

Dealing with branches

A variety of approaches have been taken for dealing with conditional branches:

• Multiple streams
• Prefetch branch target
• Loop buffer
• Branch prediction
• Delayed branch

Multiple streams

A simple pipeline must choose one of the two instructions to fetch next and may make the wrong choice.

Multiple streams allow the pipeline to fetch both instructions, making use of two streams.

Problems with this approach:

• With multiple pipelines there are contention delays for access to the registers and to memory.

• Additional branch instructions may enter the pipeline (in either stream) before the original branch decision is resolved. Each such instruction needs an additional stream.

Examples: IBM 370/168 and IBM 3033.

Prefetch Branch Target

When a conditional branch is recognized, the target of the branch is prefetched, in addition to the instruction following the branch.

This target is then saved until the branch instruction is executed.

If the branch is taken, the target has already been prefetched.

The IBM 360/91 uses this approach.

Loop buffer

A loop buffer is a small, very high-speed memory maintained by the instruction fetch stage.

It contains the n most recently fetched instructions, in sequence.

If a branch is to be taken, the hardware first checks whether the branch target is within the buffer.

If so, the next instruction is fetched from the buffer.

Benefits of loop buffer

• Instructions fetched in sequence will be available without the usual memory access time.

• If a branch occurs to a target just a few locations ahead of the address of the branch instruction, the target will already be in the buffer. This is useful for the rather common occurrence of IF-THEN and IF-THEN-ELSE sequences.

• It is well suited to loops or iterations, hence the name loop buffer. If the loop buffer is large enough to contain all the instructions in a loop, then those instructions need to be fetched from memory only once, for the first iteration. For subsequent iterations, all the needed instructions are already in the buffer.

A loop buffer is similar to a cache. The least significant 8 bits of the branch address are used to index the buffer, and the remaining most significant bits are checked to determine whether the branch target is in the buffer.

[Figure: loop buffer (256 bytes). The 8 least significant bits of the branch address index the buffer; the most significant address bits are compared to determine a hit, in which case the buffer supplies the instruction to be decoded.]
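The index-and-compare lookup can be sketched as follows. This is a simplified model, not the actual hardware: it assumes a byte-addressed 256-entry buffer and stores the upper address bits next to every entry, which is one possible way to perform the comparison. The class and method names are hypothetical.

```python
INDEX_BITS = 8
BUFFER_SIZE = 1 << INDEX_BITS  # 256 entries, matching the 256-byte buffer


class LoopBuffer:
    """Simplified loop buffer: low 8 address bits index, high bits are compared."""

    def __init__(self):
        self.tags = [None] * BUFFER_SIZE   # most significant address bits per entry
        self.data = [None] * BUFFER_SIZE   # buffered instruction bytes

    def store(self, address, byte):
        """Record a fetched instruction byte in the buffer."""
        index = address & (BUFFER_SIZE - 1)
        self.tags[index] = address >> INDEX_BITS
        self.data[index] = byte

    def lookup(self, address):
        """Return the buffered byte on a hit, or None on a miss."""
        index = address & (BUFFER_SIZE - 1)
        if self.tags[index] == address >> INDEX_BITS:
            return self.data[index]
        return None


if __name__ == "__main__":
    buf = LoopBuffer()
    for addr in range(0x1200, 0x1300):      # sequential fetches fill the buffer
        buf.store(addr, addr & 0xFF)
    print(buf.lookup(0x1234))               # branch target inside the buffer: hit
    print(buf.lookup(0x1334))               # same index, different upper bits: miss
```

A branch whose target lies within the buffered range is served from the buffer; any other target falls through to a normal memory fetch.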

Branch prediction

Various techniques are used to predict whether a branch will be taken:

1. Predict never taken (static)
2. Predict always taken (static)
3. Predict by opcode (static)
4. Taken/not taken switch (dynamic)
5. Branch history table (dynamic)

Static branch strategies

• Static strategies (1, 2, 3) do not depend on the execution history.

• Predict never taken: always assume that the branch will not be taken and continue to fetch instructions in sequence.

• Predict always taken: always assume that the branch will be taken and always fetch from the branch target.

• Predict by opcode: the decision is based on the opcode of the branch instruction. The processor assumes that the branch will be taken for certain branch opcodes and not for others.

Dynamic branch strategies

Dynamic strategies (4, 5) depend on the execution history.

They attempt to improve the accuracy of prediction by recording the history of conditional branch instructions in a program.

For example, one or more bits can be associated with each conditional branch instruction to reflect its recent history.

These bits are referred to as a taken/not taken switch.

The history bits are stored in temporary high-speed memory. One possibility is to associate the bits with each conditional branch instruction and use them to make the decision.

Another possibility is to maintain a small table of recent history, with one or more bits in each entry.

With only one bit of history, a prediction error will occur twice for each use of a loop: once on entering the loop and once on exiting.

With two bits, the decision process can be represented by a finite-state machine with four states.

If the last two branches of the given instruction have taken the same path, the prediction is to take that path again.

If the prediction is wrong, it remains the same the next time the instruction is encountered.

If the prediction is wrong again, the opposite path is predicted.
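The difference between one and two history bits can be checked with a small simulation. The two-bit predictor below is modeled as a saturating counter, which is one common realization of a four-state scheme and not necessarily the exact state diagram the slides refer to. The loop pattern (three taken branches followed by one not-taken exit, repeated three times) and the function names are assumptions for illustration.

```python
def one_bit_mispredictions(outcomes, last_taken=True):
    """Predict that each branch repeats the previous outcome (one history bit)."""
    misses = 0
    for taken in outcomes:
        if taken != last_taken:
            misses += 1
        last_taken = taken  # the single bit always records the latest outcome
    return misses


def two_bit_mispredictions(outcomes, counter=3):
    """Two-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if taken != predicted_taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


if __name__ == "__main__":
    # Loop-closing branch: taken three times, then not taken on exit; loop used 3 times.
    loop_branch = [True, True, True, False] * 3
    print("one bit :", one_bit_mispredictions(loop_branch))
    print("two bits:", two_bit_mispredictions(loop_branch))
```

With one bit, every use of the loop after the first costs two mispredictions (re-entry and exit); with two bits, only the exit is mispredicted, because a single wrong prediction does not flip the predicted direction.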

Greater efficiency could be achieved if the instruction fetch could be initiated as soon as the branch decision is made.

For this purpose, more information must be saved, in what is known as a branch target buffer or a branch history table.

Branch history table

It is a small cache memory associated with the instruction fetch stage.

Each entry in the table consists of three elements:

• The address of a branch instruction
• Some number of history bits
• Information about the target instruction

The third field may contain the address of the target instruction or the target instruction itself.

Dealing with branches

Branching strategies

If a branch is taken, some logic in the processor detects this and instructs that the next instruction be fetched from the target address.

Each prefetch triggers a lookup in the branch history table.

If no match is found, the next sequential instruction address is used for the fetch.

If a match occurs, a prediction is made based on the state of the instruction.

When the branch instruction is executed, the execute stage signals the branch history table logic with the result.
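The lookup and update steps can be sketched as a small table keyed by branch address. Several details here are assumptions: the table is an unbounded dictionary rather than a small cache with replacement, the history is a two-bit saturating counter, the third field holds the target address rather than the target instruction, and the class and method names are hypothetical.

```python
class BranchHistoryTable:
    """Toy branch history table: branch address -> [history counter, target address]."""

    def __init__(self):
        self.entries = {}

    def predict_next_fetch(self, address, instr_length=1):
        """Called on each prefetch: return the address to fetch next."""
        entry = self.entries.get(address)
        if entry is None:
            return address + instr_length      # no match: fetch sequentially
        counter, target = entry
        # Match: predict from the history state (counter 2-3 means "taken").
        return target if counter >= 2 else address + instr_length

    def update(self, address, taken, target):
        """Called by the execute stage with the actual branch result."""
        if address not in self.entries:
            # New entry starts in a weak state matching the first outcome.
            self.entries[address] = [2 if taken else 1, target]
            return
        entry = self.entries[address]
        entry[0] = min(3, entry[0] + 1) if taken else max(0, entry[0] - 1)
        entry[1] = target


if __name__ == "__main__":
    bht = BranchHistoryTable()
    print(bht.predict_next_fetch(100))   # unknown branch: sequential fetch
    bht.update(100, taken=True, target=40)
    print(bht.predict_next_fetch(100))   # now predicted taken: fetch from target
```

A real table has a fixed number of entries, so it also needs a replacement policy for evicting old branches; that part is omitted here.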

Delayed branch

It is possible to improve pipeline performance by automatically rearranging instructions within the program so that branch instructions occur later than actually desired.

Intel 80486 Pipelining

• Fetch
  - From cache or external memory
  - Put in one of two 16-byte prefetch buffers
  - Fill buffer with new data as soon as old data consumed
  - Average 5 instructions fetched per load
  - Independent of other stages to keep buffers full

• Decode stage 1 (D1)
  - Opcode & address-mode info
  - At most first 3 bytes of instruction
  - Can direct D2 stage to get rest of instruction

• Decode stage 2 (D2)
  - Expand opcode into control signals
  - Computation of complex address modes

• Execute
  - ALU operations, cache access, register update

• Writeback
  - Update registers & flags
  - Results sent to cache & bus interface write buffers
