
Part 8 Instruction Level Parallelism (ILP) - Pipelining


Page 1: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Part 8

Instruction Level Parallelism (ILP) - Pipelining

Computer Architecture

Slide Sets

WS 2010/2011

Prof. Dr. Uwe Brinkschulte
Prof. Dr. Klaus Waldschmidt

Page 2: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Parallel Computing

Instruction-Level Parallelism:

• Pipelining

• Superscalar

• VLIW

• EPIC

Thread- and Task-Level Parallelism:

• Multithreading

• Multiprocessing

• Multi-Cores

• Cluster of Computers

• Cloud- and Grid-Computing

Page 3: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Architectures with instruction level parallelism (ILP)
Pipelining vs. concurrency

The basis of most computer architectures is still the well-known von Neumann or Harvard principle. This principle relies on a sequential mode of operation.

In modern high performance processors this sequential operation mode is extended by instruction level parallelism (ILP).

ILP can be implemented by two modes of parallelism:

• Parallelism in time (pipelining)

• Parallelism in space (concurrency)

Page 4: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Pipelining vs. concurrency

These two techniques of parallelism, in combination with technological improvements, are an important source of high performance.

• Parallelism in time (pipelining) means that the execution of instructions is overlapped in time by partitioning the instruction cycle.

• Parallelism in space (concurrency) means that more than one instruction is executed in parallel, either in order or out of order.

Both techniques are combined in modern microprocessors and define the instruction level parallelism for better performance.

Page 5: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Pipelining vs. concurrency

[Figure: pipelining overlaps instructions 1-3 in time, shifted by one pipeline stage per clock cycle; concurrency executes instructions 1-3 side by side in parallel units]

Parallelism in time relies on the assembly line principle, which has also matured in automotive production. It can be effectively combined with concurrency.

In computer architectures, an assembly line is called a pipeline.

Page 6: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Pipelining vs. concurrency

"Pipelines accelerate execution speed in the same way that Henry Ford revolutionized car manufacturing with the introduction of the assembly line" (Peter Wayner, 1992)

Pipelining means the fragmentation of a machine instruction into several partial operations.

These partial operations are executed by partial units in a sequential and synchronized manner. Every processing unit executes only one specific partial operation.

All partial processing units together are called a pipeline.

Page 7: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Fragmentation of the instruction cycle

Possible fragmentation into 5 stages:

1. instruction fetch

The instruction addressed by the program counter is loaded from main memory or a cache into the instruction register. The program counter is incremented.

2. instruction decode

Internal control signals are generated according to the instruction's opcode and addressing modes.

3. operand fetch

The operands are provided by registers or functional units.

Page 8: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Fragmentation of the instruction cycle

4. execute

The operation is executed with the operands.

5. write back

The result is written into a register or bypassed to serve as operand for a succeeding operation.

Depending on the instruction or instruction class, some stages may be skipped.

The entirety of stages is called the instruction cycle.

Page 9: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Instruction pipelining

• In the first stage, the fetch unit accesses the instruction.

• The fetched instruction is passed to the instruction decode unit.

• While this second unit processes the instruction, the first unit already fetches the next instruction.

• In the best case, an n-stage pipeline executes n instructions in parallel.

• Each instruction is in a different stage of its execution.

• When the pipeline is filled, the execution of one instruction is finished every clock cycle.

• A processor capable of finishing one instruction per clock cycle is called a scalar processor.

Page 10: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Instruction pipelining

[Figure: three consecutive instructions in the pipeline; the stages instruction fetch, instruction decode, operand fetch, execute and write back of each instruction are shifted by one clock cycle relative to the previous instruction]

Page 11: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Pipeline design principles

• Pipeline stages are linked by registers.

• The instruction and the intermediate results are forwarded every clock cycle (in special cases every half clock cycle) to the next pipeline register.

• A pipeline is as fast as its slowest stage.

• Therefore, an important issue in pipeline design is to ensure that the stages consume equivalent amounts of time.

• A high number of pipeline stages (often called a superpipeline) leads to short clock cycles and higher speedup.

• But a stall of a long pipeline, e.g. due to a control flow dependency, results in long wait times until the pipeline can be refilled.

• Thus, a real trade-off exists for the designer.

Page 12: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Basic pipeline measures

Pipelining belongs to the class of fine grain parallelism. It takes place at the microarchitectural level.

Definitions:

• An operation is the application of a function F to operands. An operation produces a result.

• An operation can be made up of a set of partial operations f1 ... fp (in most cases p = k). It is assumed that the partial operations are applied in sequential order.

• An instruction defines through its format the function, operands and result.

A k-stage pipeline executes n operations of F in

tp(n,k) = k + (n - 1) cycles:

k cycles to execute the first instruction (filling the pipeline), n - 1 cycles to execute the remaining n - 1 instructions.
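A minimal Python sketch of this timing model (function name and example values are illustrative, not from the slides):

def pipeline_cycles(n, k):
    # tp(n,k) = k + (n - 1): k cycles to fill, then one result per cycle
    return k + (n - 1)

# example used on the next slide: 10 operations in a 5-stage pipeline
print(pipeline_cycles(10, 5))  # 14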

Page 13: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Pipeline operation

The figure shows the example tp(10,5) = 5 + (10 - 1) = 14.

[Figure: occupancy of pipeline stages 1-5 over time for instructions i to i+9; the pipeline runs through a start-up (fill) phase, a processing phase, and a drain phase]

Page 14: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Basic pipeline measures

Pipeline throughput:

T(n,k) = # operations / tp(n,k) = n / (k + (n - 1)) operations per cycle

Pipeline speedup:

S(n,k) = unpipelined execution time / pipelined execution time = n · k / (k + (n - 1))

lim S(n,k) = k for n → ∞

In a best case scenario, where a high number of linearly succeeding operations is executed, the pipeline speedup converges to the number of pipeline stages.

Page 15: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Basic pipeline measures

Pipeline efficiency:

E(n,k) = S(n,k) / k = n · k / (k · (k + (n - 1))) = n / (k + (n - 1))

lim E(n,k) = 1 for n → ∞

Pipeline efficiency reaches 1 (peak performance) if an infinite operation stream without bubbles or stalls is executed. This is of course only a best case analysis.

Practical evaluation: Hockney numbers:
n∞ : pipeline peak performance at an infinite number of operations
n½ : number of operations at which the pipeline reaches half of its peak performance
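A small Python sketch evaluating these measures (illustrative only):

def throughput(n, k):
    # T(n,k) in operations per cycle
    return n / (k + (n - 1))

def speedup(n, k):
    return n * k / (k + (n - 1))

def efficiency(n, k):
    return speedup(n, k) / k

# 10 operations on a 5-stage pipeline
print(throughput(10, 5), speedup(10, 5), efficiency(10, 5))
# ~0.71 ops/cycle, ~3.57, ~0.71; for n -> infinity: S -> 5, E -> 1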

Page 16: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Pipeline stages

[Figure: a k-stage pipeline implementing an operation F; instructions and operands enter at stage f1 and pass through f2, f3, ... fk, which deliver the results]

Stages are separated by registers.

Page 17: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Partitioning of an operation F:

If a partitioning of an operation is impossible, F can also be applied in parallel and overlapped over two clock cycles.

[Figure: an operation F of time tf is either partitioned into two suboperations f1 and f2 of time tf/2 each, or duplicated into two parallel units F (1/1', 2/2') of time tf that are fed alternately and operate overlapped over two clock cycles]

Page 18: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Operation example for partitioning

[Figure: timing of instructions i to i+3 over cycles t to t+5 for both variants: the partitioned pipeline (suboperations f1, f2 of time tf/2) and the duplicated, overlapped units F of time tf]

Page 19: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Balancing pipeline suboperations

If tfi = max(tf1 ... tfk) determines the clock frequency in an unbalanced pipeline (tfi >> tf1, ... , tfi >> tfk), fi should be partitioned further for better performance.

[Figure: a pipeline f1, f2, f3 with a dominating f2 (f1 << f2, f2 >> f3); version 1 partitions f2 into sequential suboperations f2a, f2b, f2c, version 2 replicates f2 into parallel, alternately fed units]

Page 20: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Overall execution time, clock frequency

Overall pipelined execution time of an operation F:

t(F) = (max(tfi) + tpd + tsu) · k = k · max(tfi) + k · (tpd + tsu)

max(tfi) + tpd + tsu corresponds to the clock period, k to the number of stages; max(tfi) is the maximum processing time of a suboperation, tpd + tsu the register delay.

Clock period:

cp = max(tfi) + tpd + tsu

Register delays:

tpd = propagation delay time
tsu = set up time
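A minimal Python sketch of this clocking model (the suboperation times and register delays are made-up example values):

def clock_period(tf, tpd, tsu):
    # cp = max(tfi) + tpd + tsu, tf holds the suboperation times
    return max(tf) + tpd + tsu

def total_time(tf, tpd, tsu):
    # t(F) = cp * k for a k-stage pipeline
    return clock_period(tf, tpd, tsu) * len(tf)

tf = [2.0, 3.5, 2.5]                 # hypothetical stage times in ns
print(clock_period(tf, 0.2, 0.3))    # 4.0 ns
print(total_time(tf, 0.2, 0.3))      # 12.0 ns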

Page 21: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Architecture of a linear 5-stage pipeline with registers

[Figure: the stages IF, ID, OF, EX and WB linked by registers; the PC addresses the IC, the fetched instruction is stored in the IR, decoded into the CR, operands are read from the RF into the ORs, executed in the ALU and written back; loads and stores access the DC]

IC = instruction cache
DC = data cache
IR = instruction register
CR = control register
RF = register file, e.g. 3-port register file
DE = decoder (control unit)
OR = operand register
PC = program counter

IF = instruction fetch
ID = instruction decode
OF = operand fetch
EX = execute
WB = write back

Page 22: Part 8 Instruction Level Parallelism (ILP) - Pipelining


Pipeline hazards

So far, we have assumed a smooth throughput of operations through the pipeline

But, there are several effects which can cause stalls in pipelined operations

These effects are called pipeline hazards

Pipeline hazards can be caused by

• dataflow dependencies

• resource dependencies

• controlflow dependencies

Page 23: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Dataflow dependencies

Pipelined processors have to consider 3 classes of dataflow dependencies. The same dependencies have to be considered in concurrency.

1. true dependency: read after write (RAW)

destination (i) = source (i+1)

X ← A + B   instruction i
Y ← X + B   instruction i+1

X has to be written by instruction i before it is read by the succeeding instruction.

A hazard occurs if the distance of the two instructions is smaller than the number of pipeline stages. In this case X would have to be read before it has been created.
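A toy Python sketch of this distance check (register names and the stage count are illustrative):

def raw_hazard(dest_i, sources_next, distance, stages=5):
    # true dependency hazard: a later instruction reads what i writes
    # while i is still in flight (distance smaller than the stage count)
    return dest_i in sources_next and distance < stages

# i: X := A op B ; i+1: Y := X op C  -> distance 1 in a 5-stage pipeline
print(raw_hazard("X", {"X", "C"}, 1))  # True: stall, NOOPs or forwarding needed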

Page 24: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Dataflow dependencies

2. anti dependency: write after read (WAR)

source (i) = destination (i+1)

X ← Y + B   instruction i
Y ← A + C   instruction i+1

Y has to be read by instruction i before it is written by the succeeding instruction.

A hazard occurs if the order of the instructions is changed in the pipeline.

Page 25: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Dataflow dependencies

3. output dependency: write after write (WAW)

destination (i) = destination (i+1)

Y ← A / B   instruction i
Y ← C + D   instruction i+1

Both instructions write their results into the same register.

A hazard occurs if the order of the instructions is changed in the pipeline.

Page 26: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Dependency graph

Example of a short assembler program containing a true dependency, anti dependencies and an output dependency:

I1 ADD R1,2,R2   ; R1 = R2+2
I2 ADD R4,R3,R1  ; R4 = R1+R3
I3 MULT R3,3,R5  ; R3 = R5·3
I4 MULT R3,3,R6  ; R3 = R6·3

[Figure: dependency graph over I1-I4 with a true dependency I1→I2 (R1), anti dependencies from I2 to I3 and to I4 (R3 is read by I2 and written by I3/I4), and an output dependency I3→I4 (both write R3)]

Page 27: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Example of a true dependency hazard (RAW) in a 5-stage pipeline

i:   X := A op B
i+1: Y := X op C

[Figure: both instructions run through the stages fetch, decode, read, execute, write; the issue check for i+1 at the issue point detects the RAW hazard: i+1 would read X and C before i has written X in the write stage]

Page 28: Part 8 Instruction Level Parallelism (ILP) - Pipelining


Solutions for true dependency hazards

Software solutions:

• Inserting NOOP instructions

• Reorder instructions

Hardware solutions:

• Pipeline interlocking

• Forwarding

Any combinations of these solutions are possible as well

Page 29: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Solving a true dependency hazard by inserting NOOPs

The RAW hazard is eliminated through insertion of NOOPs (bubbles) into the pipeline. This was the solution used in the first RISC processors.

[Figure: the compiler or programmer inserts two NOOPs between i and i+1; i+1 therefore reaches the read stage only after i has written X in the write stage]

Page 30: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Solving a true dependency hazard by reordering instructions

Sometimes, instead of inserting NOOPs, instructions can be reordered to have the same effect.

Therefore, instructions having no true dependencies and not changing the control flow are arranged in between the conflicting instructions.

Example:

with NOOPs:          reordered:
X := A op B          X := A op B
NOOP                 Z := D op E
NOOP                 F := INP(0)
Y := X op C          Y := X op C
Z := D op E
F := INP(0)

Page 31: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Solving a true dependency hazard by pipeline interlocking

Pipeline interlocking means the pipeline processing is delayed by hardware until the conflict is solved.

So the compiler or programmer is relieved (used e.g. in the MIPS processor, Microprocessor with Interlocked Pipeline Stages).

[Figure: the issue check at the issue point holds i+1 back for two cycles; the interlocking hardware inserts the bubbles instead of NOOPs, and i+1 reads X only after i has written it]

Page 32: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Forwarding

Forwarding is a simple hardware technique to save one delay slot (NOOP).

An operand X needed by instruction i+1 is directly forwarded from the output of the ALU to its input. The register file is bypassed.

If more than one delay slot is necessary, forwarding is combined with interlocking or NOOP insertion.

The data forwarding path can also be used to provide operands of a waiting instruction from the cache.

This shortens the delay slot between a load and an execute instruction using this operand.

Data cache access is sped up considerably by this technique.
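A toy Python sketch of the result forwarding decision (the field and register names are made up; real hardware compares register specifiers between pipeline registers):

def read_operand(ex_dest, ex_result, src_reg, regfile):
    # result forwarding: take the ALU output of the previous instruction
    # instead of the stale register file value
    if ex_dest == src_reg:
        return ex_result          # bypass the register file
    return regfile[src_reg]

regs = {"A": 1, "B": 2, "C": 4, "X": 0}   # X not yet written back
x = regs["A"] + regs["B"]                 # i: X := A op B, result still in EX
print(read_operand("X", x, "X", regs))    # i+1 gets X = 3 via the bypass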

Page 33: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Load and result forwarding

[Figure: datapath cache → memory register → ALU with two bypasses: load forwarding from the cache directly to the ALU input, and result forwarding from the ALU output back to its input]

Page 34: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Hardware realization of the forward path

[Figure: the 5-stage pipeline fetch, decode, read (RF read), execute (EX) and write (RF write); a data forwarding path (result forwarding) feeds the EX result (R) back to the operand inputs (S1, S2) under forward control, and a load data path (load forwarding) delivers cache data directly; with forwarding, only 1 NOOP or interlock cycle remains between i and i+1, checked at the issue point]

Page 35: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Anti- and output-dependency hazards (false dependencies)

An output dependency hazard may occur if an instruction i needs more time units to execute than instruction i+1.

Of course this is only possible if the processor consists of several processing units with different numbers of stages.

Anti-dependency hazards only occur if the order of instructions is changed in the pipeline.

This is never true for ordinary scalar pipelines.

In superscalar pipelines, this hazard occurs.

Page 36: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Output dependency hazard (regarding only 3 stages of the 5-stage pipeline)

[Figure: stages read, execute and write with two functional units FU1 and FU2; i is issued to FU1 and needs three execute cycles (1., 2., 3. A op B), i+1 is issued to FU2 and finishes C op D in one cycle, so i+1 writes Y before i writes Y]

Page 37: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Removing false dependencies

False dependencies can always be removed by register renaming.

This can be done by hardware or by the compiler.

So the hazard will never occur.

Example:

anti dependency:    output dependency:
X := Y op B         Y := A op B
Y := A op C         Y := C op D

Renaming the second Y to Z:

X := Y op B         Y := A op B
Z := A op C         Z := C op D

Page 38: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Resource dependencies

Resource dependencies can be classified into:

• intra-pipeline dependencies

• instruction class dependencies

An intra-pipeline dependency occurs if instructions in two succeeding stages need the same pipeline resource.

The succeeding instruction (and the following instructions) have to be delayed until the resource becomes available.

This happens e.g. if the common register file lacks a sufficient number of ports or some instructions need more than one clock cycle to run through a particular pipeline resource.

Examples: a register file with a common read/write port (possible conflict of a read in stage 3 with a write in stage 5) or a multi-cycle division unit in the execute stage.

Page 39: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Resource dependencies

An instruction class dependency occurs if two or more instructions in the same pipeline stage need a pipeline resource that exists only once.

This never happens in a scalar pipeline.

Superscalar processors with several execution units often face this sort of conflict.

A twofold superscalar processor may issue two instructions to two execution units simultaneously.

If these instructions need the same execution unit, which exists only once, an instruction class dependency arises.

Page 40: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Control flow dependencies

Every change in control flow is a potential candidate for conflicts.

Several instruction classes cause changes in control flow:

• conditional branch

• jump

• jump to subroutine, return from subroutine

The control flow target is not yet available when the next instruction is to be fetched.

Especially conditional branches cause severe conflicts.

The analysis of the condition, which usually finishes only in the last pipeline stages, determines the next instruction to issue.

Page 41: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Control flow hazards

Example of a control flow hazard due to a conditional branch

[Figure: CMP followed by BRANCH COND in the pipeline IF, ID, OF, EX, WB; the condition code is produced by CMP only in a late stage, so the fetch of the next correct instruction (NEXT CORRECT I) must wait until the branch outcome is known]

Page 42: Part 8 Instruction Level Parallelism (ILP) - Pipelining


Solutions for control flow hazards

Software solutions:

• Inserting NOOP instructions

• Reorder instructions

Hardware solutions:

• Pipeline interlocking

• Forwarding

• Fast compare and jump logic

• Branch prediction

Page 43: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Solution: interlocking or NOOP insertion

[Figure: CMP and BRANCH COND run through IF, ID, OF, EX, WB; the condition code becomes available only at the end of the pipeline, so the delay slots up to the fetch of NEXT CORRECT I and NEXT+1 CORRECT I must be filled with NOOPs or interlocking]

Penalty: 6 cycles

Page 44: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Reducing penalty by forwarding the comparison result

[Figure: the comparison result of CMP is forwarded from the EX stage directly to the branch, so NEXT CORRECT I can be fetched earlier; fewer delay slots must be filled with NOOPs or interlocking]

Penalty: 4 cycles

Page 45: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Reducing penalty by forwarding the next correct instruction address

[Figure: in addition, the address of the next correct instruction is forwarded directly to the fetch stage, saving one more cycle of NOOPs or interlocking]

Penalty: 3 cycles

Page 46: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Reducing penalty by fast compare and jump logic

[Figure: a fast compare logic delivers the comparison result and a fast jump logic the branch target in early pipeline stages; only a short delay remains to be bridged with NOOPs or interlocking]

Penalty: 2 cycles

Page 47: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Reducing penalty by fast compare and jump logic

Special logic for compare and jump instructions can reduce the penalty by one cycle.

These circuits can be much faster than a more general execution unit (ALU), allowing comparison and jump to complete in one clock cycle.

The higher speed of the fast compare logic is possible because normally only simple comparisons like equal, unequal, <0, >0, ≤0, ≥0, =0 are needed.

Page 48: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Reducing penalty by fast compare and jump logic + reordering instructions

The remaining 2 NOOPs or interlock cycles can be replaced by reordering code.

Two independent instructions can be moved after the branch instruction (delayed branch).

Example:

original:                    reordered (delayed branch):
Z := D op E                  CMP
F := INP(0)                  BRANCH COND
CMP                          Z := D op E
BRANCH COND                  F := INP(0)
NOOP                         NEXT INSTR (COND = FALSE)
NOOP                         . . .
NEXT INSTR (COND = FALSE)    NEXT INSTR (COND = TRUE)
. . .
NEXT INSTR (COND = TRUE)

Page 49: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Branch prediction

Another possibility for avoiding control flow hazards is branch prediction.

Here, the outcome of the branch (taken or not taken) is predicted before the result of the comparison is known.

In case of correct branch prediction, the penalty can be reduced to as little as 0.

First, let's assume we have a perfectly working branch predictor.

Page 50: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Reducing penalty by branch prediction

[Figure: a branch predictor attached to the IF stage delivers the prediction result (taken or not taken) and the next address; without knowledge of the branch target in the fetch stage, fetching NEXT CORRECT I still costs two cycles]

Penalty: still 2 cycles

Page 51: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Branch target address cache

To further reduce the penalty, a branch target address cache (BTAC) can be introduced.

This cache holds the addresses of branches and the corresponding target addresses.

Therefore, if the BTAC is already filled, a branch and its possible target address can be identified in the fetch phase.

[Figure: the BTAC is indexed with part of the branch address (e.g. the lower m bits); each entry holds a branch address and the corresponding branch target address]
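A toy Python sketch of such a direct-mapped BTAC lookup (the size, addresses and field layout are illustrative assumptions):

M = 8                        # index with the lower m = 8 address bits
btac = [None] * (1 << M)     # each entry: (branch address, target address)

def btac_lookup(pc):
    # identify a branch and its target already in the fetch phase
    entry = btac[pc & ((1 << M) - 1)]
    if entry and entry[0] == pc:     # compare the full branch address
        return entry[1]              # predicted branch target address
    return None                      # BTAC miss

btac[0x4A] = (0x144A, 0x2000)        # branch at 0x144A targets 0x2000
print(hex(btac_lookup(0x144A)))      # 0x2000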

Page 52: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Reducing penalty by branch prediction and branch target address cache

[Figure: the branch predictor delivers the prediction result and the BTAC the next address already in the IF stage, so NEXT CORRECT I and NEXT+1 CORRECT I can follow the branch without any bubble]

Penalty: 0 cycles

Page 53: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Branch prediction and pipeline utilization

For a penalty of 0 cycles, two prerequisites must be met:

• the branch address must be stored in the BTAC

• the branch prediction must be correct

Otherwise we will get a penalty.

Page 54: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Branch prediction and pipeline utilization

In case of a BTAC miss, the penalty will be pb (in our example 2).

In case of a misprediction, the penalty will be the number of cycles pm needed to flush the pipeline (e.g. 5).

In modern processors, this can be much more (e.g. 11 for the Pentium II).

The overall penalty calculates to:

p = m · pm + (1 - m) · b · pb   with m: misprediction rate, b: BTAC miss rate

The pipeline utilization can be calculated as:

u = n / (n + p)   with n: number of instructions

So, an excellent branch prediction is necessary.
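A small Python sketch of these formulas (the rates and instruction mix are made-up example values):

def overall_penalty(m, b, pm, pb):
    # p = m*pm + (1 - m)*b*pb
    return m * pm + (1 - m) * b * pb

def utilization(n, p):
    # u = n / (n + p)
    return n / (n + p)

p_branch = overall_penalty(0.10, 0.20, 5, 2)  # 0.86 cycles per branch
p_total = 200 * p_branch                      # assume 200 branches
print(utilization(1000, p_total))             # ~0.85 for 1000 instructions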

Page 55: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Branch prediction techniques

In general, two classes of branch prediction techniques can be distinguished:

• static branch prediction

for a given branch, the prediction is always the same, it never changes

• dynamic branch prediction

for a given branch, the prediction changes dynamically

Page 56: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Static branch prediction

• Predict always not taken

the simplest technique, no BTAC necessary; in the first attempt the branch is always ignored

• Predict always taken

a bit more complicated, needs a BTAC to take the branch in the first attempt; produces slightly better results

• Predict backward taken, forward not taken

loop-oriented prediction; a backward branch often belongs to a loop and therefore is taken quite often

• Compiler controlled

the compiler sets a bit for each branch to tell the processor how to predict the branch; still static, since it never changes during runtime

Page 57: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Dynamic branch prediction

Dynamic branch prediction means that information about the probability of a branch is collected at runtime.

Dynamic branch prediction is based on knowledge about the past behavior of the branch.

This knowledge can be stored in a table that is addressed through the address of the branch instruction.

Often, this information is stored in the BTAC as well, but there are also solutions with separate tables.

Dynamic branch prediction produces much better results than static branch prediction.

Today, a misprediction rate below 10% is possible.

Page 58: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Using the BTAC to store branch history information

[Figure: the BTAC, indexed with part of the branch address (e.g. the lower m bits), now holds per entry the branch address, the branch target address and additional history bits representing the branch history]

Page 59: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Interferences

Only a part of the branch address is used as index into the table containing the branch history.

If two branches have an identical bit pattern in this part, they share the same table entry => interference.

This often leads to mispredictions, because one branch messes up the history of the other one.

The larger the history table, the fewer interferences occur.

Best case: all bits of the branch address are used as index => no interferences.

Due to limited chip space, this is not possible for large programs.

Page 60: Part 8 Instruction Level Parallelism (ILP) - Pipelining

One bit predictor

The simplest predictor; only one bit is used to store the branch history.

For each branch, two states (taken, not taken), depending on the last execution, are stored.

The prediction always refers to the last state.

[Figure: two-state automaton with states Predict Taken and Predict Not Taken; a taken branch (T) leads to or keeps Predict Taken, a not-taken branch (NT) leads to or keeps Predict Not Taken]

Page 61: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Two bit predictor

Two bits per branch are used to store the history.

This results in four states (strongly taken, weakly taken, weakly not taken, strongly not taken).

In a strong state, it takes two mispredictions to change the prediction.

Two bit predictor with saturation counter

[Figure: four-state automaton with Predict Strongly Taken (11), Predict Weakly Taken (10), Predict Weakly Not Taken (01) and Predict Strongly Not Taken (00); each taken branch (T) moves one step towards 11, each not-taken branch (NT) one step towards 00]
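A minimal Python sketch of such a two-bit saturating counter (the start state is an assumption):

class TwoBitPredictor:
    # states: 00/01 predict not taken, 10/11 predict taken
    def __init__(self):
        self.state = 2                 # start weakly taken (10)

    def predict(self):
        return self.state >= 2         # True = predict taken

    def update(self, taken):
        # saturation counter: step towards 11 on taken, towards 00 otherwise
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:  # a short branch history
    print(p.predict(), outcome)
    p.update(outcome)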

Page 62: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Two bit predictor

Two bit predictor with hysteresis counter

[Figure: four-state automaton with the same states (11), (10), (01), (00); unlike the saturation counter, a misprediction in a weak state switches directly to the opposite strong state]

Page 63: Part 8 Instruction Level Parallelism (ILP) - Pipelining

One bit predictor versus two bit predictor

A one bit predictor is simpler and needs less memory.

For a branch at the end of a loop, the one bit predictor correctly predicts the branch direction as long as the loop is iterated.

In a nested loop, each iteration of the outer loop produces two mispredictions in the inner loop: one when the inner loop is left and one when it is reentered.

A two bit predictor avoids one of these two mispredictions: it still mispredicts when the inner loop is left, but not when it is reentered.

The technique can be extended to n bits, but with no significant improvement in performance.

Page 64: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Correlation predictors

Often, branches are not independent.

Example:

    DEC A
    BRZ X
    . . .
X:  LD A,0
    BRZ Y

The second branch is always taken when the first branch is taken.

Both branches are correlated.

This is not exploited by the one or two bit predictors.

Page 65: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Correlation predictors

One or two bit predictors only use self-history.

Correlation predictors also use neighbor-history.

This means the own history and the history of neighboring branches, preceding in execution order, are used.

Notation: an (m,n) predictor uses the last m branches to select one of 2^m predictors, where each of these predictors is an n bit predictor for a single given branch.

A branch history register (BHR) is used to store the direction of the last m branches in an m-bit shift register.

The BHR is used as an index to select a pattern history table (PHT).

Page 66: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Implementation of a (2,2) predictor

[Figure: Pattern History Tables PHTs of 2-bit predictors; the Branch History Register BHR (a 2-bit shift register) selects one of the PHTs, while part of the branch address indexes the entry within the selected PHT]
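A compact Python sketch of this (2,2) scheme (the table size and address indexing are illustrative assumptions):

ROWS = 16                                # entries per PHT, indexed by address bits
bhr = 0                                  # 2-bit branch history register
phts = [[1] * ROWS for _ in range(4)]    # 4 PHTs of 2-bit saturating counters

def predict(pc):
    return phts[bhr][pc % ROWS] >= 2     # counters 10/11 -> predict taken

def update(pc, taken):
    global bhr
    ctr = phts[bhr][pc % ROWS]
    phts[bhr][pc % ROWS] = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
    bhr = ((bhr << 1) | int(taken)) & 0b11   # shift the outcome into the BHR

print(predict(0x40))       # prediction for a branch at address 0x40
update(0x40, True)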

Page 67: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Two level adaptive predictors

Two level adaptive predictors have been developed by Yeh and Patt at nearly the same time as the correlation predictors (1992).

Like the correlation predictor, the two level adaptive predictor uses two levels of tables, where the first level is used to select the prediction bits of the second level.

Variants of two level adaptive predictors:

                                       global PHT   per-set PHTs   per-address PHTs
global scheme (global BHR)             GAg          GAs            GAp
per-address scheme (per-address BHT)   PAg          PAs            PAp
per-set scheme (per-set BHT)           SAg          SAs            SAp

Page 68: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Two level adaptive predictors

Examples:

[Figure: table structures of GAg(4), GAp(4), PAg(4) and PAp(4); the number in parentheses is the length of the branch history register(s)]

For the s/S variants, only part of the branch address is used.

Page 69: Part 8 Instruction Level Parallelism (ILP) - Pipelining

gshare and gselect predictors

When using a global PHT, parts of the branch address bits and the BHR can be combined in two ways to address a PHT entry:

gselect: branch address bits and BHR are concatenated

gshare: branch address bits and BHR are XORed

gshare performs a bit better than gselect due to fewer interferences.

Example:

branch addr   BHR        gselect4/4   gshare8/8
00000000      00000001   00000001     00000001
00000000      00000000   00000000     00000000
11111111      00000000   11110000     11111111
11111111      10000000   11110000     01111111
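A tiny Python sketch reproducing the table above:

def gselect(addr, bhr, bits=4):
    # concatenate the lower `bits` of branch address and BHR
    mask = (1 << bits) - 1
    return ((addr & mask) << bits) | (bhr & mask)

def gshare(addr, bhr, bits=8):
    # XOR the lower `bits` of branch address and BHR
    mask = (1 << bits) - 1
    return (addr & mask) ^ (bhr & mask)

print(f"{gselect(0b11111111, 0b10000000):08b}")  # 11110000
print(f"{gshare(0b11111111, 0b10000000):08b}")   # 01111111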

Page 70: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Hybrid predictors

A hybrid or combined predictor consists of two different branch predictors and a selection predictor choosing one of the two branch predictor results for each branch prediction.

Any predictor can be used as selection predictor.

Examples:

McFarling: two bit predictor combined with gshare

Young and Smith: compiler-controlled static predictor combined with a two level adaptive predictor

Often, a simple predictor with reasonable results in the warm-up phase is combined with a sophisticated predictor delivering better results later.

The combined predictors are often better than the individual ones.

Page 71: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Misprediction rates

SAg, gshare and the combining (McFarling) predictor:

Application   committed instr.   cond. branches   taken branches   misprediction rate (%)
              (in millions)      (in millions)    (%)              SAg    gshare   combining
compress      80.4               14.4             54.6             10.1   10.1      9.9
gcc           250.9              50.4             49.0             12.8   23.9     12.2
perl          228.2              43.8             52.6              9.2   25.9     11.4
go            548.1              80.3             54.5             25.6   34.4     24.1
m88ksim       416.5              89.8             71.7              4.7    8.6      4.7
xlisp         183.3              41.8             39.5             10.3   10.2      6.8
vortex        180.9              29.1             50.1              2.0    8.3      1.7
jpeg          252.0              20.0             70.0             10.3   12.5     10.4
mean          267.6              46.2             54.3              8.6   14.5      8.1

Page 72: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Multipath execution

Multipath execution: in case of a branch, both paths are followed by the processor simultaneously; the wrong path is discarded later.

[Figure: a simple multipath pipeline with two instruction fetch (IF) and decode (DEC) stages feeding the instruction issue point, followed by RF read, ALU and RF write; the condition code (CC) decides which path is discarded]

Page 73: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Predication

Predication means the execution of an instruction depends on a predicate.

Only if the predicate is true is the instruction executed.

If all instructions of an instruction set support predication, this is called a fully predicated instruction set.

Examples for fully predicated instruction sets: IA64 Itanium, ARM

Fully predicated instruction sets can avoid conditional branches.

Example:

with cond. branch:    predicated:
CMP A, 0              CMP A, 0, P
BZ L1                 P.ADD B,C
ADD B,C               P.SUB C,D
SUB C,D               LD A,3
L1: LD 3,A

Page 74: Part 8 Instruction Level Parallelism (ILP) - Pipelining


Predication

On the hardware side, the predicated instruction is executed anyway.

In case of a false predicate, the result of the instruction is discarded

Advantages:

• conditional branches can be avoided

• no speculation necessary

• basic block length is increased resulting in better compiler optimization

Disadvantages:

• unnecessary execution of instructions

• additional predicate bits necessary in instruction format

Page 75: Part 8 Instruction Level Parallelism (ILP) - Pipelining

Trace cache

A trace is a sequence of executed instructions which can span several basic blocks.

Therefore, in a trace all branches are resolved.

A trace cache stores such traces while the trace is executed.

If the same trace is executed again, the instruction sequence can be taken from the trace cache; no branch needs to be executed.

While an instruction cache contains the static instruction sequence, the trace cache contains the dynamic instruction sequence.

Example for a trace cache: Pentium 4

[Figure: comparison of an I-cache holding the static instruction sequence with a trace cache holding dynamic traces]