
Chapter 3: Limitations on Instruction-Level Parallelism

Bernard Chen, Ph.D., University of Central Arkansas

Overcome Data Hazards with Dynamic Scheduling

If there is a data dependence, the hazard detection hardware stalls the pipeline
No new instructions are fetched or issued until the dependence is cleared
Dynamic scheduling: the hardware rearranges the instruction execution to reduce the stalls while maintaining data flow and exception behavior

RAW

If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped
If the data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard

I: add r1,r2,r3
J: sub r4,r1,r3

Overcome Data Hazards with Dynamic Scheduling

Key idea: allow instructions behind a stall to proceed

DIV F0  <- F2 / F4
ADD F10 <- F0 + F8
SUB F12 <- F8 - F14

ADD must wait for DIV's result, but SUB depends on neither instruction, so the hardware can execute it ahead of ADD:

DIV F0  <- F2 / F4
SUB F12 <- F8 - F14
ADD F10 <- F0 + F8

Enables out-of-order execution and allows out-of-order completion (e.g., SUB)
In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
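To make the key idea concrete, here is a minimal Python sketch (not from the slides; encoding each instruction as an (op, dest, sources) tuple is an assumption) that flags which of the three instructions must wait on an in-flight result:

# Minimal sketch: an instruction must wait if it reads a register that an
# earlier, still-executing instruction writes (a RAW dependence).

instrs = [
    ("DIV", "F0",  ["F2", "F4"]),   # F0  <- F2 / F4  (long latency)
    ("ADD", "F10", ["F0", "F8"]),   # F10 <- F0 + F8  (reads F0, must wait)
    ("SUB", "F12", ["F8", "F14"]),  # F12 <- F8 - F14 (independent)
]

pending_writes = set()              # results still being computed
for op, dest, srcs in instrs:
    waiting_on = set(srcs) & pending_writes
    if waiting_on:
        print(f"{op} stalls (RAW on {waiting_on})")
    else:
        print(f"{op} can issue and execute")
    pending_writes.add(dest)        # assume every earlier result is in flight

Running it reports that DIV and SUB can issue while ADD stalls on F0, which is exactly the reordering shown above.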

Overcome Data Hazards with Dynamic Scheduling

It offers several advantages:
Simplifies the compiler
Allows code compiled for one pipeline to run efficiently on a different pipeline
Allows the processor to tolerate unpredictable delays such as cache misses

Overcome Data Hazards with Dynamic Scheduling

However, dynamic execution creates WAR and WAW hazards and makes exceptions harder to handle

Name dependence: when two instructions use the same register or memory location (called a name), but there is no flow of data between the instructions associated with that name

There are two versions of name dependence

WAR

InstrJ writes an operand before InstrI reads it
If it causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

WAW

InstrJ writes an operand before InstrI writes it
If the output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Example

DIV r0 <- r2 / r4
ADD r6 <- r0 + r8
SUB r8 <- r10 - r14
MUL r6 <- r10 * r7
OR  r3 <- r5 or r9

RAW: ADD reads r0, which DIV writes
WAR: SUB writes r8, which ADD reads
WAW: MUL writes r6, which ADD also writes
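As a hedged illustration (Python, not part of the original slides; the (name, dest, sources) tuple format is assumed), this sketch enumerates the RAW, WAR, and WAW pairs in a sequence like the one above, and can double as a checker for the practice problem on the next slide:

# Sketch: enumerate data and name dependences between instruction pairs.

def classify(instrs):
    deps = []
    for i, (ni, di, si) in enumerate(instrs):
        for nj, dj, sj in instrs[i + 1:]:
            if di in sj:
                deps.append(f"RAW: {nj} reads {di} written by {ni}")
            if dj in si:
                deps.append(f"WAR: {nj} writes {dj} read by {ni}")
            if dj == di:
                deps.append(f"WAW: {nj} writes {di} also written by {ni}")
    return deps

example = [
    ("DIV", "r0", ["r2", "r4"]),
    ("ADD", "r6", ["r0", "r8"]),
    ("SUB", "r8", ["r10", "r14"]),
    ("MUL", "r6", ["r10", "r7"]),
    ("OR",  "r3", ["r5", "r9"]),
]
for d in classify(example):
    print(d)   # prints the RAW, WAR, and WAW hazards listed above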

For you to practice

DIV r0 <- r2 / r4
ADD r6 <- r0 + r8
ST  r1 <- r6
SUB r8 <- r10 - r14
MUL r6 <- r10 * r8

Overcome Data Hazards with Dynamic Scheduling

Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so that the instructions do not conflict
Register renaming resolves name dependences for registers, either by the compiler or by hardware
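A minimal sketch of the renaming idea (Python; the fresh p0, p1, ... physical-register names are invented for illustration):

# Sketch of register renaming: every write gets a fresh physical register,
# so WAR and WAW name conflicts disappear; only true RAW dependences remain.

def rename(instrs):
    mapping = {}        # architectural register -> current physical register
    renamed = []
    for n, (op, dest, srcs) in enumerate(instrs):
        new_srcs = [mapping.get(s, s) for s in srcs]  # read the latest copies
        mapping[dest] = f"p{n}"                       # fresh destination name
        renamed.append((op, mapping[dest], new_srcs))
    return renamed

code = [
    ("DIV", "r0", ["r2", "r4"]),
    ("ADD", "r6", ["r0", "r8"]),
    ("SUB", "r8", ["r10", "r14"]),   # WAR with ADD on r8
    ("MUL", "r6", ["r10", "r7"]),    # WAW with ADD on r6
]
for op, dest, srcs in rename(code):
    print(op, dest, "<-", srcs)

After renaming, SUB no longer shares r8 with ADD and MUL no longer shares r6 with it, so only the true DIV -> ADD dependence remains.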

Limits to ILP

Assumptions for an ideal/perfect machine to start:
1. Register renaming: infinite virtual registers, so all register WAW and WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Cache: perfect
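Under these assumptions only true data dependences limit the schedule, so ideal ILP is just instruction count divided by critical-path length. A toy Python sketch (the unit latency per operation is an added assumption):

# Sketch: ideal-machine ILP when only RAW dependences constrain the schedule
# (renaming removes WAR/WAW; branches and memory are assumed perfect).

def ideal_ilp(instrs):
    ready = {}      # register -> cycle its latest value becomes available
    finish = 0
    for op, dest, srcs in instrs:
        start = max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = start + 1          # assume 1-cycle latency per op
        finish = max(finish, ready[dest])
    return len(instrs) / finish

code = [
    ("DIV", "r0", ["r2", "r4"]),
    ("ADD", "r6", ["r0", "r8"]),     # the only true dependence (on DIV)
    ("SUB", "r8", ["r10", "r14"]),
    ("MUL", "r6", ["r10", "r7"]),
    ("OR",  "r3", ["r5", "r9"]),
]
print(ideal_ilp(code))               # 5 instructions / 2-cycle path = 2.5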

Limits to ILP: HW Model Comparison

                            Ideal Model   IBM Power 5
Instructions issued/clock   Infinite      4
Renaming registers          Infinite      48 integer + 40 FP
Branch prediction           Perfect       2% to 6% misprediction
Cache                       Perfect       1.92 MB L2, 36 MB L3

Performance beyond single-thread ILP

There can be much higher natural parallelism in some applications
For example, an online transaction processing system has natural parallelism among the multiple queries and updates presented by requests

Thread-level parallelism (TLP)

Thread: a process with its own instructions and data
A thread may be part of a parallel program of multiple processes, or it may be an independent program
Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute

Thread-level parallelism (TLP)

TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
Goal: use multiple instruction streams to improve
1. Throughput of computers that run many programs
2. Execution time of multithreaded programs
TLP can be more cost-effective to exploit than ILP

New Approach: Multithreaded Execution

Multithreading: multiple threads share the functional units of one processor via overlapping
The processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
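As a rough picture (Python; the field names are assumptions, not the slides' terminology), the duplicated per-thread state can be sketched as a small record, while the functional units stay shared:

# Sketch: the per-thread state a multithreaded core must duplicate.
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    pc: int = 0                                           # separate PC
    regs: list = field(default_factory=lambda: [0] * 32)  # separate registers
    page_table_base: int = 0     # separate page table for independent programs

# One context per hardware thread; ALUs, caches, etc. are shared between them.
contexts = [ThreadContext(pc=0x1000 * (t + 1)) for t in range(4)]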

New Approach: Multithreaded Execution

When to switch?
Alternate instructions per thread (fine grain)
When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

Fine-Grained Multithreading

Switches between threads on each instruction, causing the execution of multiple threads to be interleaved
Usually done in a round-robin fashion, skipping any stalled threads
The CPU must be able to switch threads every clock cycle
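A toy Python sketch of this round-robin policy (the cycle count and the stall trace are invented for illustration):

# Toy fine-grained multithreading: a different ready thread issues each
# cycle, round-robin, skipping any thread that is stalled that cycle.

def fine_grained(num_threads, stalled_at, cycles):
    """stalled_at: set of (thread, cycle) pairs during which a thread stalls."""
    issued = []
    t = 0
    for cycle in range(cycles):
        for _ in range(num_threads):        # find the next ready thread
            if (t, cycle) not in stalled_at:
                issued.append(f"T{t}")
                break
            t = (t + 1) % num_threads       # skip a stalled thread
        t = (t + 1) % num_threads           # round-robin for the next cycle
    return issued

# Thread 1 stalls (say, a cache miss) at cycle 5 and is simply skipped:
print(fine_grained(4, {(1, 5)}, 8))
# -> ['T0', 'T1', 'T2', 'T3', 'T0', 'T2', 'T3', 'T0']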

Multithreaded Categories

[Figure: issue slots per cycle under fine-grained multithreading; instructions from Threads 1-5 are interleaved cycle by cycle]

Fine-Grained Multithreading

Advantage: it can hide both short and long stalls, since instructions from other threads are executed when one thread stalls
Disadvantage: it slows down the execution of individual threads, since a thread ready to execute without stalls is delayed by instructions from other threads

Coarse-Grained Multithreading

Switches threads only on costly stalls, such as cache misses
Advantages:
Relieves the need for very fast thread switching
Doesn't slow down an individual thread, since instructions from other threads are issued only when that thread encounters a costly stall

Coarse-Grained Multithreading

Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
Since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen
The new thread must fill the pipeline before instructions can complete
Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
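A matching toy sketch of the coarse-grained policy (Python; the 2-cycle refill penalty and the stall trace are assumptions):

# Toy coarse-grained multithreading: stay on one thread until it hits a
# costly stall, then switch and pay a pipeline-refill penalty.

REFILL = 2      # cycles lost refilling the pipeline after a switch (assumed)

def coarse_grained(stalls, total_cycles, num_threads=2):
    """stalls: cycles at which the running thread takes a costly stall."""
    timeline, t, cycle = [], 0, 0
    while cycle < total_cycles:
        if cycle in stalls:
            t = (t + 1) % num_threads        # switch on the costly stall
            for _ in range(REFILL):          # new thread refills the pipeline
                timeline.append("refill")
                cycle += 1
        else:
            timeline.append(f"T{t}")
            cycle += 1
    return timeline

# One costly stall at cycle 3:
print(coarse_grained({3}, 10))
# -> ['T0', 'T0', 'T0', 'refill', 'refill', 'T1', 'T1', 'T1', 'T1', 'T1']

The switch pays off only when the avoided stall is much longer than the refill, which is the refill << stall condition above.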

Multithreaded Categories

[Figure: issue slots per cycle under coarse-grained multithreading; Threads 1-5 each run for a stretch before switching, shown with a 2-clock-cycle switch]