CS359: Computer Architecture
Chapter 3 & Appendix C, Part B: ILP and Its Exploitation
Yanyan Shen
Department of Computer Science and Engineering
Shanghai Jiao Tong University
Chapter 3: ILP and Its Exploitation
Outline
3.1 Concepts and Challenges
3.2 Basic Compiler Techniques
3.3 Reducing Branch Costs with Advanced Branch Prediction
3.4 Overcoming Data Hazards with Dynamic Scheduling
3.5 Hardware-based Speculation
Background
"Instruction-level parallelism (ILP)" refers to overlapping the execution of instructions; pipelining became a universal technique by 1985
There are two main approaches to exploiting ILP:
Hardware-based dynamic approaches: used in server and desktop processors
Compiler-based static approaches: not as successful outside of scientific applications
The goal of this chapter: 1) to look at the limitations imposed by data and control hazards; 2) to investigate ways of increasing the ability of the compiler (software) and the processor (hardware) to exploit ILP
Pipeline CPI
When exploiting instruction-level parallelism, the goal is to minimize CPI
Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data hazard stalls + Control Stalls
Ideal pipeline CPI is a measure of the maximum performance attainable by the implementation
By reducing each term on the right-hand side, we decrease the overall pipeline CPI or, equivalently, improve IPC (instructions per clock)
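As a numeric sketch of the decomposition above (the stall counts below are illustrative, not taken from any measured machine):

```c
#include <assert.h>

/* Pipeline CPI = ideal CPI + structural stalls + data-hazard stalls
   + control stalls, each expressed as average cycles per instruction. */
double pipeline_cpi(double ideal, double structural,
                    double data_hazard, double control) {
    return ideal + structural + data_hazard + control;
}

/* IPC (instructions per clock) is simply the reciprocal of CPI. */
double ipc(double cpi) { return 1.0 / cpi; }
```

For example, an ideal CPI of 1.0 with 0.25 structural, 0.5 data-hazard, and 0.25 control stall cycles per instruction yields a pipeline CPI of 2.0, i.e. an IPC of 0.5.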
What is Instruction-Level Parallelism
Basic block: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
Parallelism within a basic block is limited: the average dynamic branch frequency is often between 15% and 25%, so a typical basic block contains only 3-6 instructions
Conclusion: Must optimize across branches
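A hypothetical C fragment makes the definition concrete: every branch point ends a basic block, so straight-line regions stay short.

```c
/* Hypothetical example: the loop body splits into small basic blocks
   at every branch, which is what limits within-block parallelism. */
int count_positive(const int *a, int n) {
    int count = 0;
    for (int i = 0; i < n; i++) {   /* loop-back branch: block boundary */
        if (a[i] > 0)               /* conditional branch: block boundary */
            count++;                /* a one-statement basic block */
    }
    return count;
}
```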
Dependences and Hazards
Dependences:
Determining how one instruction depends on another is critical to determining 1) how much parallelism exists in a program and 2) how that parallelism can be exploited!
Hazards:
When hazards exist, the pipeline may need to stall in order to resolve them
Three Types of Dependences
Data dependences (also called true data dependences): data flows from one instruction to another
Name dependences (antidependence and output dependence)
Control dependence
Data Dependences
Dependencies are a property of programs
Pipeline organization determines whether a dependence results in an actual hazard being detected and whether that hazard causes a stall
A data dependence conveys:
the possibility of a hazard
the order in which results must be calculated
an upper bound on the exploitable instruction-level parallelism
Name Dependences
Two instructions use the same register or memory location, but with no flow of information between them
Not a true data dependence, but a problem when reordering instructions
Antidependence: instruction j writes a register or memory location that instruction i reads; the initial ordering (i before j) must be preserved
Output dependence: instructions i and j write the same register or memory location; the ordering must be preserved
To resolve name dependences, use register renaming techniques
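Renaming can be sketched at the source level (a hypothetical scalar example; hardware applies the same idea to physical registers):

```c
/* Before renaming, reusing the name r creates an antidependence
   (the second write to r must wait for the first read of r) and an
   output dependence (two writes to r). Renaming the second use to r2
   leaves only true (RAW) dependences, so the two computations can be
   reordered or overlapped freely. */
int renamed_sum(int a, int b, int c, int d) {
    int r  = a + b;   /* write r                 */
    int s  = r * 2;   /* read r  (true dep on r) */
    int r2 = c + d;   /* renamed: was "r" again  */
    int t  = r2 * 3;  /* read r2 (true dep on r2)*/
    return s + t;
}
```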
Control Dependence
The ordering of instruction i with respect to a branch instruction
An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution would no longer be controlled by the branch
An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution would be controlled by the branch
Data Hazards
Read after write (RAW): j tries to read a source before i writes it. Program order must be preserved!
Write after write (WAW): j tries to write an operand before it is written by i
Write after read (WAR): j tries to write a destination before it is read by i, so i incorrectly gets the new value
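The three cases can be captured in a hypothetical helper that classifies the hazard between an earlier instruction i and a later instruction j from their destination and source registers (a simplified two-operand sketch, not from the text; register 0 means "none"):

```c
#include <string.h>

/* Classify the hazard between earlier instruction i and later
   instruction j, given each one's destination register and a single
   source register. Returns "RAW", "WAW", "WAR", or "none".
   RAW is checked first: it is the only true data dependence. */
const char *hazard(int i_dest, int i_src, int j_dest, int j_src) {
    if (i_dest != 0 && j_src == i_dest)  return "RAW"; /* j reads what i writes  */
    if (i_dest != 0 && j_dest == i_dest) return "WAW"; /* both write the same reg */
    if (i_src != 0 && j_dest == i_src)   return "WAR"; /* j writes what i reads   */
    return "none";
}
```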
Outline
3.1 Concepts and Challenges
3.2 Basic Compiler Techniques
3.3 Reducing Branch Costs with Advanced Branch Prediction
3.4 Overcoming Data Hazards with Dynamic Scheduling
3.5 Hardware-based Speculation
Latencies of FP operations
A compiler’s ability to perform pipeline scheduling is dependent both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline
To avoid a pipeline stall, the execution of a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction
Pipeline Stalls
Example:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;

Loop: L.D     F0,0(R1)
      stall
      ADD.D   F4,F0,F2
      stall
      stall
      S.D     F4,0(R1)
      DADDUI  R1,R1,#-8
      stall              ;1-cycle delay: BNE depends on DADDUI
      BNE     R1,R2,Loop
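Counting issue slots and stall lines in the listing above gives the per-iteration cost (assuming one clock per issued instruction and per stall line, as the listing implies):

```c
/* Cycle count per iteration of the unscheduled loop:
   5 instructions (L.D, ADD.D, S.D, DADDUI, BNE) plus
   4 stalls (1 after L.D, 2 after ADD.D, 1 after DADDUI). */
int unscheduled_cycles(void) {
    int instructions = 5;
    int stalls = 4;
    return instructions + stalls;
}
```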
Pipeline Scheduling
Scheduled code:
Loop: L.D     F0,0(R1)
      DADDUI  R1,R1,#-8
      ADD.D   F4,F0,F2
      stall
      stall
      S.D     F4,8(R1)
      BNE     R1,R2,Loop

Only the two stalls after ADD.D remain, so each iteration now takes 7 clock cycles instead of 9
Loop Unrolling
Unroll by a factor of 4 (assume the number of elements is divisible by 4) and eliminate the now-unnecessary intermediate instructions

Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     F4,0(R1)    ;drop DADDUI & BNE
      L.D     F6,-8(R1)
      ADD.D   F8,F6,F2
      S.D     F8,-8(R1)   ;drop DADDUI & BNE
      L.D     F10,-16(R1)
      ADD.D   F12,F10,F2
      S.D     F12,-16(R1) ;drop DADDUI & BNE
      L.D     F14,-24(R1)
      ADD.D   F16,F14,F2
      S.D     F16,-24(R1)
      DADDUI  R1,R1,#-32
      BNE     R1,R2,Loop
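At the source level, the unrolled loop corresponds to this C sketch (a hypothetical rendering; `add_scalar_unrolled` and `demo_total` are illustrative names, and the real payoff comes once the unrolled body is also scheduled):

```c
/* Unroll by 4: one copy of the loop overhead (decrement + branch)
   per four elements, and four independent body copies that a
   scheduler can interleave. Assumes n is divisible by 4. */
void add_scalar_unrolled(double *x, int n, double s) {
    for (int i = n - 1; i >= 0; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}

/* Small check helper: apply to {1,2,3,4} and return the total. */
double demo_total(double s) {
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    add_scalar_unrolled(x, 4, s);
    return x[0] + x[1] + x[2] + x[3];
}
```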