Yonsei University

Chapter 13
Instruction-Level Parallelism and Superscalar Processors

Source: soc.yonsei.ac.kr/class/material/computersystems/2003/chapter13.pdf


Contents

• Overview
• Design Issues
• Pentium II
• PowerPC
• MIPS R10000
• UltraSPARC II
• IA-64/Merced


Overview

• The essence of the superscalar approach is the ability to execute instructions independently in different pipelines

• The concept can be further exploited by allowing instructions to be executed in an order different from the program order


General Superscalar Organization

[Figure: general superscalar organization]


Reported Speedups

Reference | Speedup
[TJAD70] | 1.8
[KUCK72] | 8
[WEIS84] | 1.58
[ACOS86] | 2.7
[SOHI90] | 1.8
[SMIT89] | 2.3
[JOUP89b] | 2.2
[LEE91] | 7


Superscalar versus Superpipelined

• Many pipeline stages perform tasks that require less than half a clock cycle

• Doubled internal clock speed allows the performance of two tasks in one external clock cycle


Comparison

[Figure: comparison of superscalar and superpipelined approaches]


Superscalar versus Superpipelined

• Superpipeline
  – The function performed in each stage can be split into 2 nonoverlapping parts and each can execute in half a clock cycle
  – A superpipeline implementation that behaves in this fashion is said to be of degree 2

• Superscalar
  – Capable of executing 2 instances of each stage in parallel

• The superpipelined processor falls behind the superscalar processor at the start of the program and at each branch target
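
The start-up lag in the last bullet can be made concrete with a small timing model. The sketch below is an idealized illustration and not part of the original slides: it assumes a pipeline with `stages` stages and no stalls, and the function names are hypothetical.

```python
from fractions import Fraction
from math import ceil


def base_cycles(n, stages=4):
    """Base pipeline: one instruction is issued per clock cycle."""
    return Fraction(stages + n - 1)


def superpipelined_cycles(n, stages=4, degree=2):
    """Superpipelined: one instruction is issued every 1/degree of a cycle."""
    return stages + Fraction(n - 1, degree)


def superscalar_cycles(n, stages=4, degree=2):
    """Superscalar: `degree` instructions are issued together each cycle."""
    return Fraction(stages + ceil(n / degree) - 1)


if __name__ == "__main__":
    for n in (2, 6, 20):
        print(n,
              float(base_cycles(n)),
              float(superpipelined_cycles(n)),
              float(superscalar_cycles(n)))
```

Under these assumptions, for an even number of instructions the superpipelined machine of degree 2 finishes half a cycle after the superscalar machine of degree 2, because its second instruction starts half a cycle late. The same lag recurs whenever the pipeline restarts at a branch target.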


Limitations

• The superscalar approach depends on the ability to execute multiple instructions in parallel

• The term instruction-level parallelism refers to the degree to which, on average, the instructions of a program can be executed in parallel

• A combination of compiler-based optimization and hardware techniques can be used to maximize instruction-level parallelism


Limitations

• True data dependency
• Procedural dependency
• Resource conflicts
• Output dependency
• Antidependency


True Data Dependency

    add r1, r2
    move r3, r1

• The second instruction needs data produced by the first instruction


True Data Dependency

[Figure: pipeline timing of i0 and i1 over cycles 0 to 9, for the no-dependency case and for a data dependency (i1 uses data computed by i0)]


True Data Dependency

• With no dependency, two instructions can be fetched and executed in parallel

• If there is a data dependency between the first and second instructions, then the second instruction is delayed as many clock cycles as required to remove the dependency


True Data Dependency

    load r1, eff  ; load register r1 with the contents of
                  ; effective memory address eff
    move r3, r1   ; load register r3 with the contents of r1

• A typical RISC processor takes two or more cycles to perform a load from memory because of the delay of an off-chip memory or cache access

• One way to compensate for this delay is for the compiler to reorder instructions so that one or more subsequent instructions that do not depend on the memory load can begin flowing through the pipeline


Procedural Dependencies

• The instructions following a branch (taken or not taken) have a procedural dependency on the branch and cannot be executed until the branch is executed

• This type of procedural dependency also affects a scalar pipeline. The consequence for a superscalar pipeline is more severe, because a greater magnitude of opportunity is lost with each delay


Procedural Dependencies

[Figure: pipeline timing of i0, i1/branch, i2, i3, i4 and i5 over cycles 0 to 9]


Resource Conflict

• A resource conflict is a competition of two or more instructions for the same resource at the same time

• Resource conflicts can be overcome by duplication of resources


Resource Conflict

[Figure: pipeline timing of i0 and i1 over cycles 0 to 9, for the no-dependency case and for a resource conflict (i0 and i1 use the same functional unit)]


Instruction-Level Parallelism

• Instruction-level parallelism exists when instructions in a sequence are independent and thus can be executed in parallel by overlapping

Independent (can overlap) | Dependent (cannot overlap)
Load R1 <- R2 | Add R3 <- R3, "1"
Add R3 <- R3, "1" | Add R4 <- R3, R2
Add R4 <- R4, R2 | Store [R4] <- R0

• Instruction-level parallelism is determined by the frequency of true data dependencies and procedural dependencies in the code
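
As a rough illustration of how true data dependencies bound instruction-level parallelism, the sketch below computes the length of the longest dependency chain in each of the two three-instruction sequences on this slide. It is a simplified model under stated assumptions: every instruction takes one cycle, hardware resources are unlimited, and only true (read-after-write) dependencies are tracked. The function name and the tuple encoding are hypothetical.

```python
def min_cycles(instrs):
    """Fewest cycles needed when only true data dependencies constrain issue.

    Each instruction is (destination, sources); destination is None for a
    store, which writes memory rather than a register.
    """
    last_writer = {}  # register -> index of the latest instruction writing it
    depth = []        # depth[i] = earliest cycle (1-based) instruction i can run
    for i, (dest, srcs) in enumerate(instrs):
        d = 1
        for s in srcs:
            if s in last_writer:
                # Must wait one cycle after the producer of this operand.
                d = max(d, depth[last_writer[s]] + 1)
        depth.append(d)
        if dest is not None:
            last_writer[dest] = i
    return max(depth, default=0)


# Left-hand sequence: Load R1 <- R2; Add R3 <- R3, "1"; Add R4 <- R4, R2
independent = [("R1", ["R2"]), ("R3", ["R3"]), ("R4", ["R4", "R2"])]

# Right-hand sequence: Add R3 <- R3, "1"; Add R4 <- R3, R2; Store [R4] <- R0
dependent = [("R3", ["R3"]), ("R4", ["R3", "R2"]), (None, ["R4", "R0"])]

if __name__ == "__main__":
    print(min_cycles(independent), min_cycles(dependent))
```

In this model the first sequence can complete in a single cycle on a sufficiently wide machine, while the second needs one cycle per instruction because each instruction consumes the result of the one before it.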

Design Issues


Machine Parallelism

• Machine parallelism is a measure of the ability of the processor to take advantage of instruction-level parallelism

• Machine parallelism is determined by the number of instructions that can be fetched and executed at the same time (the number of parallel pipelines) and by the speed and sophistication of the mechanisms that the processor uses to find independent instructions


Instruction-Issue Policy

• Instruction-issue policy refers to the protocol used to issue instructions

• Three types of orderings
  – The order in which instructions are fetched
  – The order in which instructions are executed
  – The order in which instructions update the contents of register and memory locations

• Superscalar instruction issue policies
  – In-order issue with in-order completion
  – In-order issue with out-of-order completion
  – Out-of-order issue with out-of-order completion


In-Order Issue / In-Order Completion

• The simplest instruction-issue policy is to issue instructions in the exact order that would be achieved by sequential execution (in-order issue) and to write results in that same order (in-order completion)


Superscalar Instruction Issue

(a) In-order issue and in-order completion

• I1 requires two cycles to execute
• I3 and I4 conflict for the same functional unit
• I5 depends on the value produced by I4
• I5 and I6 conflict for a functional unit


In-Order Issue / Out-of-Order Completion

• With out-of-order completion, any number of instructions may be in the execution stage at any one time, up to the maximum degree of machine parallelism across all functional units

• Instruction issuing is stalled by a resource conflict, a data dependency, or a procedural dependency

• Output dependency (write-write dependency)
  – I1 : R3 <- R3 op R5
  – I2 : R4 <- R3 + 1
  – I3 : R3 <- R5 + 1
  – I4 : R7 <- R3 op R4


Superscalar Instruction Issue

(b) In-order issue and out-of-order completion

[Figure: in-order issue and out-of-order completion]


Out-of-Order Issue / Completion

• With in-order issue, the processor will only decode instructions up to the point of a dependency or conflict

• No additional instructions are decoded until the conflict is resolved

• It is necessary to decouple the decode and execute stages of the pipeline

• This is done with a buffer referred to as an instruction window

• With this organization, after a processor has finished decoding an instruction, it is placed in the instruction window


Superscalar Instruction Issue

(c) Out-of-order issue and out-of-order completion

[Figure: out-of-order issue and out-of-order completion]


Out-of-Order Issue / Completion

• As long as this buffer is not full, the processor can continue to fetch and decode new instructions

• When a functional unit becomes available in the execute stage, an instruction from the instruction window may be issued to the execute stage

• Any instruction may be issued, provided that (1) it needs the particular functional unit that is available and (2) no conflicts or dependencies block this instruction
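
The two issue conditions above can be sketched as a selection loop over the instruction window. This is a minimal illustration, not the logic of any particular processor: the dictionary fields, unit names and register names are hypothetical, each functional unit is assumed to accept one instruction per cycle, and only operand readiness is checked (output dependencies and antidependencies are assumed to be handled elsewhere, for example by register renaming).

```python
def select_for_issue(window, free_units, ready_regs):
    """Pick the instructions in the window that may issue this cycle.

    window:     decoded instructions in program order, each a dict with
                'name', 'unit' (functional unit needed) and 'srcs'
    free_units: functional units available this cycle
    ready_regs: registers whose values are already available
    """
    issued = []
    free = set(free_units)
    for instr in window:
        operands_ready = all(reg in ready_regs for reg in instr["srcs"])
        if instr["unit"] in free and operands_ready:
            issued.append(instr["name"])
            free.remove(instr["unit"])  # the unit is now busy for this cycle
    return issued


window = [
    {"name": "I1", "unit": "alu", "srcs": ["R1"]},
    {"name": "I2", "unit": "alu", "srcs": ["R2"]},   # loses the ALU to I1
    {"name": "I3", "unit": "mul", "srcs": ["R9"]},   # R9 is not ready yet
    {"name": "I4", "unit": "load", "srcs": ["R2"]},
]

issued = select_for_issue(window, {"alu", "mul", "load"}, {"R1", "R2"})
print(issued)
```

Here I4 issues ahead of I2 and I3, which is exactly what in-order issue would forbid: with in-order issue the processor would stop at I2.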


Out-of-Order Issue / Completion

• An instruction cannot be issued if it violates a dependency or conflict

• The difference is that more instructions are available for issuing, reducing the probability that a pipeline stage will have to stall

• In addition, a new dependency, which we referred to earlier as an antidependency (also called read-write dependency), arises


Out-of-Order Issue / Completion

• Code fragment
  – I1 : R3 <- R3 op R5
  – I2 : R4 <- R3 + 1
  – I3 : R3 <- R5 + 1
  – I4 : R7 <- R3 op R4

• The term antidependency is used because the constraint is similar to that of a true data dependency, but reversed
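
To make the three kinds of register dependency concrete, the sketch below classifies every pair of instructions in the code fragment above. It is an illustrative helper rather than a description of real issue logic: the function name and tuple encoding are hypothetical, every instruction is assumed to write exactly one register, and a dependency is reported only when no instruction between the pair overwrites the register involved.

```python
def find_dependencies(instrs):
    """Return (kind, earlier, later, register) tuples.

    instrs is a program-ordered list of (name, destination, sources).
    """
    deps = []
    for j, (name_j, dest_j, srcs_j) in enumerate(instrs):
        for i in range(j):
            name_i, dest_i, srcs_i = instrs[i]
            # Registers written by the instructions strictly between i and j.
            between = [dest for _, dest, _ in instrs[i + 1:j]]
            # True dependency (read after write): j reads the value i wrote.
            if dest_i in srcs_j and dest_i not in between:
                deps.append(("true", name_i, name_j, dest_i))
            # Output dependency (write after write): both write one register.
            if dest_i == dest_j and dest_i not in between:
                deps.append(("output", name_i, name_j, dest_i))
            # Antidependency (write after read): j overwrites what i reads.
            if dest_j in srcs_i and dest_j not in between:
                deps.append(("anti", name_i, name_j, dest_j))
    return deps


fragment = [
    ("I1", "R3", ["R3", "R5"]),
    ("I2", "R4", ["R3"]),
    ("I3", "R3", ["R5"]),
    ("I4", "R7", ["R3", "R4"]),
]

if __name__ == "__main__":
    for dep in find_dependencies(fragment):
        print(dep)
```

For this fragment the helper reports the antidependency between I2 and I3 on R3 (I3 must not overwrite R3 before I2 has read it) and the output dependency between I1 and I3, alongside the true dependencies I1 to I2, I2 to I4 and I3 to I4.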


Register Renaming

• When out-of-order instruction issuing and/or out-of-order instruction completion are allowed, we have seen that this gives rise to the possibility of output dependencies and antidependencies

• Output dependencies and antidependencies arise because the values in registers may no longer reflect the sequence of values dictated by the program flow


Register Renaming

• In essence, registers are allocated dynamically by the processor hardware, and they are associated with the values needed by instructions at various points in time

• Code fragment
  – I1 : R3b <- R3a op R5a
  – I2 : R4b <- R3b + 1
  – I3 : R3c <- R5a + 1
  – I4 : R7b <- R3c op R4b
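
A renaming pass of this kind can be sketched in a few lines. The version below is a toy model, not a hardware description: it assumes each logical register starts at version "a", that every write allocates the next letter, and that at most 26 versions of a register are needed. The function name and tuple encoding are hypothetical.

```python
from string import ascii_lowercase


def rename(instrs):
    """Give every register write a fresh version letter.

    instrs is a program-ordered list of (name, destination, sources);
    immediate operands are simply left out of sources.
    """
    version = {}  # logical register -> index of its current version letter

    def current(reg):
        return reg + ascii_lowercase[version.setdefault(reg, 0)]

    renamed = []
    for name, dest, srcs in instrs:
        new_srcs = [current(s) for s in srcs]  # read the latest versions first
        current(dest)                          # ensure dest starts at version a
        version[dest] += 1                     # allocate a new register
        renamed.append((name, current(dest), new_srcs))
    return renamed


fragment = [
    ("I1", "R3", ["R3", "R5"]),
    ("I2", "R4", ["R3"]),
    ("I3", "R3", ["R5"]),
    ("I4", "R7", ["R3", "R4"]),
]

if __name__ == "__main__":
    for name, dest, srcs in rename(fragment):
        print(name, dest, "<-", ", ".join(srcs))
```

After renaming, I3 writes R3c while I2 reads R3b, so I3 no longer has to wait for I2, and I1 and I3 no longer write the same register. Only the true dependencies remain.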


Machine Parallelism

• Three hardware techniques that can be used in a superscalar processor to enhance performance
  – Duplication of resources
  – Out-of-order issue
  – Renaming

• The base machine does not duplicate any of the functional units, but it can issue instructions out of order


Speedups

[Figure: speedups]


Branch Prediction

• Intel 80486: fetches both the next sequential instruction after a branch and, speculatively, the branch target instruction; because there are two pipeline stages between prefetch and execution, this strategy incurs a two-cycle delay when the branch is taken

• With the advent of RISC machines, the delayed branch strategy was explored

• This allows the processor to calculate the result of conditional branch instructions before any unusable instructions have been prefetched

• With this method, the processor always executes the single instruction that immediately follows the branch


Branch Prediction

• With the development of superscalar machines, the delayed branch strategy has less appeal

• The reason is that multiple instructions need to execute in the delay slot, raising several problems relating to instruction dependencies

• Thus, superscalar machines have returned to pre-RISC techniques of branch prediction


Superscalar Execution

[Figure: superscalar execution]


Superscalar Implementation

• Instruction fetch strategies that simultaneously fetch multiple instructions

• Logic for determining true dependencies involving register values, and mechanisms for communicating these values to where they are needed during execution

• Mechanisms for initiating, or issuing, multiple instructions in parallel


Superscalar Implementation

• Resources for parallel execution of multiple instructions, including multiple pipelined functional units and memory hierarchies capable of simultaneously servicing multiple memory references

• Mechanisms for committing the process state in correct order


Operation of the Pentium II

• The processor fetches instructions from memory

• Each instruction is translated into one or more fixed-length RISC instructions, known as micro-ops

• The processor executes the micro-ops on a superscalar pipeline organization

• The processor commits the results of each micro-op execution

PENTIUM II


The Pentium II Pipeline

[Figure: the Pentium II pipeline]


Instruction Fetch and Decode Unit

• The fetch operation consists of three pipelined stages

• First, the IFU1 stage fetches instructions from the instruction cache one line (32 bytes) at a time

• Next, the contents of the IFU1 buffer are passed to IFU2 16 bytes at a time

• IFU3 is capable of handling three Pentium II machine instructions in parallel


Pentium II Fetch/Decode Unit

[Figure: Pentium II fetch/decode unit]


Reorder Buffer

• Each buffer entry consists of the following fields
  – State
  – Memory address
  – Micro-op
  – Alias register

• Micro-ops enter the ROB in order

• Micro-ops are then dispatched from the ROB to the dispatch/execute unit out of order

• Micro-ops are retired from the ROB in order
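
The in-order entry, out-of-order completion and in-order retirement described above can be modeled with a simple queue. This is a toy sketch, not the Pentium II's actual entry layout: each entry keeps only the micro-op and a done flag standing in for the state field, and the class and method names are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: in-order entry, any-order completion, in-order retirement."""

    def __init__(self):
        self.entries = deque()  # oldest micro-op sits at the left end

    def enter(self, micro_op):
        """Add a micro-op at the tail, in program order."""
        self.entries.append({"op": micro_op, "done": False})

    def complete(self, micro_op):
        """Mark a micro-op as executed; calls may arrive in any order."""
        for entry in self.entries:
            if entry["op"] == micro_op:
                entry["done"] = True
                return
        raise KeyError(micro_op)

    def retire(self):
        """Retire finished micro-ops from the head only."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["op"])
        return retired


if __name__ == "__main__":
    rob = ReorderBuffer()
    for op in ("u1", "u2", "u3"):
        rob.enter(op)
    rob.complete("u3")
    print(rob.retire())  # u3 is done, but u1 at the head is not
    rob.complete("u1")
    print(rob.retire())
```

Because retirement only ever looks at the head of the queue, a micro-op that finishes early simply waits until everything older than it has also finished.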


Dispatch/Execute Unit

• The reservation station (RS) is responsible for retrieving micro-ops from the ROB, dispatching these for execution, and recording the results back in the ROB

• Five ports attach the RS to the five execution units

• Once execution is complete, the appropriate entry in the ROB is updated and the execution unit is available for another micro-op


Pentium II Dispatch/Execute Unit

[Figure: Pentium II dispatch/execute unit]


Retire Unit

• The retire unit (RU) works off of the reorder buffer to commit the results of instruction execution

• The RU must take into account branch mispredictions and micro-ops that have executed but for which preceding branches have not yet been validated


Branch Prediction

• The Pentium II uses a dynamic prediction strategy

• A branch target buffer (BTB) is maintained that caches information about recently encountered branch instructions

• Once the instruction is executed, the history portion of the appropriate entry is updated to reflect the result of the branch instruction
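
The predict-then-update cycle of a branch target buffer can be sketched as follows. This is a generic illustration and does not reproduce the Pentium II's actual history format or replacement policy: it assumes a 2-bit saturating counter as the history field and an unbounded table, and the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy BTB keyed by branch address, with a 2-bit saturating counter."""

    def __init__(self):
        self.table = {}  # branch address -> {"target": int, "counter": 0..3}

    def predict(self, address):
        """Return the predicted target, or None to predict not taken."""
        entry = self.table.get(address)
        if entry is not None and entry["counter"] >= 2:
            return entry["target"]
        return None

    def update(self, address, taken, target):
        """Update the history once the branch has actually executed."""
        entry = self.table.setdefault(address, {"target": target, "counter": 1})
        if taken:
            entry["target"] = target
            entry["counter"] = min(3, entry["counter"] + 1)
        else:
            entry["counter"] = max(0, entry["counter"] - 1)


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(btb.predict(0x40))          # unseen branch: predicted not taken
    btb.update(0x40, taken=True, target=0x80)
    print(btb.predict(0x40))          # now predicted taken, to 0x80
```

With a counter of this kind, a single not-taken outcome in a long run of taken branches (for example at a loop exit) does not immediately flip the prediction.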


PowerPC 601

• Dispatch Unit
• Instruction Pipelines

PowerPC


PowerPC 601 Block Diagram

[Figure: PowerPC 601 block diagram]


PowerPC 601 Pipeline Structure

[Figure: PowerPC 601 pipeline structure]


Dispatch Unit

• The dispatch unit takes instructions from the cache and loads them into the dispatch queue

• Instructions are dispatched according to the following scheme
  – Branch processing unit
  – Floating-point unit
  – Integer unit

• The dispatch unit contains logic that enables it to calculate the prefetch address


Instruction Pipelines

• There is a common fetch cycle for all instructions

• The second cycle begins with the dispatch of an instruction to a particular unit

• For branch instructions, the second cycle involves decoding and executing the instruction as well as predicting branches

• Floating-point instructions follow a similar pipeline, but there are two execute cycles, reflecting the complexity of floating-point operations


PowerPC 601 Pipeline

[Figure: PowerPC 601 pipeline]


Instruction Pipeline

• The compiler can transform the sequence

    compare
    branch
    compare
    branch
    …

  to the sequence

    compare
    compare
    …
    branch
    branch
    …


Branch Processing

• To achieve zero-cycle branching, the following strategies are employed:
  1. Logic is provided to scan through the dispatch buffer for branches
  2. An attempt is made to determine the outcome of conditional branches. In any case, as soon as a branch instruction is encountered, logic determines if the branch
     a. Will be taken
     b. Will not be taken
     c. Outcome cannot yet be determined


Conditional Branch

• (a) C code

    if (a > 0)
        a = a + b + c + d + e;
    else
        a = a - b - c - d - e;


Conditional Branch

• (b) Assembly code


Branch Prediction: Not Taken

• (a) Correct prediction: Branch was not taken


Branch Prediction: Not Taken

• (b) Incorrect prediction: Branch was taken


PowerPC 620

• The 620 is the first 64-bit implementation of the PowerPC architecture

• A notable feature of this implementation is that it includes six independent execution units
  – Instruction unit
  – Three integer units
  – Load/store unit
  – Floating-point unit

• The 620 employs a high-performance branch prediction strategy that involves prediction logic, register rename buffers, and reservation stations inside the execution units


MIPS R10000

• The MIPS R10000, which has evolved from the MIPS R4000 and uses the same instruction set, is a rather clean, straightforward implementation of superscalar design principles

MIPS R10000


MIPS R10000 Structure

[Figure: MIPS R10000 structure]


Internal Organization

• An external L2 cache feeds separate L1 instruction and data caches

• The prefetch and dispatch unit (PDU) fetches instructions into an instruction buffer that can hold up to 12 instructions

• The integer execution unit contains two complete ALUs and eight register windows

• The load/store unit (LSU) generates the virtual address of all loads and stores and supports one load or store per cycle

UltraSPARC-II


UltraSPARC-II Block Diagram

[Figure: UltraSPARC-II block diagram]


Pipeline

• Nine-stage instruction pipeline

– Fetch
– Decode
– Group
– Execute
– Cache
– N1
– N2
– N3
– Write
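
The nine stages listed above can be laid out on a cycle-by-cycle timeline. The following is a minimal sketch, not a model of the real UltraSPARC-II: it assumes one instruction enters the pipeline per cycle and that nothing ever stalls. The function names `stage_schedule` and `total_cycles` are hypothetical.

```python
# Stage names taken from the slide; everything else is an illustrative assumption.
STAGES = ["Fetch", "Decode", "Group", "Execute", "Cache", "N1", "N2", "N3", "Write"]


def stage_schedule(num_instructions):
    """Return {instruction index: {stage name: cycle}} for an ideal pipeline.

    Assumes one instruction enters per cycle and no stage ever stalls.
    Cycles are numbered from 0.
    """
    return {
        i: {stage: i + s for s, stage in enumerate(STAGES)}
        for i in range(num_instructions)
    }


def total_cycles(num_instructions):
    """Cycles needed to finish all instructions under the same assumptions."""
    # The first instruction takes len(STAGES) cycles; each later one adds one.
    return len(STAGES) + num_instructions - 1


if __name__ == "__main__":
    for index, cycles in stage_schedule(3).items():
        print(index, cycles)
    print("total cycles:", total_cycles(3))
```

Under these assumptions each additional instruction costs only one extra cycle, which is the benefit pipelining is meant to deliver.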

UltraSPARC-II


UltraSPARC-II Instruction Pipeline

[Figure: UltraSPARC-II instruction pipeline]

UltraSPARC-II


IA-64/MERCED

• The basic concepts underlying IA-64

– Instruction-level parallelism
– Long or very long instruction words
– Branch predication (not the same thing as branch prediction)
– Speculative loading

IA-64/MERCED


Traditional Superscalar vs. IA-64

| Superscalar | IA-64 |
|---|---|
| RISC-like instructions, one per word | RISC-like instructions bundled into groups of three |
| Multiple parallel execution units | Multiple parallel execution units |
| Reorders and optimizes instruction stream at run time | Reorders and optimizes instruction stream at compile time |
| Branch prediction with speculative execution of one path | Speculative execution along both paths of a branch |
| Loads data from memory only when needed and tries to find the data in the caches first | Speculatively loads data before it is needed, and still tries to find data in the caches first |

IA-64/MERCED


Motivation

• For Intel, the move to a new architecture, one that is not hardware compatible with the x86 instruction architecture, is a momentous decision

• Processor designers have few choices in how to use this glut of transistors

• One approach is to dump those extra transistors into bigger on-chip caches

IA-64/MERCED


Organization

• IA-64 can be implemented in a variety of organizations

• Key features

– A generous number of registers
– Multiple execution units

• The register file is quite large compared to most RISC and superscalar machines

• The number of execution units is a function of the number of transistors available in a particular implementation

• The processor will exploit parallelism to the extent that it can

IA-64/MERCED


General Organization for IA-64

[Figure: general organization for IA-64]

IA-64/MERCED


Instruction Format

• IA-64 defines a 128-bit bundle that contains three instructions and a template field

• The bundled instructions do not have to be in the original program order

• Each instruction has a fixed-length 40-bit format

• IA-64 makes use of more registers than a typical RISC machine: 128 integer and 128 floating-point registers

• To accommodate the predicated execution technique, an IA-64 machine includes 64 predicate registers
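
The bundle arithmetic can be checked with a short sketch. With the figures on this slide, three 40-bit instructions occupy 120 bits, which leaves 128 − 120 = 8 bits for the template field. The field order used below (template in the low-order bits, then the three slots) and the names `pack_bundle` and `unpack_bundle` are assumptions made for illustration, not taken from the slides. The `slot_bits` parameter is adjustable because the IA-64 architecture as later shipped in Itanium uses 41-bit instruction slots with a 5-bit template.

```python
def pack_bundle(template, slots, slot_bits=40, bundle_bits=128):
    """Pack a template and three instruction slots into one integer bundle.

    Layout is an illustrative assumption: template in the low bits,
    followed by slot 0, slot 1, slot 2.
    """
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    template_bits = bundle_bits - 3 * slot_bits
    if not 0 <= template < (1 << template_bits):
        raise ValueError("template does not fit in the template field")
    bundle = template
    shift = template_bits
    for slot in slots:
        if not 0 <= slot < (1 << slot_bits):
            raise ValueError("instruction does not fit in its slot")
        bundle |= slot << shift
        shift += slot_bits
    return bundle


def unpack_bundle(bundle, slot_bits=40, bundle_bits=128):
    """Split a bundle back into (template, [slot0, slot1, slot2])."""
    template_bits = bundle_bits - 3 * slot_bits
    template = bundle & ((1 << template_bits) - 1)
    slot_mask = (1 << slot_bits) - 1
    slots = [
        (bundle >> (template_bits + i * slot_bits)) & slot_mask
        for i in range(3)
    ]
    return template, slots


if __name__ == "__main__":
    packed = pack_bundle(0x1D, [0x1, 0x2, 0x3])
    print(hex(packed), unpack_bundle(packed))
```

Calling the same functions with `slot_bits=41` models the 41-bit slot, 5-bit template split without any other change.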

IA-64/MERCED


IA-64 Instruction Format

[Figure: IA-64 instruction format]

IA-64/MERCED


Predicated Execution

• Predication is a technique whereby the compiler determines which instructions may execute in parallel

• An IA-64 compiler instead does the following

1. At the if point in the program, insert a compare that creates two predicates
2. Augment each instruction in the then path with a reference to a predicate register
3. The processor executes instructions along both paths
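
The three steps can be mimicked in software. The sketch below is a toy model, not IA-64 code: it assumes the else path is guarded by the complementary predicate, and the names `run_predicated`, `p1`, and `p2` are hypothetical. It models `if a > b: r = a - b, else: r = b - a` without taking a branch.

```python
def run_predicated(a, b):
    """Toy model of predicated execution for an if/else on a > b."""
    regs = {"a": a, "b": b, "r": None}

    # Step 1: a single compare creates two complementary predicates.
    predicates = {"p1": a > b, "p2": not (a > b)}

    # Step 2: each instruction names the predicate register that guards it.
    program = [
        ("p1", "r", lambda r: r["a"] - r["b"]),  # then path
        ("p2", "r", lambda r: r["b"] - r["a"]),  # else path (assumed guard)
    ]

    # Step 3: both paths are executed; a result is committed only when
    # the guarding predicate is true, otherwise it is discarded.
    for guard, dest, operation in program:
        result = operation(regs)
        if predicates[guard]:
            regs[dest] = result
    return regs["r"]


if __name__ == "__main__":
    print(run_predicated(7, 3), run_predicated(3, 7))
```

Because no branch is taken, there is nothing to mispredict; the cost is that the work done on the path whose predicate turns out false is thrown away.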

IA-64/MERCED


IA-64 Predication & Speculative Loading

[Figure: (a) Predication]

IA-64/MERCED


IA-64 Predication & Speculative Loading

[Figure: (b) Speculative loading]

IA-64/MERCED


Example of Predication

[Figure: example of predication]

IA-64/MERCED


Speculative Load

• Enable the processor to load data from memory before the program needs it, to avoid memory latency delays

• Processor postpones the reporting of exceptions until it becomes necessary to report the exception

• Rearrange the code so that loads are done as early as possible

IA-64/MERCED


The Eight Queens Problem

[Figure: the Eight Queens problem]

IA-64/MERCED


Speculative Load

• A load instruction in the original program is replaced by two instructions

– A speculative load executes the memory fetch and performs exception detection, but does not deliver the exception
– A checking instruction remains in the place of the original load and delivers exceptions
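
The split into a speculative load and a checking instruction can be illustrated with a toy model. Everything here is an assumption made for the sketch: memory is a dictionary, a missing address stands in for a faulting access, and the names `speculative_load`, `check`, `use_if`, and `MemoryFault` are hypothetical.

```python
class MemoryFault(Exception):
    """Stands in for the exception a faulting memory access would raise."""


# Toy memory: any address not present is treated as a faulting access.
MEMORY = {0x100: 42}


def load(addr):
    """Ordinary load: delivers the exception immediately."""
    if addr not in MEMORY:
        raise MemoryFault(hex(addr))
    return MEMORY[addr]


def speculative_load(addr):
    """Fetch early; detect a fault but record it instead of delivering it."""
    try:
        return load(addr), None
    except MemoryFault as fault:
        return None, fault


def check(value, deferred_fault):
    """Sits where the original load was and delivers any deferred fault."""
    if deferred_fault is not None:
        raise deferred_fault
    return value


def use_if(condition, addr):
    """The load is hoisted above the branch; the check stays inside it."""
    speculated = speculative_load(addr)  # issued before the branch is resolved
    if condition:
        return check(*speculated)
    return 0


if __name__ == "__main__":
    print(use_if(True, 0x100))   # value was loaded early and passes the check
    print(use_if(False, 0x999))  # fault was detected but never delivered
```

In the second call the bad address is touched speculatively, yet no exception reaches the program because the path containing the check is never taken. That is the behavior the two-instruction split is meant to provide.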

IA-64/MERCED