Digital Design and Computer Architecture: ARM®...


Chapter 7 <1> Digital Design and Computer Architecture: ARM® Edition © 2015

Chapter 7

Digital Design and Computer Architecture: ARM® Edition
Sarah L. Harris and David Money Harris

Chapter 7 :: Topics

•  Introduction
•  Performance Analysis
•  Single-Cycle Processor
•  Multicycle Processor
•  Pipelined Processor
•  Advanced Microarchitecture

Introduction

•  Microarchitecture: how to implement an architecture in hardware
•  Processor:
   –  Datapath: functional blocks
   –  Control: control signals

Microarchitecture

•  Multiple implementations for a single architecture:
   –  Single-cycle: each instruction executes in a single cycle
   –  Multicycle: each instruction is broken up into a series of shorter steps
   –  Pipelined: each instruction is broken up into a series of steps, and multiple instructions execute at once

Processor Performance

•  Program execution time:

   Execution Time = (#instructions)(cycles/instruction)(seconds/cycle)

•  Definitions:
   –  CPI: cycles/instruction
   –  Clock period: seconds/cycle
   –  IPC: instructions/cycle = 1/CPI
•  The challenge is to satisfy constraints of:
   –  Cost
   –  Power
   –  Performance

ARM Processor

•  Consider a subset of ARM instructions:
   –  Data-processing instructions:
      •  ADD, SUB, AND, ORR
      •  with register and immediate Src2, but no shifts
   –  Memory instructions:
      •  LDR, STR
      •  with positive immediate offset
   –  Branch instructions:
      •  B

Review: Instruction Formats

[Figure: instruction format diagrams, including the Branch format]

Architectural State Elements

Determines everything about a processor:
   –  Architectural state:
      •  16 registers (including PC)
      •  Status register
   –  Memory

ARM Architectural State Elements

[Figure: state elements with bus widths: PC register (PC', PC), Instruction Memory (A, RD), Register File (A1, A2, A3, WD3, WE3, RD1, RD2, R15), Data Memory (A, RD, WD, WE), and Status register; data buses are 32 bits wide, register addresses and status are 4 bits wide]

Single-Cycle ARM Processor

•  Datapath
•  Control

•  Datapath: start with the LDR instruction
•  Example: LDR R1, [R2, #5]
   General form: LDR Rd, [Rn, imm12]

Single-Cycle Datapath: LDR fetch

STEP 1: Fetch instruction

[Figure: PC drives the Instruction Memory address input A; the memory outputs Instr on RD]

Single-Cycle Datapath: LDR Reg Read

LDR Rd, [Rn, imm12]

STEP 2: Read source operands from RF

[Figure: Instr[19:16] drives Register File address A1 as RA1]

Single-Cycle Datapath: LDR Immed.

LDR Rd, [Rn, imm12]

STEP 3: Extend the immediate

[Figure: Instr[11:0] passes through the Extend unit to produce ExtImm]

Single-Cycle Datapath: LDR Address

LDR Rd, [Rn, imm12]

STEP 4: Compute the memory address

[Figure: the ALU adds SrcA and SrcB (ExtImm) with ALUControl = 00 to produce ALUResult]

Single-Cycle Datapath: LDR Mem Read

LDR Rd, [Rn, imm12]

STEP 5: Read data from memory and write it back to the register file

[Figure: ALUResult addresses the Data Memory; ReadData returns to the Register File write port, with Instr[15:12] on A3. Control values: RegWrite = 1, ALUControl = 00]

Single-Cycle Datapath: PC Increment

STEP 6: Determine address of next instruction

[Figure: an adder computes PCPlus4 = PC + 4, which feeds PC'. Control values: RegWrite = 1, ALUControl = 00]
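The six LDR steps above can be traced in software. The following is a minimal sketch, not the processor's actual implementation: the function name `execute_ldr`, the dictionary-based memory model, and the omission of R15 handling and alignment checks are all assumptions made for illustration.

```python
def execute_ldr(instr, pc, regs, data_mem):
    """Walk one LDR Rd, [Rn, imm12] through the datapath steps.

    instr    -- 32-bit instruction word (STEP 1 result: Instr = InstrMem[PC])
    pc       -- address of this instruction
    regs     -- list of register values (R15 handling is left out here)
    data_mem -- dict mapping address -> 32-bit word (hypothetical memory model)
    Returns the address of the next instruction.
    """
    # STEP 2: read the base register Rn, selected by Instr[19:16]
    rn = (instr >> 16) & 0xF
    src_a = regs[rn]

    # STEP 3: zero-extend the 12-bit immediate, Instr[11:0]
    ext_imm = instr & 0xFFF

    # STEP 4: the ALU adds base and offset (ALUControl = 00)
    alu_result = (src_a + ext_imm) & 0xFFFFFFFF

    # STEP 5: read data memory and write the word back to Rd, Instr[15:12]
    rd = (instr >> 12) & 0xF
    regs[rd] = data_mem[alu_result]

    # STEP 6: determine the address of the next instruction
    return pc + 4


# LDR R1, [R2, #5] encodes as 0xE5921005 (Rn = 2, Rd = 1, imm12 = 5)
regs = [0] * 16
regs[2] = 0x100
next_pc = execute_ldr(0xE5921005, pc=0x20, regs=regs,
                      data_mem={0x105: 0xDEADBEEF})
```

After the call, R1 holds the word stored at address 0x105 and `next_pc` is the original PC plus 4.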

Single-Cycle Datapath: Access to PC

PC can be source/destination of instruction
•  Source: R15 must be available in Register File
   –  PC is read as the current PC plus 8
•  Destination: Be able to write result to PC

[Figure: a second adder computes PCPlus8 = PCPlus4 + 4 for the Register File R15 input; a multiplexer controlled by PCSrc selects PCPlus4 (0) or the value being written back (1) as PC'. Control values: PCSrc = 1, RegWrite = 1, ALUControl = 00]

Single-Cycle Datapath: STR

STR Rd, [Rn, imm12]

Expand datapath to handle STR:
•  Write data in Rd to memory

[Figure: Instr[15:12] drives Register File address A2 as RA2; RD2 feeds the Data Memory WD input as WriteData. Control values: PCSrc = 0, RegWrite = 0, MemWrite = 1, ALUControl = 00]

Single-Cycle Datapath: Data-processing

ADD Rd, Rn, imm8

With immediate Src2:
•  Read from Rn and Imm8 (ImmSrc chooses the zero-extended Imm8 instead of Imm12)
•  Write ALUResult to register file
•  Write to Rd

[Figure: a multiplexer controlled by MemtoReg selects ALUResult (0) or ReadData (1) as Result; the ALU also outputs ALUFlags. Control values: PCSrc = 0, RegWrite = 1, ImmSrc = 0, ALUControl = varies, MemWrite = 0, MemtoReg = 0]

Single-Cycle Datapath: Data-processing

ADD Rd, Rn, Rm

With register Src2:
•  Read from Rn and Rm (instead of Imm8)
•  Write ALUResult to register file
•  Write to Rd

[Figure: a multiplexer controlled by RegSrc selects Instr[3:0] (Rm) for RA2, and a multiplexer controlled by ALUSrc selects RD2 (0) or ExtImm (1) as SrcB. Control values: PCSrc = 0, RegWrite = 1, ImmSrc = X, ALUSrc = 0, ALUControl = varies, MemWrite = 0, MemtoReg = 0, RegSrc = 00]

Single-Cycle Datapath: B

B Label

Calculate branch target address:
   BTA = (ExtImm) + (PC + 8)
   ExtImm = Imm24 << 2 and sign-extended

[Figure: Instr[23:0] feeds the Extend unit; a multiplexer controlled by RegSrc selects the constant 15 for RA1 so that R15 (PCPlus8) is read as SrcA, and the ALU sum is routed to PC']
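The branch target calculation above can be checked with a short sketch. The function name `branch_target` and the 32-bit wraparound are illustrative assumptions; the arithmetic follows BTA = ExtImm + (PC + 8) with ExtImm = Imm24 << 2, sign-extended.

```python
def branch_target(pc, imm24):
    """Return BTA = ExtImm + (PC + 8), where ExtImm = sign_extend(Imm24 << 2)."""
    shifted = (imm24 & 0xFFFFFF) << 2      # 26-bit value after the shift
    if shifted & (1 << 25):                # sign bit of the 26-bit field
        shifted -= 1 << 26                 # sign-extend to a negative offset
    return (pc + 8 + shifted) & 0xFFFFFFFF # wrap to 32 bits


forward = branch_target(0x20, 0x00000F)    # offset +0x3C
backward = branch_target(0x100, 0xFFFFFE)  # offset -8, branches to itself
```

A branch at address 0x20 with Imm24 = 0xF lands at 0x64, which matches the "B 3C" example used in the control hazard slides later in the chapter.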

Single-Cycle ARM Processor

[Figure: complete single-cycle datapath with Control Unit. Control Unit inputs: Cond (Instr[31:28]), Op (Instr[27:26]), Funct (Instr[25:20]), Rd (Instr[15:12]), and Flags (ALUFlags). Control Unit outputs: PCSrc, MemtoReg, MemWrite, ALUControl, ALUSrc, ImmSrc, RegWrite, RegSrc]

Example: ORR

Review: Processor Performance

Program Execution Time = (#instructions)(cycles/instruction)(seconds/cycle)
                       = #instructions × CPI × Tc

Single-Cycle Performance

Tc limited by critical path (LDR)

Single-Cycle Performance

•  Single-cycle critical path:
   Tc1 = tpcq_PC + tmem + tdec + max[tmux + tRFread, tsext + tmux] + tALU + tmem + tmux + tRFsetup
•  Typically, limiting paths are:
   –  memory, ALU, register file
   –  Tc1 = tpcq_PC + 2tmem + tdec + tRFread + tALU + 2tmux + tRFsetup

Single-Cycle Performance Example

Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq_PC     40
Register setup        tsetup      50
Multiplexer           tmux        25
ALU                   tALU        120
Decoder               tdec        70
Memory read           tmem        200
Register file read    tRFread     100
Register file setup   tRFsetup    60

Tc1 = tpcq_PC + 2tmem + tdec + tRFread + tALU + 2tmux + tRFsetup
    = [40 + 2(200) + 70 + 100 + 120 + 2(25) + 60] ps
    = 840 ps

Program with 100 billion instructions:
Execution Time = #instructions × CPI × Tc
               = (100 × 10^9)(1)(840 × 10^-12 s)
               = 84 seconds
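The single-cycle timing example can be reproduced numerically. This is a small sketch using the delays from the table; the dictionary `DELAYS_PS` and the function names are illustrative choices, not part of the original slides.

```python
# Element delays in picoseconds, taken from the table above.
DELAYS_PS = {
    "tpcq_PC": 40, "tsetup": 50, "tmux": 25, "tALU": 120,
    "tdec": 70, "tmem": 200, "tRFread": 100, "tRFsetup": 60,
}


def single_cycle_period_ps(d):
    """Tc1 = tpcq_PC + 2*tmem + tdec + tRFread + tALU + 2*tmux + tRFsetup."""
    return (d["tpcq_PC"] + 2 * d["tmem"] + d["tdec"] + d["tRFread"]
            + d["tALU"] + 2 * d["tmux"] + d["tRFsetup"])


def execution_time_s(n_instructions, cpi, tc_ps):
    """Execution time = #instructions x CPI x Tc, with Tc given in ps."""
    return n_instructions * cpi * tc_ps * 1e-12


tc1 = single_cycle_period_ps(DELAYS_PS)   # clock period in ps
t1 = execution_time_s(100e9, 1, tc1)      # 100 billion instructions, CPI = 1
```

With these delays the two memory reads account for 400 ps of the 840 ps period, which is why the slides name memory among the limiting paths.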

Multicycle ARM Processor

•  Single-cycle:
   + simple
   -  cycle time limited by longest instruction (LDR)
   -  separate memories for instruction and data
   -  3 adders/ALUs
•  Multicycle processor addresses these issues by breaking instruction into shorter steps
   o  shorter instructions take fewer steps
   o  can re-use hardware
   o  cycle time is faster
•  Multicycle:
   + higher clock speed
   + simpler instructions run faster
   + reuse expensive hardware on multiple cycles
   -  sequencing overhead paid many times

Same design steps as single-cycle:
•  first datapath
•  then control

Multicycle State Elements

Replace Instruction and Data memories with a single unified memory (more realistic)

Multicycle Datapath: Instruction Fetch

LDR Rd, [Rn, imm12]

STEP 1: Fetch instruction

Multicycle Datapath: LDR Register Read

STEP 2: Read source operands from RF

Multicycle Datapath: LDR Address

STEP 3: Compute the memory address

Multicycle Datapath: LDR Memory Read

STEP 4: Read data from memory

Multicycle Datapath: LDR Write Register

STEP 5: Write data back to register file

Multicycle Datapath: Increment PC

STEP 6: Increment PC

Multicycle Datapath: Access to PC

PC can be read/written by instruction
•  Read: R15 (PC+8) available in Register File
•  Write: Be able to write result of instruction to PC

Multicycle Datapath: Read to PC (R15)

Example: ADD R1, R15, R2
•  R15 needs to be read as PC+8 from Register File (RF) in 2nd step
•  So (also in 2nd step) PC+8 is produced by ALU and routed to R15 input of RF
   –  SrcA = PC (which was already updated in step 1 to PC+4)
   –  SrcB = 4
   –  ALUResult = PC + 8
•  ALUResult is fed to R15 input port of RF in 2nd step (which is then routed to RD1 output of RF)

Multicycle Datapath: Write to PC (R15)

Example: SUB R15, R8, R3
•  Result of instruction needs to be written to the PC register
•  ALUResult already routed to the PC register, just assert PCWrite

Multicycle Datapath: STR

Write data in Rd to memory

Multicycle Datapath: Data-processing

With immediate addressing (i.e., an immediate Src2), no additional changes needed for datapath

With register addressing (register Src2): Read from Rn and Rm

Multicycle Datapath: B

Calculate branch target address:
   BTA = (ExtImm) + (PC + 8)
   ExtImm = Imm24 << 2 and sign-extended

Multicycle ARM Processor

Main Controller FSM: Fetch

Main Controller FSM: Decode

Main Controller FSM: Address

Main Controller FSM: Read Memory

Main Controller FSM: LDR

Main Controller FSM: STR

Main Controller FSM: Data-processing

Multicycle Controller FSM

Multicycle Processor Performance

•  Instructions take different number of cycles:
   –  3 cycles: B
   –  4 cycles: DP, STR
   –  5 cycles: LDR
•  CPI is weighted average
•  SPECINT2000 benchmark:
   –  25% loads
   –  10% stores
   –  13% branches
   –  52% data-processing (R-type)

Average CPI = (0.13)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
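The weighted-average CPI can be recomputed from the instruction mix. This is a minimal sketch; the dictionary keys and the function name `average_cpi` are illustrative assumptions, while the fractions and cycle counts come from the slide.

```python
# Instruction mix (fraction of dynamic instructions) and cycles per class.
MIX = {"load": 0.25, "store": 0.10, "branch": 0.13, "data_processing": 0.52}
CYCLES = {"load": 5, "store": 4, "branch": 3, "data_processing": 4}


def average_cpi(mix, cycles):
    """Return the CPI as the mix-weighted average of per-class cycle counts."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(fraction * cycles[kind] for kind, fraction in mix.items())


cpi = average_cpi(MIX, CYCLES)
```

Because loads are both frequent and the slowest class, they contribute 1.25 of the 4.12 cycles per instruction.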

Multicycle Processor Performance

Multicycle critical path:
•  Assumptions:
   –  RF is faster than memory
   –  writing memory is faster than reading memory

Tc2 = tpcq + 2tmux + max(tALU + tmux, tmem) + tsetup

Multicycle Performance Example

Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq_PC     40
Register setup        tsetup      50
Multiplexer           tmux        25
ALU                   tALU        120
Decoder               tdec        70
Memory read           tmem        200
Register file read    tRFread     100
Register file setup   tRFsetup    60

Tc2 = tpcq + 2tmux + max[tALU + tmux, tmem] + tsetup
    = [40 + 2(25) + 200 + 50] ps
    = 340 ps

For a program with 100 billion instructions executing on a multicycle ARM processor:
   –  CPI = 4.12 cycles/instruction
   –  Clock cycle time: Tc2 = 340 ps

Execution Time = (#instructions) × CPI × Tc
               = (100 × 10^9)(4.12)(340 × 10^-12 s)
               = 140 seconds

This is slower than the single-cycle processor (84 sec.)
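The multicycle numbers, and the comparison with the single-cycle design, can be reproduced with a few lines. This sketch assumes the delays from the table and the 840 ps single-cycle period computed earlier in the chapter; the function and variable names are illustrative.

```python
def multicycle_period_ps(tpcq, tmux, t_alu, tmem, tsetup):
    """Tc2 = tpcq + 2*tmux + max(tALU + tmux, tmem) + tsetup, all in ps."""
    return tpcq + 2 * tmux + max(t_alu + tmux, tmem) + tsetup


N_INSTRUCTIONS = 100e9

tc2 = multicycle_period_ps(tpcq=40, tmux=25, t_alu=120, tmem=200, tsetup=50)
multicycle_s = N_INSTRUCTIONS * 4.12 * tc2 * 1e-12   # CPI = 4.12
single_cycle_s = N_INSTRUCTIONS * 1 * 840 * 1e-12    # CPI = 1, Tc1 = 840 ps
slowdown = multicycle_s / single_cycle_s
```

The clock period drops by a factor of about 2.5 (840 ps to 340 ps), but the CPI rises by a factor of 4.12, so the multicycle design comes out slower overall.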

Review: Single-Cycle ARM Processor

[Figure: complete single-cycle datapath with Control Unit, repeated from the single-cycle section]

Review: Multicycle ARM Processor

Pipelined ARM Processor

•  Temporal parallelism
•  Divide single-cycle processor into 5 stages:
   –  Fetch
   –  Decode
   –  Execute
   –  Memory
   –  Writeback
•  Add pipeline registers between stages

Single-Cycle vs. Pipelined

[Figure: two timing diagrams on a time axis from 0 to 1500 ps. Single-Cycle: each instruction passes through Fetch Instruction, Dec Read Reg, Execute ALU, Memory Read/Write, and Wr Reg before the next instruction begins. Pipelined: the same five stages of successive instructions overlap in time]

Pipelined Processor Abstraction

[Figure: pipeline diagram over cycles 1 to 10; each instruction below passes through IM, RF (read), ALU, DM, and RF (write), starting one cycle after the previous instruction]

LDR R2, [R0, #40]
ADD R3, R9, R10
SUB R4, R1, R5
AND R5, R12, R13
STR R6, [R1, #20]
ORR R7, R11, #42

Single-Cycle & Pipelined Datapath

[Figure: the single-cycle datapath and its pipelined version. Four pipeline registers divide the pipelined datapath into Fetch, Decode, Execute, Memory, and Writeback stages; signal names carry a stage suffix (PCF, PCPlus4F, InstrF, InstrD, RA1D, RA2D, WA3D, SrcAE, SrcBE, ExtImmE, ALUResultE, WriteDataE, ALUOutM, ALUOutW, ReadDataW, ResultW)]

Corrected Pipelined Datapath

•  WA3 must arrive at same time as Result
•  Register file written on falling edge of CLK

[Figure: pipelined datapath in which the write address is carried through the pipeline registers as WA3D, WA3E, WA3M, and WA3W]

Optimized Pipelined Datapath

Remove adder by using PCPlus4F after PC has been updated to PC+4

[Figure: pipelined datapath with and without the optimization; in the optimized version PCPlus4F supplies PCPlus8D to the R15 input of the Register File, and the separate PCPlus8 adder is removed]

Pipelined Processor Control

•  Same control unit as single-cycle processor
•  Control delayed to proper pipeline stage

[Figure: pipelined datapath with Control Unit and Cond Unit. Decode-stage control signals (PCSrcD, BranchD, RegWriteD, MemWriteD, MemtoRegD, ALUControlD, ALUSrcD, FlagWriteD, ImmSrcD, RegSrcD) are carried through the pipeline registers to later stages: PCSrcE/M/W, RegWriteE/M/W, MemtoRegE/M/W, MemWriteE/M, ALUControlE, ALUSrcE, BranchE, FlagWriteE. The Cond Unit takes CondE, FlagsE, and ALUFlags and produces CondExE and Flags']

Pipeline Hazards

•  When an instruction depends on result from instruction that hasn't completed
•  Types:
   –  Data hazard: register value not yet written back to register file
   –  Control hazard: next instruction not decided yet (caused by branch)

Data Hazard

[Figure: pipeline diagram over cycles 1 to 8; ADD writes R1, and the following AND, ORR, and SUB read R1]

ADD R1, R4, R5
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7

Handling Data Hazards

•  Insert NOPs in code at compile time
•  Rearrange code at compile time
•  Forward data at run time
•  Stall the processor at run time

Compile-Time Hazard Elimination

•  Insert enough NOPs for result to be ready
•  Or move independent useful instructions forward

[Figure: pipeline diagram over cycles 1 to 10 with two NOPs inserted between ADD and AND]

ADD R1, R4, R5
NOP
NOP
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7

Data Forwarding

•  Check if register read in Execute stage matches register written in Memory or Writeback stage
•  If so, forward result

[Figure: pipeline diagram over cycles 1 to 8 for the sequence below, with R1 forwarded from ADD to the instructions that follow]

ADD R1, R4, R5
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7

[Figure: pipelined processor with Hazard Unit. Hazard Unit inputs: Match, RegWriteM, RegWriteW. Outputs: ForwardAE and ForwardBE, which select the three-input (00, 01, 10) forwarding multiplexers in front of SrcAE and SrcBE]

Data Forwarding

•  Execute stage register matches Memory stage register?
      Match_1E_M = (RA1E == WA3M)
      Match_2E_M = (RA2E == WA3M)
•  Execute stage register matches Writeback stage register?
      Match_1E_W = (RA1E == WA3W)
      Match_2E_W = (RA2E == WA3W)
•  If it matches, forward result:
      if      (Match_1E_M • RegWriteM) ForwardAE = 10;
      else if (Match_1E_W • RegWriteW) ForwardAE = 01;
      else                             ForwardAE = 00;

ForwardBE same but with Match_2E
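The forwarding equations above translate directly into a small function. This is a behavioral sketch only (real hardware would be written in an HDL); the function name `forward_select` and its parameter names are illustrative, and register numbers are plain integers.

```python
def forward_select(ra_e, wa3_m, wa3_w, reg_write_m, reg_write_w):
    """Return the 2-bit forwarding select for one Execute-stage source.

    ra_e         -- source register number in Execute (RA1E or RA2E)
    wa3_m, wa3_w -- destination register numbers in Memory and Writeback
    reg_write_*  -- whether the instruction in that stage writes a register
    """
    if ra_e == wa3_m and reg_write_m:
        return 0b10        # forward from the Memory stage (newest value wins)
    if ra_e == wa3_w and reg_write_w:
        return 0b01        # forward from the Writeback stage
    return 0b00            # no hazard: use the register file output


# AND R8, R1, R3 in Execute while ADD R1, R4, R5 is in Memory:
forward_ae = forward_select(ra_e=1, wa3_m=1, wa3_w=0,
                            reg_write_m=True, reg_write_w=False)
forward_be = forward_select(ra_e=3, wa3_m=1, wa3_w=0,
                            reg_write_m=True, reg_write_w=False)
```

Checking the Memory stage first matters: if both later stages write the same register, the Memory-stage value is the more recent one.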

Stalling

[Figure: pipeline diagram over cycles 1 to 8 for the sequence below; the data loaded by LDR into R1 is not available in time to forward to the AND that follows ("Trouble!")]

LDR R1, [R4, #40]
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7

[Figure: the same sequence over cycles 1 to 9 with a one-cycle stall; AND repeats its Decode stage and ORR repeats its Fetch stage]

Stalling Hardware

[Figure: pipelined processor with Hazard Unit. Added Hazard Unit input: MemtoRegE. Added outputs: StallF and StallD, which drive the enable (EN) inputs of the Fetch and Decode pipeline registers, and FlushD and FlushE, which drive the clear (CLR) inputs of the Decode and Execute pipeline registers]

Stalling Logic

•  Is either source register in the Decode stage the same as the one being written in the Execute stage?
      Match_12D_E = (RA1D == WA3E) + (RA2D == WA3E)
•  Is a LDR in the Execute stage AND Match_12D_E?
      ldrstall = Match_12D_E • MemtoRegE
      StallF = StallD = FlushE = ldrstall
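The load-use stall condition above can be expressed as a short behavioral sketch. The function name `ldr_stall` and the integer register numbers are assumptions for illustration; the logic follows the two equations on the slide.

```python
def ldr_stall(ra1_d, ra2_d, wa3_e, mem_to_reg_e):
    """Return True when the Decode-stage instruction must wait for a load.

    ra1_d, ra2_d -- source register numbers in Decode (RA1D, RA2D)
    wa3_e        -- destination register number in Execute (WA3E)
    mem_to_reg_e -- True when the Execute-stage instruction is an LDR
    """
    match_12d_e = (ra1_d == wa3_e) or (ra2_d == wa3_e)
    return match_12d_e and mem_to_reg_e


# LDR R1, [R4, #40] in Execute, AND R8, R1, R3 in Decode: stall one cycle.
stall = ldr_stall(ra1_d=1, ra2_d=3, wa3_e=1, mem_to_reg_e=True)
stall_f = stall_d = flush_e = stall
```

When the Execute-stage instruction is an ADD rather than an LDR, `mem_to_reg_e` is False and forwarding alone resolves the dependency.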

Control Hazards

•  B:
   –  branch not determined until the Writeback stage of pipeline
   –  Instructions after branch fetched before branch occurs
   –  These 4 instructions must be flushed if branch happens
•  Writes to PC (R15) similar

Control Hazards

[Figure: pipeline diagram over cycles 1 to 10; the four instructions fetched after the branch are flushed ("Flush these instructions") before the branch target is fetched]

20   B 3C
24   AND R8, R1, R3
28   ORR R9, R6, R1
2C   SUB R10, R1, R7
30   SUB R11, R1, R8
34   ...
64   ADD R12, R3, R4

Branch misprediction penalty
•  number of instructions flushed when branch is taken (4)
•  May be reduced by determining BTA earlier

Early Branch Resolution

•  Determine BTA in Execute stage
   –  Branch misprediction penalty = 2 cycles
•  Hardware changes
   –  Add a branch multiplexer before PC register to select BTA from ALUResultE
   –  Add BranchTakenE select signal for this multiplexer (only asserted if branch condition satisfied)
   –  PCSrcW now only asserted for writes to PC

Pipelined processor with Early BTA

[Figure: pipelined processor with Hazard Unit; a branch multiplexer before the PC register, selected by BranchTakenE from the Cond Unit, chooses ALUResultE as the next PC]


Control Hazards with Early BTA

[Figure: pipeline timing diagram, cycles 1 to 10. B 3C at address 20 is followed by AND R8, R1, R3 (address 24) and ORR R9, R6, R1 (28), which are marked "Flush these instructions"; SUB R10, R1, R7 (2C) and SUB R11, R1, R8 (30) are listed but do not enter the pipeline. ADD R12, R3, R4 at the branch target, address 64, is fetched afterward.]


Control Stalling Logic

•  PCWrPendingF = 1 if write to PC in Decode, Execute or Memory
      PCWrPendingF = PCSrcD + PCSrcE + PCSrcM
•  Stall Fetch if PCWrPendingF
      StallF = ldrStallD + PCWrPendingF
•  Flush Decode if PCWrPendingF OR PC is written in Writeback OR branch is taken
      FlushD = PCWrPendingF + PCSrcW + BranchTakenE
•  Flush Execute if branch is taken
      FlushE = ldrStallD + BranchTakenE
•  Stall Decode if ldrStallD (as before)
      StallD = ldrStallD
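The control equations on this slide translate directly into boolean expressions. The sketch below is an illustrative software model only, assuming each signal is a Python boolean; the function name `hazard_controls` and the dictionary return value are hypothetical conveniences, not part of the hardware design.

```python
def hazard_controls(ldr_stall_d, pcsrc_d, pcsrc_e, pcsrc_m, pcsrc_w, branch_taken_e):
    """Compute the stall and flush signals from the slide's equations."""
    # A write to the PC is pending in Decode, Execute, or Memory
    pc_wr_pending_f = pcsrc_d or pcsrc_e or pcsrc_m
    return {
        "StallF": ldr_stall_d or pc_wr_pending_f,
        "StallD": ldr_stall_d,
        "FlushD": pc_wr_pending_f or pcsrc_w or branch_taken_e,
        "FlushE": ldr_stall_d or branch_taken_e,
    }


# A taken branch in Execute with no load-use stall and no pending PC write
signals = hazard_controls(False, False, False, False, False, True)
print(signals)
```

For a taken branch, only the two flush signals are asserted, which matches the 2-cycle misprediction penalty of early branch resolution.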


ARM Pipelined Processor with Hazard Unit

[Figure: complete pipelined datapath and control. Shows the Instruction Memory, Register File, Extend unit, ALU, Data Memory, Control Unit, Cond Unit, and Hazard Unit. The Hazard Unit takes the Match, RegWriteM, RegWriteW, MemtoRegE, PCSrcD, PCSrcE, PCSrcM, PCSrcW, and BranchTakenE signals and drives StallF, StallD, FlushD, FlushE, ForwardAE, and ForwardBE.]


Pipelined Performance Example

•  SPECINT2000 benchmark:
   –  25% loads
   –  10% stores
   –  13% branches
   –  52% R-type
•  Suppose:
   –  40% of loads used by next instruction
   –  50% of branches mispredicted
•  What is the average CPI?
   –  Load CPI = 1 when not stalling, 2 when stalling
      So, CPI_lw = 1(0.6) + 2(0.4) = 1.4
   –  Branch CPI = 1 when not stalling, 3 when stalling
      So, CPI_beq = 1(0.5) + 3(0.5) = 2

Average CPI = (0.25)(1.4) + (0.1)(1) + (0.13)(2) + (0.52)(1) = 1.23
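The weighted-average calculation can be reproduced with a short script. This is a sketch under the slide's stated assumptions (instruction mix, 40% load-use stalls, 50% mispredicted branches, stall penalties of 1 and 2 cycles); the function and parameter names are illustrative.

```python
def average_cpi(load_use_fraction=0.40, mispredict_fraction=0.50):
    """Average CPI for the SPECINT2000 instruction mix given on the slide."""
    mix = {"load": 0.25, "store": 0.10, "branch": 0.13, "r_type": 0.52}
    cpi = {
        # Loads take 2 cycles when the next instruction uses the result
        "load": 1 * (1 - load_use_fraction) + 2 * load_use_fraction,
        "store": 1.0,
        # Branches take 3 cycles when mispredicted (2 flushed instructions)
        "branch": 1 * (1 - mispredict_fraction) + 3 * mispredict_fraction,
        "r_type": 1.0,
    }
    return sum(mix[kind] * cpi[kind] for kind in mix)


print(round(average_cpi(), 2))  # 1.23
```

Calling `average_cpi(0.0, 0.0)` gives the ideal CPI of 1, which shows that all of the extra 0.23 comes from load-use stalls and branch flushes.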


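Pipelined Performance

•  Pipelined processor critical path:

$$
T_{c3} = \max \left[ \begin{array}{ll}
t_{pcq} + t_{mem} + t_{setup} & \text{Fetch} \\
2(t_{RFread} + t_{setup}) & \text{Decode} \\
t_{pcq} + 2t_{mux} + t_{ALU} + t_{setup} & \text{Execute} \\
t_{pcq} + t_{mem} + t_{setup} & \text{Memory} \\
2(t_{pcq} + t_{mux} + t_{RFwrite}) & \text{Writeback}
\end{array} \right]
$$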


Pipelined Performance Example

Element                 Parameter    Delay (ps)
Register clock-to-Q     t_pcq_PC     40
Register setup          t_setup      50
Multiplexer             t_mux        25
ALU                     t_ALU        120
Memory read             t_mem        200
Register file read      t_RFread     100
Register file setup     t_RFsetup    60
Register file write     t_RFwrite    70

Cycle time: $T_{c3} = 2(t_{RFread} + t_{setup}) = 2[100 + 50]\ \text{ps} = 300\ \text{ps}$
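To see why the Decode stage sets the cycle time, each stage delay from the critical-path expression can be evaluated with the table's values. The sketch below is illustrative; the dictionary keys are shorthand for the table parameters, and the clock-to-Q value of 40 ps is used for every t_pcq term.

```python
# Element delays in picoseconds, taken from the table
delays_ps = {
    "t_pcq": 40, "t_setup": 50, "t_mux": 25, "t_ALU": 120,
    "t_mem": 200, "t_RFread": 100, "t_RFsetup": 60, "t_RFwrite": 70,
}


def stage_delays(t):
    """Per-stage delay terms of the pipelined critical-path expression."""
    return {
        "Fetch": t["t_pcq"] + t["t_mem"] + t["t_setup"],
        "Decode": 2 * (t["t_RFread"] + t["t_setup"]),
        "Execute": t["t_pcq"] + 2 * t["t_mux"] + t["t_ALU"] + t["t_setup"],
        "Memory": t["t_pcq"] + t["t_mem"] + t["t_setup"],
        "Writeback": 2 * (t["t_pcq"] + t["t_mux"] + t["t_RFwrite"]),
    }


stages = stage_delays(delays_ps)
slowest = max(stages, key=stages.get)
print(stages)
print(slowest, stages[slowest])  # Decode 300
```

Fetch and Memory come close at 290 ps each, so speeding up the register file read alone would reduce the cycle time by at most 10 ps.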


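Pipelined Performance Example

Program with 100 billion instructions

$$
\begin{aligned}
\text{Execution Time} &= (\#\,\text{instructions}) \times \text{CPI} \times T_c \\
&= (100 \times 10^{9})(1.23)(300 \times 10^{-12}) \\
&= 36.9\ \text{seconds}
\end{aligned}
$$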


Processor Performance Comparison

Processor       Execution Time (seconds)    Speedup (single-cycle as baseline)
Single-cycle    84                          1
Multicycle      140                         0.6
Pipelined       36.9                        2.28
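The pipelined execution time and the speedup column can be checked numerically. This sketch takes the single-cycle and multicycle times (84 s and 140 s) as given in the table and recomputes only the pipelined entry; the variable names are illustrative.

```python
def execution_time(instructions, cpi, cycle_time_s):
    """Execution time = (# instructions) x CPI x (seconds per cycle)."""
    return instructions * cpi * cycle_time_s


# 100 billion instructions, CPI = 1.23, Tc = 300 ps
pipelined_s = execution_time(100e9, 1.23, 300e-12)

times_s = {"Single-cycle": 84.0, "Multicycle": 140.0, "Pipelined": pipelined_s}
# Speedup is measured relative to the single-cycle processor
speedup = {name: times_s["Single-cycle"] / t for name, t in times_s.items()}

for name in times_s:
    print(f"{name:13s} {times_s[name]:6.1f} s  speedup {speedup[name]:.2f}")
```

The pipelined speedup of about 2.28 is well below the ideal factor of 5 for a five-stage pipeline, because of the hazard-induced CPI above 1 and the unbalanced stage delays.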


Advanced Microarchitecture

•  Deep Pipelining
•  Micro-operations
•  Branch Prediction
•  Superscalar Processors
•  Out of Order Processors
•  Register Renaming
•  SIMD
•  Multithreading
•  Multiprocessors


Deep Pipelining

•  10-20 stages typical
•  Number of stages limited by:
   –  Pipeline hazards
   –  Sequencing overhead
   –  Power
   –  Cost


Micro-operations

•  Decompose more complex instructions into a series of simple instructions called micro-operations (micro-ops or µ-ops)
•  At run-time, complex instructions are decoded into one or more micro-ops
•  Used heavily in CISC (complex instruction set computer) architectures (e.g., x86)
•  Used for some ARM instructions, for example:

   Complex Op            Micro-op Sequence
   LDR R1, [R2], #4      LDR R1, [R2]
                         ADD R2, R2, #4

Without µ-ops, would need 2nd write port on the register file


Micro-operations

•  Allow for dense code (fewer memory accesses)
•  Yet preserve simplicity of RISC hardware
•  ARM strikes balance by choosing instructions that:
   –  Give better code density than pure RISC instruction sets (such as MIPS)
   –  Enable more efficient decoding than CISC instruction sets (such as x86)


Branch Prediction

•  Guess whether branch will be taken
   –  Backward branches are usually taken (loops)
   –  Consider history to improve guess
•  Good prediction reduces fraction of branches requiring a flush


Branch Prediction

•  Ideal pipelined processor: CPI = 1
•  Branch misprediction increases CPI
•  Static branch prediction:
   –  Check direction of branch (forward or backward)
   –  If backward, predict taken
   –  Else, predict not taken
•  Dynamic branch prediction:
   –  Keep history of last several hundred (or thousand) branches in branch target buffer, record:
      •  Branch destination
      •  Whether branch was taken


Branch Prediction Example

        MOV R1, #0          ; R1 = sum
        MOV R0, #0          ; R0 = i
FOR                         ; for (i=0; i<10; i=i+1)
        CMP R0, #10
        BGE DONE
        ADD R1, R1, R0      ; sum = sum + i
        ADD R0, R0, #1      ; i = i + 1
        B   FOR
DONE


1-Bit Branch Predictor

•  Remembers whether branch was taken the last time and does the same thing
•  Mispredicts first and last branch of loop


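2-Bit Branch Predictor

Only mispredicts last branch of loop

[Figure: state transition diagram with four states: strongly taken (predict taken), weakly taken (predict taken), weakly not taken (predict not taken), and strongly not taken (predict not taken). Transitions between adjacent states are labeled with the branch outcome, taken or not taken.]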
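The difference between the two predictors shows up when the loop from the branch prediction example runs more than once. The sketch below is a simplified model, assuming the predicted branch is the conditional BGE DONE (not taken for ten iterations, then taken), that both predictors start in a not-taken state, and that the 2-bit state is encoded as 0 (strongly not taken) through 3 (strongly taken). The function names are illustrative.

```python
def run_1bit(outcomes, last_taken=False):
    """1-bit predictor: predict whatever the branch did last time."""
    mispredicts = 0
    for taken in outcomes:
        if taken != last_taken:
            mispredicts += 1
        last_taken = taken
    return mispredicts, last_taken


def run_2bit(outcomes, state=0):
    """2-bit saturating counter: states 2 and 3 predict taken."""
    mispredicts = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            mispredicts += 1
        # Move one state toward the observed outcome, saturating at 0 and 3
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredicts, state


# BGE DONE: not taken for i = 0..9, taken when i reaches 10
bge_outcomes = [False] * 10 + [True]

# First pass warms up each predictor; the second pass shows steady-state behavior
_, warm_1bit = run_1bit(bge_outcomes)
_, warm_2bit = run_2bit(bge_outcomes)
print(run_1bit(bge_outcomes, warm_1bit)[0])  # 2 (first and last branch)
print(run_2bit(bge_outcomes, warm_2bit)[0])  # 1 (last branch only)
```

The single taken outcome at loop exit only moves the 2-bit counter to "weakly not taken", so the first branch of the next pass is still predicted correctly.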


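Superscalar

•  Multiple copies of datapath execute multiple instructions at once
•  Dependencies make it tricky to issue multiple instructions at once

[Figure: two-way superscalar datapath. The PC addresses an Instruction Memory that supplies two instructions per cycle; the Register File has six read-address/write-address ports (A1 to A6), four read ports (RD1, RD2, RD4, RD5), and two write ports (WD3, WD6); two ALUs feed a Data Memory with two address ports (A1, A2), two write ports (WD1, WD2), and two read ports (RD1, RD2).]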


Superscalar Example

        LDR R8, [R0, #40]
        ADD R9, R1, R2
        SUB R10, R1, R3
        AND R11, R3, R4
        ORR R12, R1, R5
        STR R5, [R0, #80]

Ideal IPC: 2    Actual IPC: 2

[Figure: pipeline timing diagram, cycles 1 to 8. The instructions issue in pairs: LDR with ADD, SUB with AND, and ORR with STR.]


Superscalar with Dependencies

        LDR R8, [R0, #40]
        ADD R9, R8, R1
        SUB R8, R2, R3
        AND R10, R4, R8
        ORR R11, R5, R6
        STR R7, [R11, #80]

Ideal IPC: 2    Actual IPC: 6/5 = 1.2

[Figure: pipeline timing diagram, cycles 1 to 9. Dependencies between the instructions force a stall, so the six instructions take five issue cycles.]


Out of Order Processor

•  Looks ahead across multiple instructions
•  Issues as many instructions as possible at once
•  Issues instructions out of order (as long as no dependencies)
•  Dependencies:
   –  RAW (read after write): one instruction writes, later instruction reads a register
   –  WAR (write after read): one instruction reads, later instruction writes a register
   –  WAW (write after write): one instruction writes, later instruction writes a register
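The three dependency types can be detected by comparing the destination and source registers of two instructions. The sketch below is a simplified illustration, assuming each instruction is represented as a (destination, set of sources) pair with `None` as the destination of a store; the `hazards` function and the tuple encoding are hypothetical. The instructions are those of the six-instruction example program used in these slides.

```python
def hazards(earlier, later):
    """Return the set of dependency types from `earlier` to `later`."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    found = set()
    if e_dest is not None and e_dest in l_srcs:
        found.add("RAW")  # earlier writes, later reads
    if l_dest is not None and l_dest in e_srcs:
        found.add("WAR")  # earlier reads, later writes
    if e_dest is not None and e_dest == l_dest:
        found.add("WAW")  # both write the same register
    return found


LDR = ("R8", {"R0"})          # LDR R8, [R0, #40]
ADD = ("R9", {"R8", "R1"})    # ADD R9, R8, R1
SUB = ("R8", {"R2", "R3"})    # SUB R8, R2, R3
AND = ("R10", {"R4", "R8"})   # AND R10, R4, R8
ORR = ("R11", {"R5", "R6"})   # ORR R11, R5, R6
STR = (None, {"R7", "R11"})   # STR R7, [R11, #80]

print(hazards(LDR, ADD))  # {'RAW'}
print(hazards(ADD, SUB))  # {'WAR'}
print(hazards(LDR, SUB))  # {'WAW'}
print(hazards(ORR, STR))  # {'RAW'}
```

This pairwise check ignores intervening writes, so it is only meant for adjacent or nearby instruction pairs; a real scoreboard tracks the most recent writer of each register.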


Out of Order Processor

•  Instruction level parallelism (ILP): number of instructions that can be issued simultaneously (average < 3)
•  Scoreboard: table that keeps track of:
   –  Instructions waiting to issue
   –  Available functional units
   –  Dependencies


Out of Order Processor Example

        LDR R8, [R0, #40]
        ADD R9, R8, R1
        SUB R8, R2, R3
        AND R10, R4, R8
        ORR R11, R5, R6
        STR R7, [R11, #80]

Ideal IPC: 2    Actual IPC: 6/4 = 1.5

[Figure: pipeline timing diagram, cycles 1 to 8, with the instructions issued out of program order. Annotations: a two-cycle latency between the load and the use of R8; RAW dependency from LDR to ADD on R8; WAR dependency from ADD to SUB on R8; RAW dependency from SUB to AND on R8; RAW dependency from ORR to STR on R11.]


Register Renaming

        LDR R8, [R0, #40]
        ADD R9, R8, R1
        SUB R8, R2, R3
        AND R10, R4, R8
        ORR R11, R5, R6
        STR R7, [R11, #80]

Ideal IPC: 2    Actual IPC: 6/3 = 2

[Figure: pipeline timing diagram, cycles 1 to 7. The destination of SUB is renamed from R8 to T0, giving SUB T0, R2, R3 and AND R10, R4, T0, which removes the WAR dependency on R8. Remaining annotations: a 2-cycle RAW dependency from LDR to ADD on R8, a RAW dependency from SUB to AND on T0, and a RAW dependency from ORR to STR on R11.]
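A minimal sketch of the renaming idea follows. It is a simplification, not how hardware renaming is actually built: it assumes an unlimited supply of temporary names T0, T1, ..., and it renames a destination only when that register was already read or written earlier in the window (the cases that would create a WAR or WAW dependency). The `rename` function and the tuple encoding of instructions are hypothetical.

```python
def rename(program):
    """program: list of (op, dest, sources); dest is None for stores."""
    mapping = {}     # architectural register -> current name
    touched = set()  # architectural registers read or written so far
    temps = 0
    renamed = []
    for op, dest, sources in program:
        # Sources read the most recent name of each register
        new_sources = [mapping.get(src, src) for src in sources]
        touched.update(sources)
        new_dest = dest
        if dest is not None:
            if dest in touched:
                # Writing a register used earlier: give it a fresh name
                new_dest = f"T{temps}"
                temps += 1
            mapping[dest] = new_dest
            touched.add(dest)
        renamed.append((op, new_dest, new_sources))
    return renamed


program = [
    ("LDR", "R8", ["R0"]),
    ("ADD", "R9", ["R8", "R1"]),
    ("SUB", "R8", ["R2", "R3"]),
    ("AND", "R10", ["R4", "R8"]),
    ("ORR", "R11", ["R5", "R6"]),
    ("STR", None, ["R7", "R11"]),
]

for instr in rename(program):
    print(instr)
```

On the example program, only SUB's destination is renamed (R8 becomes T0), and AND then reads T0, matching the renamed instructions in the figure.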


SIMD

•  Single Instruction Multiple Data (SIMD)
   –  Single instruction acts on multiple pieces of data at once
   –  Common application: graphics
   –  Perform short arithmetic operations (also called packed arithmetic)
•  For example, add eight 8-bit elements

[Figure: 64-bit registers D0 = (a7 ... a0) and D1 = (b7 ... b0) are added element by element into D2 = (a7 + b7 ... a0 + b0). Element 0 occupies bit positions 0 to 7, element 1 bits 8 to 15, and so on up to element 7 in bits 56 to 63.]
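The packed addition of eight 8-bit elements can be modeled on ordinary integers. This is an illustrative sketch that treats each 64-bit register as a Python integer and assumes each lane wraps around modulo 256 on overflow, with no carry into the neighboring lane; the function name `packed_add_u8` is hypothetical.

```python
def packed_add_u8(d0, d1):
    """Add eight 8-bit lanes of two 64-bit values independently."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        a = (d0 >> shift) & 0xFF
        b = (d1 >> shift) & 0xFF
        # Keep only the low 8 bits so a carry never reaches the next lane
        result |= ((a + b) & 0xFF) << shift
    return result


d2 = packed_add_u8(0x0102030405060708, 0x1010101010101010)
print(hex(d2))  # 0x1112131415161718
```

An ordinary 64-bit add would give the same answer here, but the two differ as soon as one lane overflows: `packed_add_u8(0xFF, 0x01)` is 0, whereas a scalar add would carry into the second lane.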


Advanced Architecture Techniques

•  Multithreading
   –  Word processor: thread for typing, spell checking, printing
•  Multiprocessors
   –  Multiple processors (cores) on a single chip


Threading: Definitions

•  Process: program running on a computer
   –  Multiple processes can run at once: e.g., surfing Web, playing music, writing a paper
•  Thread: part of a program
   –  Each process has multiple threads: e.g., a word processor may have threads for typing, spell checking, printing


Threads in Conventional Processor

•  One thread runs at once
•  When one thread stalls (for example, waiting for memory):
   –  Architectural state of that thread stored
   –  Architectural state of waiting thread loaded into processor and it runs
   –  Called context switching
•  Appears to user like all threads running simultaneously


Multithreading

•  Multiple copies of architectural state
•  Multiple threads active at once:
   –  When one thread stalls, another runs immediately
   –  If one thread can't keep all execution units busy, another thread can use them
•  Does not increase instruction-level parallelism (ILP) of single thread, but increases throughput

Intel calls this "hyperthreading"


Multiprocessors

•  Multiple processors (cores) with a method of communication between them
•  Types:
   –  Homogeneous: multiple cores with shared main memory
   –  Heterogeneous: separate cores for different tasks (for example, DSP and CPU in cell phone)
   –  Clusters: each core has own memory system


Other Resources

•  Patterson & Hennessy's: Computer Architecture: A Quantitative Approach
•  Conferences:
   –  www.cs.wisc.edu/~arch/www/
   –  ISCA (International Symposium on Computer Architecture)
   –  HPCA (International Symposium on High Performance Computer Architecture)
