10/5/17
CS61C: Great Ideas in Computer Architecture
Lecture 12: Control & Operating Speed
Krste Asanović & Randy Katz
http://inst.eecs.berkeley.edu/~cs61c/fa17
Agenda
• Finish Single-Cycle RISC-V Datapath
• Controller
• Instruction Timing
• Performance Measures
• Introduction to Pipelining
• Pipelined RISC-V Datapath
• And in Conclusion, ...
Recap: Adding branches to datapath
[Figure: single-cycle datapath with IMEM, Reg[], Imm. Gen, Branch Comp., ALU, DMEM, and the +4 adder for the pc. Control signals: ImmSel, RegWEn, BrUn, BrEq, BrLT, ASel, BSel, ALUSel, MemRW, WBSel, PCSel.]
Implementing JALR Instruction (I-Format)
• JALR rd, rs1, immediate
− Writes PC+4 to Reg[rd] (return address)
− Sets PC = Reg[rs1] + immediate
− Uses same immediates as arithmetic and loads
§ no multiplication by 2 bytes
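The jalr behaviour listed above can be sketched in a few lines of Python. This is only an illustrative model, not the course's reference implementation: the names `exec_jalr`, `sign_extend`, and the register-file-as-list representation are assumptions made for this example. Clearing the least significant bit of the target comes from the RISC-V specification rather than from the slide.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def exec_jalr(pc, regs, rd, rs1, imm12):
    """Return (new_pc, new_regs) for `jalr rd, rs1, imm` (hypothetical helper)."""
    # Compute the target before writing rd, so rd == rs1 works correctly.
    target = (regs[rs1] + sign_extend(imm12, 12)) & MASK32
    target &= ~1 & MASK32            # RISC-V spec: clear the lowest bit
    new_regs = list(regs)
    if rd != 0:                      # x0 is hard-wired to zero
        new_regs[rd] = (pc + 4) & MASK32
    return target, new_regs


regs = [0] * 32
regs[1] = 0x1000
new_pc, regs_after = exec_jalr(0x400, regs, rd=5, rs1=1, imm12=0xFFC)  # imm = -4
```

Note that the immediate is added as a byte offset without shifting, which is the "no multiplication by 2 bytes" point above.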
Adding jalr to datapath
[Figure: single-cycle datapath extended for jalr. Blocks: IMEM, Reg[], Imm. Gen, Branch Comp., ALU, DMEM, +4 adder. The write-back mux selects among mem, alu, and pc+4.]
Control settings for jalr: ImmSel=I, RegWEn=1, BrUn=*, BrEq=*, BrLT=*, ASel=0, BSel=1, ALUSel=Add, MemRW=Read, WBSel=2, PCSel=alu
Implementing jal Instruction
• JAL saves PC+4 in Reg[rd] (the return address)
• Set PC = PC + offset (PC-relative jump)
• Target somewhere within ±2^19 locations, 2 bytes apart
− ±2^18 32-bit instructions
• Immediate encoding optimized similarly to branch instruction to reduce hardware cost
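To make the jump range concrete, the sketch below decodes the J-type immediate from a 32-bit instruction word and states the reachable range in bytes. The bit layout (imm[20|10:1|11|19:12] in inst[31:12]) follows the RISC-V specification; the function name `j_immediate` and the constants are assumptions made for this example.

```python
def j_immediate(inst):
    """Decode the sign-extended byte offset of a jal instruction word."""
    imm = (
        (((inst >> 31) & 0x1) << 20)      # imm[20]    from inst[31]
        | (((inst >> 12) & 0xFF) << 12)   # imm[19:12] from inst[19:12]
        | (((inst >> 20) & 0x1) << 11)    # imm[11]    from inst[20]
        | (((inst >> 21) & 0x3FF) << 1)   # imm[10:1]  from inst[30:21]
    )
    # Bit 0 is always zero; sign-extend from 21 bits.
    return imm - (1 << 21) if imm & (1 << 20) else imm


# ±2^19 locations, 2 bytes apart, gives a byte range of about ±1 MiB.
J_MIN_OFFSET = -(1 << 20)
J_MAX_OFFSET = (1 << 20) - 2

forward = j_immediate(0x008000EF)   # jal ra, 8
backward = j_immediate(0xFFDFF06F)  # jal x0, -4
```

Because the lowest immediate bit is implicit, 20 encoded bits cover a 21-bit signed byte offset.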
Adding jal to datapath
[Figure: single-cycle datapath extended for jal. Blocks: IMEM, Reg[], Imm. Gen, Branch Comp., ALU, DMEM, +4 adder. The write-back mux selects among mem, alu, and pc+4.]
Control settings for jal: ImmSel=J, RegWEn=1, BrUn=*, BrEq=*, BrLT=*, ASel=1, BSel=1, ALUSel=Add, MemRW=Read, WBSel=2, PCSel=alu
“Upper Immediate” instructions
• Has 20-bit immediate in upper 20 bits of 32-bit instruction word
• One destination register, rd
• Used for two instructions
− LUI – Load Upper Immediate (add to zero)
− AUIPC – Add Upper Immediate to PC
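A minimal sketch of what the two upper-immediate instructions compute, assuming 32-bit wraparound arithmetic. The helper names `u_immediate`, `exec_lui`, and `exec_auipc` are hypothetical and exist only for this example.

```python
MASK32 = 0xFFFFFFFF


def u_immediate(inst):
    """U-type immediate: inst[31:12] kept in place, low 12 bits zero."""
    return inst & 0xFFFFF000


def exec_lui(inst):
    """Value written to rd by lui: the upper immediate added to zero."""
    return u_immediate(inst)


def exec_auipc(pc, inst):
    """Value written to rd by auipc: pc plus the upper immediate."""
    return (pc + u_immediate(inst)) & MASK32


lui_word = 0x123452B7               # lui x5, 0x12345
lui_result = exec_lui(lui_word)
auipc_result = exec_auipc(0x1000, 0x12345297)  # auipc x5, 0x12345 at pc = 0x1000
```

Both instructions share the same immediate generator setting (ImmSel=U); they differ only in whether the ALU passes the immediate through or adds it to the pc.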
Implementing lui
[Figure: single-cycle datapath with the lui path highlighted. Blocks: IMEM, Reg[], Imm. Gen, Branch Comp., ALU, DMEM, +4 adder.]
Control settings for lui: ImmSel=U, RegWEn=1, BrUn=*, BrEq=*, BrLT=*, ASel=*, BSel=1, ALUSel=B, MemRW=Read, WBSel=1, PCSel=pc+4
Implementing auipc
[Figure: single-cycle datapath with the auipc path highlighted. Blocks: IMEM, Reg[], Imm. Gen, Branch Comp., ALU, DMEM, +4 adder.]
Control settings for auipc: ImmSel=U, RegWEn=1, BrUn=*, BrEq=*, BrLT=*, ASel=1, BSel=1, ALUSel=Add, MemRW=0 (Read), WBSel=1, PCSel=pc+4
Recap: Complete RV32I ISA
[Figure: RV32I instruction encoding table, with some instructions marked “Not in CS61C”]
RV32I has 47 instructions total; 37 instructions are covered in CS61C.
Single-Cycle RISC-V RV32I Datapath
[Figure: complete single-cycle datapath with IMEM, Reg[], Imm. Gen, Branch Comp., ALU, DMEM, and the +4 adder. Control signals: ImmSel, RegWEn, BrUn, BrEq, BrLT, ASel, BSel, ALUSel, MemRW, WBSel, PCSel.]
Processor
[Figure: a processor (Control, plus a Datapath containing the PC, Registers, and the Arithmetic & Logic Unit (ALU)) connected to Memory (Program, Data, organized as Bytes) through the Processor-Memory Interface: Enable?, Read/Write, Address, Write Data, Read Data.]
Single-Cycle RISC-V RV32I Datapath
[Figure: the complete single-cycle datapath with a Control Logic block added. Control Logic takes inst[31:0], BrEq, and BrLT as inputs and drives ImmSel, RegWEn, BrUn, ASel, BSel, ALUSel, MemRW, WBSel, and PCSel.]
Control Logic Truth Table (incomplete)
Inst[31:0]  BrEq  BrLT  PCSel  ImmSel  BrUn  ASel  BSel  ALUSel  MemRW  RegWEn  WBSel
add         *     *     +4     *       *     Reg   Reg   Add     Read   1       ALU
sub         *     *     +4     *       *     Reg   Reg   Sub     Read   1       ALU
(R-R Op)    *     *     +4     *       *     Reg   Reg   (Op)    Read   1       ALU
addi        *     *     +4     I       *     Reg   Imm   Add     Read   1       ALU
lw          *     *     +4     I       *     Reg   Imm   Add     Read   1       Mem
sw          *     *     +4     S       *     Reg   Imm   Add     Write  0       *
beq         0     *     +4     B       *     PC    Imm   Add     Read   0       *
beq         1     *     ALU    B       *     PC    Imm   Add     Read   0       *
bne         0     *     ALU    B       *     PC    Imm   Add     Read   0       *
bne         1     *     +4     B       *     PC    Imm   Add     Read   0       *
blt         *     1     ALU    B       0     PC    Imm   Add     Read   0       *
bltu        *     1     ALU    B       1     PC    Imm   Add     Read   0       *
jalr        *     *     ALU    I       *     Reg   Imm   Add     Read   1       PC+4
jal         *     *     ALU    J       *     PC    Imm   Add     Read   1       PC+4
auipc       *     *     +4     U       *     PC    Imm   Add     Read   1       ALU
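The truth table can be read as a function from (instruction, BrEq, BrLT) to a control word. The sketch below encodes the rows shown, using `None` for the don't-care entries. The `Control` tuple and the `control` function are illustrative names. The not-taken rows for blt and bltu are not in the table; they are assumed here to select +4, by analogy with beq and bne.

```python
from typing import NamedTuple, Optional


class Control(NamedTuple):
    pc_sel: str
    imm_sel: Optional[str]
    br_un: Optional[int]
    a_sel: str
    b_sel: str
    alu_sel: str
    mem_rw: str
    reg_wen: int
    wb_sel: Optional[str]


X = None  # don't care ('*' in the table)

# Rows whose control word does not depend on the branch comparator.
_FIXED = {
    "add":   Control("+4", X, X, "Reg", "Reg", "Add", "Read", 1, "ALU"),
    "sub":   Control("+4", X, X, "Reg", "Reg", "Sub", "Read", 1, "ALU"),
    "addi":  Control("+4", "I", X, "Reg", "Imm", "Add", "Read", 1, "ALU"),
    "lw":    Control("+4", "I", X, "Reg", "Imm", "Add", "Read", 1, "Mem"),
    "sw":    Control("+4", "S", X, "Reg", "Imm", "Add", "Write", 0, X),
    "jalr":  Control("ALU", "I", X, "Reg", "Imm", "Add", "Read", 1, "PC+4"),
    "jal":   Control("ALU", "J", X, "PC", "Imm", "Add", "Read", 1, "PC+4"),
    "auipc": Control("+4", "U", X, "PC", "Imm", "Add", "Read", 1, "ALU"),
}
_BRANCH_UNSIGNED = {"beq": X, "bne": X, "blt": 0, "bltu": 1}


def control(inst, br_eq=False, br_lt=False):
    """Look up the control word for one of the instructions in the table."""
    if inst in _FIXED:
        return _FIXED[inst]
    if inst in _BRANCH_UNSIGNED:
        taken = {"beq": br_eq, "bne": not br_eq,
                 "blt": br_lt, "bltu": br_lt}[inst]
        # Assumption: a branch that is not taken falls through to pc+4.
        return Control("ALU" if taken else "+4", "B", _BRANCH_UNSIGNED[inst],
                       "PC", "Imm", "Add", "Read", 0, X)
    raise ValueError(f"instruction not in this sketch: {inst}")
```

Only PCSel depends on the comparator outputs; every other signal is determined by the instruction alone, which is why the table needs two rows per branch and one row for everything else.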
Control Realization Options
• ROM
− “Read-Only Memory”
− Regular structure
− Can be easily reprogrammed
§ fix errors
§ add instructions
− Popular when designing control logic manually
• Combinatorial Logic
− Today, chip designers use logic synthesis tools to convert truth tables to networks of gates
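RV32I, a nine-bit ISA!
[Figure: RV32I instruction encoding table with the inst[30], inst[14:12], and inst[6:2] fields highlighted; some instructions marked “Not in CS61C”]
Instruction type encoded using only 9 bits: inst[30], inst[14:12], inst[6:2]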
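ROM-based Control
[Figure: ROM block with address inputs on one side and control outputs on the other]
• 11-bit address (inputs): Inst[30,14:12,6:2] (9 bits), BrEq, BrLT
• 15 data bits (outputs): PCSel, ImmSel[2:0] (3 bits), BrUn, ASel, BSel, ALUSel[3:0] (4 bits), MemRW, RegWEn, WBSel[1:0] (2 bits)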
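One way to picture the ROM is as a 2^11-entry array of 15-bit control words, indexed by an address built from the nine instruction bits plus BrEq and BrLT. The sketch below shows only the address formation. The order in which the bits are concatenated is an assumption made for this example (the slide does not specify it), and `rom_address` and `lookup` are hypothetical names.

```python
ROM_ADDRESS_BITS = 11
ROM_DATA_BITS = 15


def rom_address(inst, br_eq, br_lt):
    """Build an 11-bit ROM address from an instruction word and branch flags."""
    bit30 = (inst >> 30) & 0x1         # inst[30]
    funct3 = (inst >> 12) & 0x7        # inst[14:12]
    opcode_hi = (inst >> 2) & 0x1F     # inst[6:2]
    key9 = (bit30 << 8) | (funct3 << 5) | opcode_hi
    # Assumed ordering: instruction bits on top, then BrEq, then BrLT.
    return (key9 << 2) | (int(br_eq) << 1) | int(br_lt)


# One 15-bit control word per address; contents left as zero in this sketch.
rom = [0] * (1 << ROM_ADDRESS_BITS)


def lookup(inst, br_eq, br_lt):
    """Return the control word stored for this instruction and flag pair."""
    return rom[rom_address(inst, br_eq, br_lt)]


add_addr = rom_address(0x00000033, False, False)  # add x0, x0, x0
sub_addr = rom_address(0x40000033, False, False)  # sub x0, x0, x0
```

add and sub differ only in inst[30], so they land at different ROM addresses and can hold different ALUSel values.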
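ROM Controller Implementation
[Figure: an address decoder takes the 11 input bits (Inst[], BrEq, BrLT) and selects one row per instruction (add, sub, or, ..., jal). Each row stores the control word for that instruction (control word for add, control word for sub, control word for or, ...), which becomes the controller output (PCSel, ImmSel, …).]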
Administrivia
• Homework 2 due tomorrow 11:59pm
• Project 1 Part 1 due Monday Oct. 9
− Part 2 due Monday Oct. 16
• Midterm 1 regrades due next Tuesday
− Talk to a TA if you don’t understand a midterm question or are unsure of a regrade
Break!
Instruction Timing
IF       ID        EX      MEM     WB      Total
I-MEM    Reg Read  ALU     D-MEM   Reg W
200 ps   100 ps    200 ps  200 ps  100 ps  800 ps
Instruction Timing
• Maximum clock frequency
− fmax = 1/800 ps = 1.25 GHz
• Most blocks idle most of the time
− E.g. fmax,ALU = 1/200 ps = 5 GHz!
− How can we keep ALU busy all the time?
− 5 billion adds/sec, rather than just 1.25 billion?
− Idea: Factories use three employee shifts - equipment is always busy!

Instr  IF = 200 ps  ID = 100 ps  ALU = 200 ps  MEM = 200 ps  WB = 100 ps  Total
add    X            X            X                           X            600 ps
beq    X            X            X                                        500 ps
jal    X            X            X                                        500 ps
lw     X            X            X             X             X            800 ps
sw     X            X            X             X                          700 ps
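The per-instruction totals and the maximum clock frequency follow directly from the stage delays. The sketch below recomputes them. The stage sets per instruction are inferred from the totals in the table (the choice for jal in particular is an assumption), and the names `STAGE_PS`, `USES`, and `instruction_time_ps` are illustrative.

```python
STAGE_PS = {"IF": 200, "ID": 100, "ALU": 200, "MEM": 200, "WB": 100}

# Stages each instruction actually needs (inferred; jal's set is an assumption).
USES = {
    "add": ["IF", "ID", "ALU", "WB"],
    "beq": ["IF", "ID", "ALU"],
    "jal": ["IF", "ID", "ALU"],
    "lw":  ["IF", "ID", "ALU", "MEM", "WB"],
    "sw":  ["IF", "ID", "ALU", "MEM"],
}


def instruction_time_ps(name):
    """Sum the delays of the stages an instruction uses."""
    return sum(STAGE_PS[stage] for stage in USES[name])


# A single-cycle clock must accommodate the slowest instruction (lw).
critical_path_ps = max(instruction_time_ps(name) for name in USES)
f_max_ghz = 1000 / critical_path_ps          # 1000 ps per ns, so 1/ns = GHz
f_max_alu_ghz = 1000 / STAGE_PS["ALU"]
```

The gap between `f_max_ghz` and `f_max_alu_ghz` is the idle time that pipelining tries to recover.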
Performance Measures
• “Our” RISC-V executes instructions at 1.25 GHz
− 1 instruction every 800 ps
• Can we improve its performance?
− What do we mean with this statement?
− Not so obvious:
§ Quicker response time, so one job finishes faster?
§ More jobs per unit time (e.g. web server returning pages)?
§ Longer battery life?
Transportation Analogy
                    Sports Car   Bus
Passenger Capacity  2            50
Travel Speed        200 mph      50 mph
Gas Mileage         5 mpg        2 mpg

50 Mile trip:
                         Sports Car   Bus
Travel Time              15 min       60 min
Time for 100 passengers  750 min      120 min
Gallons per passenger    5 gallons    0.5 gallons
Computer Analogy
Transportation           Computer
Trip Time                Program execution time: e.g. time to update display
Time for 100 passengers  Throughput: e.g. number of server requests handled per hour
Gallons per passenger    Energy per task*: e.g. how many movies you can watch per battery charge or energy bill for datacenter

*Note: power is not a good measure, since a low-power CPU might run for a long time to complete one task, consuming more energy than a faster computer running at higher power for a shorter time
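“Iron Law” of Processor Performance
\[ \frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}} \]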
Instructions per Program
Determined by
• Task
• Algorithm, e.g. O(N^2) vs O(N)
• Programming language
• Compiler
• Instruction Set Architecture (ISA)
(Average) Clock cycles per Instruction
Determined by
• ISA
• Processor implementation (or microarchitecture)
• E.g. for “our” single-cycle RISC-V design, CPI = 1
• Complex instructions (e.g. strcpy), CPI >> 1
• Superscalar processors, CPI < 1 (next lecture)
Time per Cycle (1/Frequency)
Determined by
• Processor microarchitecture (determines critical path through logic gates)
• Technology (e.g. 14 nm versus 28 nm)
• Power budget (lower voltages reduce transistor speed)
Speed Tradeoff Example
• For some task (e.g. image compression) …
                 Processor A   Processor B
# Instructions   1 Million     1.5 Million
Average CPI      2.5           1
Clock rate f     2.5 GHz       2 GHz
Execution time   1 ms          0.75 ms
Processor B is faster for this task, despite executing more instructions and having a lower clock rate!
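The execution times in this example are the “Iron Law” applied to each processor: instructions times CPI divided by clock rate. A short check, with `execution_time_s` as a hypothetical helper name:

```python
def execution_time_s(instructions, cpi, clock_hz):
    """Iron Law: time = instructions * cycles/instruction * seconds/cycle."""
    return instructions * cpi / clock_hz


time_a = execution_time_s(1_000_000, 2.5, 2.5e9)   # Processor A, about 1 ms
time_b = execution_time_s(1_500_000, 1.0, 2.0e9)   # Processor B, about 0.75 ms
speedup_b_over_a = time_a / time_b
```

Processor B's lower CPI outweighs both its larger instruction count and its slower clock.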
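Energy per Task
\[ \frac{\text{Energy}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Energy}}{\text{Instruction}} \]
\[ \frac{\text{Energy}}{\text{Program}} \propto \frac{\text{Instructions}}{\text{Program}} \times C V^2 \]
• C: “capacitance”, depends on technology, processor features, e.g. # of cores
• V: supply voltage, e.g. 1 V
• Want to reduce capacitance and voltage to reduce energy/task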
Energy Tradeoff Example
• “Next-generation” processor
− C (Moore’s Law): -15%
− Supply voltage, Vsup: -15%
− Energy consumption: (1 - 0.15)^3 - 1 = -39%
• Significantly improved energy efficiency thanks to
− Moore’s Law AND
− Reduced supply voltage
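Since energy per task scales with C·V², scaling both C and V by 0.85 scales energy by 0.85³. A quick check of the -39% figure, with `relative_energy` as a hypothetical helper name:

```python
def relative_energy(c_scale, v_scale):
    """Energy relative to the old design, given scale factors for C and V."""
    return c_scale * v_scale ** 2


new_over_old = relative_energy(0.85, 0.85)   # 0.85 ** 3
change_percent = (new_over_old - 1) * 100    # negative means energy saved
```

Voltage enters squared, so the same 15% reduction in V saves roughly twice as much as a 15% reduction in C.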
Energy “Iron Law”
\[ \text{Performance}\ \left(\frac{\text{Tasks}}{\text{Second}}\right) = \text{Power}\ \left(\frac{\text{Joules}}{\text{Second}}\right) \times \text{Energy Efficiency}\ \left(\frac{\text{Tasks}}{\text{Joule}}\right) \]
• Energy efficiency (e.g., instructions/Joule) is key metric in all computing devices
• For power-constrained systems (e.g., 20 MW datacenter), need better energy efficiency to get more performance at same power
• For energy-constrained systems (e.g., 1 W phone), need better energy efficiency to prolong battery life
End of Scaling
• In recent years, industry has not been able to reduce supply voltage much, as reducing it further would mean increasing “leakage power”, where transistor switches don’t fully turn off (more like dimmer switch than on-off switch)
• Also, size of transistors, and hence capacitance, not shrinking as much as before between transistor generations
• Power becomes a growing concern – the “power wall”
• Cost-effective air-cooled chip limit around ~150 W
Processor Trends
[Figure: log-scale plot (1 to 10,000,000) over the years 1970 to 2010 of Transistors (Thousands), Frequency (MHz), Power (W), and Cores] [Olukotun, Hammond, Sutter, Smith, Batten]
Break!
Pipelining
• A familiar example:
− Getting a university degree
[Figure: four stages, Year 1, Year 2, Year 3, Year 4]
• Shortage of computer scientists (your startup is growing):
− How long does it take to educate 16,000?
Computer Scientist Education
• Option 1: serial
− 4000 graduate every 4 years
− 16,000 in 16 years, average throughput is 1000/year
• Option 2: pipelining
− 4000 enter each year; after the first 4 years, 4000 graduate each year
− 16,000 in 7 years
− Steady state throughput is 4000/year
− Resources used efficiently
− 4-fold improvement over serial education
Latency versus Throughput
• Latency
− Time from entering college to graduation
− Serial: 4 years
− Pipelining: 4 years
• Throughput
− Average number of students graduating each year
− Serial: 1000
− Pipelining: 4000
• Pipelining
− Increases throughput (4x in this example)
− But does nothing to latency
§ sometimes worse (additional overhead e.g. for shift transition)
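The numbers in the degree example can be reproduced with a small model: a cohort of 4000 students, a 4-year program, and 16,000 graduates needed. The function names below are hypothetical, and the model assumes an ideal pipeline in which a new cohort enters every year.

```python
import math


def serial_years(total, cohort_size, program_years):
    """One cohort at a time: the next enters only after the last graduates."""
    cohorts = math.ceil(total / cohort_size)
    return cohorts * program_years


def pipelined_years(total, cohort_size, program_years):
    """A new cohort enters every year; the first graduates after program_years."""
    cohorts = math.ceil(total / cohort_size)
    return program_years + (cohorts - 1)


serial = serial_years(16_000, 4_000, 4)        # 16 years
pipelined = pipelined_years(16_000, 4_000, 4)  # 7 years
serial_throughput = 16_000 / serial            # graduates per year
latency_years = 4                              # the same in both options
```

The 4x figure is the steady-state throughput ratio (4000 versus 1000 per year); over this finite run of four cohorts the elapsed time shrinks from 16 years to 7 because the pipeline takes three years to fill.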
Simultaneous versus Sequential
• What happens sequentially?
• What happens simultaneously?
[Figure: the pipelined education diagram, with overlapping 4-year cohorts and 4000 graduating each year]
Pipelining with RISC-V
Phase              t_step (Serial)  t_cycle (Pipelined)
Instruction Fetch  200 ps           200 ps
Reg Read           100 ps           200 ps
ALU                200 ps           200 ps
Memory             200 ps           200 ps
Register Write     100 ps           200 ps
t_instruction      800 ps           1000 ps
[Figure: pipeline timing diagram for the instruction sequence add t0, t1, t2; or t3, t4, t5; sll t6, t0, t3, marking t_cycle and t_instruction]
Pipelining with RISC-V
[Figure: pipeline timing diagram for the instruction sequence add t0, t1, t2; or t3, t4, t5; sll t6, t0, t3, marking t_cycle and t_instruction]
                                 Single Cycle                 Pipelining
Timing                           t_step = 100 … 200 ps        t_cycle = 200 ps
                                 Register access only 100 ps  All cycles same length
Instruction time, t_instruction  = t_cycle = 800 ps           1000 ps
Clock rate, f_s                  1/800 ps = 1.25 GHz          1/200 ps = 5 GHz
Relative speed                   1x                           4x
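The 4x relative speed is a steady-state figure. For a finite instruction sequence, the pipeline first has to fill. The sketch below compares total run time for n instructions, assuming an ideal 5-stage pipeline with no stalls or hazards; the function names are hypothetical.

```python
def single_cycle_time_ps(n, t_instruction_ps=800):
    """Single-cycle design: every instruction takes one 800 ps cycle."""
    return n * t_instruction_ps


def pipelined_time_ps(n, t_cycle_ps=200, stages=5):
    """Ideal pipeline: the first instruction takes `stages` cycles,
    and each later instruction completes one cycle after the previous one."""
    return (stages + n - 1) * t_cycle_ps


three_single = single_cycle_time_ps(3)       # 2400 ps
three_pipelined = pipelined_time_ps(3)       # 1400 ps
big_n = 1_000_000
speedup = single_cycle_time_ps(big_n) / pipelined_time_ps(big_n)
```

For one instruction the pipelined design is slower (1000 ps versus 800 ps); as n grows, the speedup approaches the clock-rate ratio of 4.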
Sequential vs Simultaneous
[Figure: pipeline timing diagram with t_cycle = 200 ps and t_instruction = 1000 ps for the instruction sequence:
add t0, t1, t2
or t3, t4, t5
sll t6, t0, t3
sw t0, 4(t3)
lw t0, 8(t3)
addi t2, t2, 1]
What happens sequentially, what happens simultaneously?
And in Conclusion, …
• Controller
− Tells universal datapath how to execute each instruction
• Instruction timing
− Set by instruction complexity, architecture, technology
− Pipelining increases clock frequency, “instructions per second”
§ But does not reduce time to complete instruction
• Performance measures
− Different measures depending on objective
§ Response time
§ Jobs/second
§ Energy per task