CS 152 Computer Architecture and Engineering CS252...

CS152ComputerArchitectureandEngineeringCS252GraduateComputerArchitecture

Lecture13–VLIW

KrsteAsanovicElectricalEngineeringandComputerSciences

UniversityofCaliforniaatBerkeley

http://www.eecs.berkeley.edu/~krstehttp://inst.eecs.berkeley.edu/~cs152

LastTimeinLecture12

§ Branchprediction– temporal,historyofasinglebranch– spatial,basedonpaththroughmultiplebranches

§ BranchHistoryTable(BHT)vs.BranchHistoryBuffer(BTB)– tradeoffincapacityversuslatency

§ Return-AddressStack(RAS)– specializedstructuretopredictsubroutinereturnaddresses

§ Fetchingmorethanonebasicblockpercycle– predictingmultiplebranches– tracecache

SuperscalarControlLogicScaling

§ EachissuedinstructionmustsomehowcheckagainstW*Linstructions,i.e.,growthinhardwareµW*(W*L)

§ Forin-ordermachines,Lisrelatedtopipelinelatenciesandcheckisdoneduringissue(interlocksorscoreboard)

§ Forout-of-ordermachines,Lalsoincludestimespentininstructionbuffers(instructionwindoworROB),andcheckisdonebybroadcastingtagstowaitinginstructionsatwriteback(completion)

§ AsWincreases,largerinstructionwindowisneededtofindenoughparallelismtokeepmachinebusy=>greaterL

=>Out-of-ordercontrollogicgrowsfasterthanW2 (~W3)3

LifetimeL

IssueGroup

PreviouslyIssued

Instructions

IssueWidthW

Out-of-OrderControlComplexity:MIPSR10000

ControlLogic

[SGI/MIPSTechnologiesInc.,1995]

SequentialISABottleneck

Checkinstructiondependencies

Superscalarprocessor

a = foo(b);

for (i=0, i<

Sequentialsourcecode

Superscalarcompiler

Findindependentoperations

Scheduleoperations

Sequentialmachinecode

Scheduleexecution

VLIW:VeryLongInstructionWord

§Multipleoperationspackedintooneinstruction§ Eachoperationslotisforafixedfunction§ Constantoperationlatenciesarespecified§ Architecturerequiresguaranteeof:

– Parallelismwithinaninstruction=>nocross-operationRAWcheck

– Nodatausebeforedataready=>nodatainterlocks6

TwoIntegerUnits,Single-CycleLatency

TwoLoad/StoreUnits,Three-CycleLatency TwoFloating-PointUnits,

Four-CycleLatency

IntOp2 MemOp1 MemOp2 FPOp1 FPOp2Int Op1

EarlyVLIWMachines

§ FPSAP120B(1976)– scientificattachedarrayprocessor– firstcommercialwideinstructionmachine– hand-codedvectormathlibrariesusingsoftwarepipeliningandloopunrolling

§Multiflow Trace(1987)– commercializationofideasfromFisher’sYalegroupincluding“tracescheduling”

– availableinconfigurationswith7,14,or28operations/instruction

– 28operationspackedintoa1024-bitinstructionword

§ Cydrome Cydra-5(1987)– 7operationsencodedin256-bitinstructionword– rotatingregisterfile

VLIWCompilerResponsibilities

§Scheduleoperationstomaximizeparallelexecution

§Guaranteesintra-instructionparallelism

§Scheduletoavoiddatahazards(nointerlocks)– TypicallyseparatesoperationswithexplicitNOPs

LoopExecution

HowmanyFPops/cycle?

for (i=0; i<N; i++)

B[i] = A[i] + C;Int1 Int 2 M1 M2 FP+ FPx

loop: fldadd x1

fsdadd x2 bne

1 fadd / 8 cycles = 0.125

loop: fld f1, 0(x1)

add x1, 8

fadd f2, f0, f1

fsd f2, 0(x2)

add x2, 8

bne x1, x3, loop

Compile

Schedule

LoopUnrolling

for (i=0; i<N; i++)

B[i] = A[i] + C;

for (i=0; i<N; i+=4)

B[i] = A[i] + C;

B[i+1] = A[i+1] + C;

B[i+2] = A[i+2] + C;

B[i+3] = A[i+3] + C;

Unroll inner loop to perform 4 iterations at once

Need to handle values of N that are not multiples of unrolling factor with final cleanup loop

SchedulingLoopUnrolledCode

loop: fld f1, 0(x1)fld f2, 8(x1)fld f3, 16(x1)fld f4, 24(x1)add x1, 32fadd f5, f0, f1fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4fsd f5, 0(x2)fsd f6, 8(x2)fsd f7, 16(x2)fsd f8, 24(x2)add x2, 32bne x1, x3, loop

Schedule

Int1 Int 2 M1 M2 FP+ FPx

Unroll 4 ways

fld f1fld f2fld f3fld f4add x1 fadd f5

fadd f6fadd f7fadd f8

fsd f5fsd f6fsd f7fsd f8add x2 bne

How many FLOPS/cycle?4 fadds / 11 cycles = 0.36

SoftwarePipelining

HowmanyFLOPS/cycle?

loop: fld f1, 0(x1)fld f2, 8(x1)fld f3, 16(x1)fld f4, 24(x1)add x1, 32fadd f5, f0, f1fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4fsd f5, 0(x2)fsd f6, 8(x2)fsd f7, 16(x2)add x2, 32fsd f8, -8(x2)bne x1, x3, loop

Int1 Int 2 M1 M2 FP+ FPxUnroll 4 ways firstfld f1fld f2fld f3fld f4

fadd f5fadd f6fadd f7fadd f8

fsd f5fsd f6fsd f7fsd f8

add x1

add x2bne

fld f1fld f2fld f3fld f4

fsd f5fsd f6fsd f7fsd f8

add x1

add x2bne

fld f1fld f2fld f3fld f4

fsd f5

add x1

loop:iterate

prolog

epilog

4 fadds / 4 cycles = 1

SoftwarePipeliningvs.LoopUnrolling

performance

Loop Unrolled

Software Pipelined

Startup overhead

Wind-down overhead

Loop Iteration

Software pipelining pays startup/wind-down costs only once per loop, not once per iteration

CS152Administrivia

§ Lab2extension,dueFridayMarch15§ PS3dueMondayMarch18§ Midtermgradeswillbereleasedtoday§ RegraderequestswillbethroughGradescope

– WindowopensFriday,3/15/19at4pm(aftersection)– WindowclosesFriday,3/22/19at12pm(beforesection)

Midterm 1 Grades: mean = 41.4, σ = 11.3CS152Administrivia

CS252Administrivia

§ ReadingsnextweekonOoO superscalarmicroprocessors

Whatiftherearenoloops?

§ Brancheslimitbasicblocksizeincontrol-flowintensiveirregularcode

§ DifficulttofindILPinindividualbasicblocks

Basicblock

TraceScheduling[Fisher,Ellis]

§ Pickstringofbasicblocks,atrace,thatrepresentsmostfrequentbranchpath

§ Useprofilingfeedback orcompilerheuristicstofindcommonbranchpaths

§ Schedulewhole“trace”atonce§ Addfixup codetocopewithbranchesjumpingoutoftrace

Problemswith“Classic”VLIW

§ Object-codecompatibility– havetorecompileallcodeforeverymachine,evenfortwomachinesinsamegeneration

§ Objectcodesize– instructionpaddingwastesinstructionmemory/cache– loopunrolling/softwarepipeliningreplicatescode

§ Schedulingvariablelatencymemoryoperations– cachesand/ormemorybankconflictsimposestaticallyunpredictablevariability

§ Knowingbranchprobabilities– Profilingrequiresansignificantextrastepinbuildprocess

§ Schedulingforstaticallyunpredictablebranches– optimalschedulevarieswithbranchpath

VLIWInstructionEncoding

§ Schemestoreduceeffectofunusedfields– Compressedformatinmemory,expandonI-cacherefill

• usedinMultiflow Trace• introducesinstructionaddressingchallenge

– Markparallelgroups• usedinTMS320C6xDSPs,IntelIA-64

– Provideasingle-opVLIWinstruction• Cydra-5UniOp instructions

Group 1 Group 2 Group 3

IntelItanium,EPICIA-64

§ EPICisthestyleofarchitecture(cf.CISC,RISC)– ExplicitlyParallelInstructionComputing(reallyjustVLIW)

§ IA-64isIntel’schosenISA(cf.x86,MIPS)– IA-64=IntelArchitecture64-bit– Anobject-code-compatibleVLIW

§ MercedwasfirstItaniumimplementation(cf.8086)– Firstcustomershipmentexpected1997(actually2001)– McKinley,secondimplementationshippedin2002– Recentversion,Poulson,eightcores,32nm,announced2011

EightCoreItanium“Poulson”[Intel2011]

§ 8cores§ 1-cycle16KBL1I&Dcaches§ 9-cycle512KBL2I-cache§ 8-cycle256KBL2D-cache§ 32MBsharedL3cache§ 544mm2 in32nmCMOS§ Over3billiontransistors

§ Coresare2-waymultithreaded§ 6instruction/cyclefetch

– Two128-bitbundles

§ Upto12insts/cycleexecute

IA-64InstructionFormat

§ Templatebitsdescribegroupingoftheseinstructionswithothersinadjacentbundles

§ Eachgroupcontainsinstructionsthatcanexecuteinparallel

Instruction 2 Instruction 1 Instruction 0 Template

128-bit instruction bundle

group i group i+1 group i+2group i-1

bundle j bundle j+1bundle j+2bundle j-1

IA-64Registers

§ 128GeneralPurpose64-bitIntegerRegisters§ 128GeneralPurpose64/80-bitFloatingPointRegisters§ 641-bitPredicateRegisters

§ GPRs “rotate”toreducecodesizeforsoftwarepipelinedloops– Rotationisasimpleformofregisterrenamingallowingoneinstructiontoaddressdifferentphysicalregistersoneachiteration

RotatingRegisterFiles

Problems:Scheduledloopsrequirelotsofregisters,Lotsofduplicatedcodeinprolog,epilog

Solution:Allocatenewsetofregistersforeachloopiteration

RotatingRegisterFile

P0P1P2P3P4P5P6P7

RotatingRegisterBase(RRB)registerpointstobaseofcurrentregisterset.Valueaddedontologicalregisterspecifier togivephysicalregisternumber.Usually,splitintorotatingandnon-rotatingregisters.

RotatingRegisterFile(PreviousLoopExample)

bloopsd f9, ()fadd f5, f4, ...ld f1, ()

Three cycle load latency encoded as difference of 3

in register specifier number (f4 - f1 = 3)

Four cycle fadd latency encoded as difference of 4

in register specifier number (f9 – f5 = 4)

bloopsd P17, ()fadd P13, P12,ld P9, () RRB=8bloopsd P16, ()fadd P12, P11,ld P8, () RRB=7bloopsd P15, ()fadd P11, P10,ld P7, () RRB=6bloopsd P14, ()fadd P10, P9,ld P6, () RRB=5bloopsd P13, ()fadd P9, P8,ld P5, () RRB=4bloopsd P12, ()fadd P8, P7,ld P4, () RRB=3bloopsd P11, ()fadd P7, P6,ld P3, () RRB=2bloopsd P10, ()fadd P6, P5,ld P2, () RRB=1

IA-64PredicatedExecution

Problem:Mispredicted brancheslimitILPSolution:Eliminatehardtopredictbrancheswithpredicatedexecution

– AlmostallIA-64instructionscanbeexecutedconditionallyunderpredicate– InstructionbecomesNOPifpredicateregisterfalse

Inst 1Inst 2br a==b, b2

Inst 3Inst 4br b3

Inst 5Inst 6

Inst 7Inst 8

Four basic blocks

Inst 1Inst 2p1,p2 <- cmp(a==b)(p1) Inst 3 || (p2) Inst 5(p1) Inst 4 || (p2) Inst 6Inst 7Inst 8

Predication

One basic block

Mahlke et al, ISCA95: On average >50% branches removed

Warning:Complicatesbypassing!

IA-64SpeculativeExecution

Problem: Branchesrestrictcompilercodemotion

Inst 1Inst 2br a==b, b2

Load r1Use r1Inst 3

Can’t move load above branch because might cause spurious exception

Load.s r1Inst 1Inst 2br a==b, b2

Chk.s r1Use r1Inst 3

Speculative load never causes exception, but sets “poison” bit on destination register

Check for exception in original home block jumps to fixup code if exception detected

Particularly useful for scheduling long latency loads early

Solution: Speculativeoperationsthatdon’tcauseexceptions

IA-64DataSpeculation

Problem:Possiblememoryhazardslimitcodescheduling

Requires associative hardware in address check table

Inst 1Inst 2Store

Load r1Use r1Inst 3

Can’t move load above store because store might be to same address

Load.a r1Inst 1Inst 2Store

Load.cUse r1Inst 3

Data speculative load adds address to address check table

Store invalidates any matching loads in address check table

Check if load invalid (or missing), jump to fixup code if so

Solution:Hardwaretocheckpointerhazards

LimitsofStaticScheduling

§ Unpredictablebranches§ Variablememorylatency(unpredictablecachemisses)§ Codesizeexplosion§ Compilercomplexity§ Despiteseveralattempts,VLIWhasfailedingeneral-purposecomputingarena(sofar).– MorecomplexVLIWarchitecturesareclosetoin-ordersuperscalarincomplexity,norealadvantageonlargecomplexapps.

§ SuccessfulinembeddedDSPmarket– SimplerVLIWswithmoreconstrainedenvironment,friendliercode.

IntelKillsItanium

§ DonaldKnuth“ …Itaniumapproachthatwassupposedtobesoterrific—untilitturnedoutthatthewished-forcompilerswerebasicallyimpossibletowrite.”

§ “IntelofficiallyannouncedtheendoflifeandproductdiscontinuanceoftheItaniumCPUfamilyonJanuary30th,2019”,Wikipedia

Acknowledgements

§ ThiscourseispartlyinspiredbypreviousMIT6.823andBerkeleyCS252computerarchitecturecoursescreatedbymycollaboratorsandcolleagues:– Arvind (MIT)– JoelEmer (Intel/MIT)– JamesHoe(CMU)– JohnKubiatowicz (UCB)– DavidPatterson(UCB)

CS 152 Computer Architecture and Engineering CS252...

Documents

System Architecture ModuleSystem Architecture Moduleathena.ecs.csus.edu/~grandajj/ME296J/1. Lectures/6. System... · System Architecture ModuleSystem Architecture Module ... Architecture

Computer Science 194-23 The Art and Science of Digital ...gamescrafters.berkeley.edu/~cs194-23/sp13/slides/lecture10.pdf · Computer Science 194-23 The Art and Science of Digital

Landscape Architecture & Architecture Portfolio

Computer Science 194-23 The Art and Science of Digital ...gamescrafters.berkeley.edu/~cs194-23/sp13/slides/lecture9.pdf · Computer Science 194-23 The Art and Science of Digital Photography

ART & ARCHITECTURE THE ARCHITECTURE

CS 61C: Great Ideas in Computer Architecture Control and ...gamescrafters.berkeley.edu/.../19/2016Sp-CS61C-L19... · Summary of the Control Signals (2/2) add sub ori lw sw beq jump

ARCHITECTURE COMMUNICATION / LAYOUTING ARCHITECTURE

Architectural Patterns Support Lecture. Software Architecture l Architecture is OVERLOADED System architecture Application architecture l Architecture

Experiment 2: Discrete BJT Op-Amps (Part I)gamescrafters.berkeley.edu/~ee140/sp11/labs/Lab2.ee140.s11.v1.pdf · Experiment 2: Discrete BJT Op-Amps (Part I) This is a three-week laboratory

CS 252 Graduate Computer Architecture Recap: …inst.eecs.berkeley.edu/~cs252/fa07/lectures/L06-VLIW.pdf · ¥Path-based predictors can achieve ... Ðhand-coded vector math libraries

Software Architecture: Architecture Constraints

Arquitetura de Computadores II - dcc.ufrj.brdcc.ufrj.br/~gabriel/arqcomp2/VLIW.pdf · técnicas de escalonamento por “software” aplicadas em tempo de compilação ou através

Alex Haw Architecture - 101008 - Interactive Architecture - Media Architecture - 211

Architecting the Enterprise? Enterprise Architecture · Enterprise Architecture is the Glue between Business Architecture and IT Architecture Business Architecture IT Architecture

ARCHITECTURE ARCHITECTURE

Landscape Architecture Landscape Architecture

The Relational Model CS 186 IS MOVING!!!!gamescrafters.berkeley.edu/~cs186/sp06/lecs/lecture2RM6up.pdf · – “Intergalactic Standard for Data” – Stands for Structured Query

EHR Architecture Architecture Overvie · openEHR Architecture Architecture Overview Keywords: EHR, reference model, architecture, openehr Issuer: openEHR Specification Program

Enterprise Architecture for Architecture Driven Planning ...sparxsystems.com/press/articles/pdf/EAforADP.pdf · Enterprise Architecture for Architecture ... Enterprise Architecture

inst.eecs.berkeley.edu/~cs61c UC Berkeley CS61C : Machine ...gamescrafters.berkeley.edu/~cs61c/fa06/lectures/40/... · by software), how do we implement this in hardware? •Recall