Advanced Microarchitecture


CS8803: Advanced Microarchitecture

Lecture 17: Multi-This, Multi-That, ...

Limits on IPC

Lam92: focused on the impact of control flow on ILP.
- Speculative execution can expose 10-400 IPC.
- Assumes no machine limitations except for control dependencies and actual dataflow dependencies.

Wall91: looked at limits more broadly.
- No branch prediction, no register renaming, no memory disambiguation: 1-2 IPC.
- …-entry bpred, 256 physical registers, perfect memory disambiguation: 4-45 IPC.
- Perfect bpred, register renaming, and memory disambiguation: 7-60 IPC.
- This paper did not consider control-independent instructions.

(Note: this is obviously a very light treatment of multi-threaded/multi-core issues; the course is focused on the core microarchitecture. You could easily make a whole separate course on multi-core/multi-threading/multi-processing issues.)

Practical Limits

Today, 1-2 IPC is sustained, far from the 10s-100s reported by limit studies. Limited by:
- branch prediction accuracy
- the underlying DFG (influenced by algorithms and the compiler)
- the memory bottleneck

- design complexity (implementation, test, validation, manufacturing, etc.)
- power
- die area

Differences Between Real Hardware and Limit Studies?
- Real branch predictors aren't 100% accurate.
- Memory disambiguation is not perfect.
- Physical resources are limited:
  - can't have infinite register renaming without an infinite PRF
  - would need an infinite-entry ROB, RS, and LSQ
  - would need 10s-100s of execution units for 10s-100s of IPC
- Bandwidth/latencies are limited; the limit studies assumed single-cycle execution, infinite fetch/commit bandwidth, and infinite memory bandwidth (perfect caching).

Bridging the Gap
[Figure: IPC (1, 10, 100) for single-issue pipelined, superscalar out-of-order (today), hypothetical aggressive superscalar out-of-order, and the limit studies. Diminishing returns w.r.t. a larger instruction window and higher issue width; power (watts) has been growing exponentially as well.]

Past the Knee of the Curve?
[Figure: performance vs. effort for scalar in-order, moderate-pipe superscalar/OOO, and very-deep-pipe aggressive superscalar/OOO. It made sense to go superscalar/OOO (good ROI); beyond that, very little gain for substantial effort.]

So how do we get more performance? Keep pushing IPC and/or frequency? Possible, but too costly: design complexity (time to market), cooling (cost), power delivery (cost), etc.

Look for other parallelism:
- ILP/IPC: fine-grained parallelism.
- Multi-programming: coarse-grained parallelism. Assumes multiple user-visible processing elements; all parallelism up to this point was user-invisible.

User Visible/Invisible
- All microarchitectural performance gains up to this point were "free": no user intervention is required beyond buying the new processor/system. Recompilation/rewriting could provide even more benefit, but you get some even if you do nothing.
- Multi-processing pushes the problem of finding the parallelism to above the ISA interface.

Workload Benefits
[Figure: runtime of Task A and Task B on one 3-wide OOO CPU vs. one 4-wide OOO CPU vs. two 3-wide OOO CPUs vs. two 2-wide OOO CPUs; the multi-CPU configurations finish both tasks sooner.]
This assumes you have two tasks/programs to execute. The point is that with parallelism, even two smaller cores may provide better overall performance than one regular core (and the smaller cores are likely to be cheaper from an area and power standpoint, since those costs tend to grow super-linearly).

If Only One Task
[Figure: the same configurations running only Task A; the second CPU sits idle, there is no benefit over one CPU, and the two 2-wide CPUs are a performance degradation.]
But you're stuck if you care about single-thread performance.

Sources of (Coarse) Parallelism
- Different applications: an MP3 player in the background while you work on Office; other background tasks (OS/kernel, virus check, etc.).
- Piped applications: gunzip -c foo.gz | grep bar | perl some-script.pl
- Within the same application: Java (scheduling, GC, etc.); explicitly coded multi-threading (pthreads, MPI, etc.) — see the pthreads sketch below.
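As a concrete illustration of that last bullet, here is a minimal pthreads sketch of coarse-grained, explicitly coded parallelism: two independent tasks that an SMP/CMP machine is free to schedule on different CPUs. The work loops are hypothetical placeholders, not code from the lecture.

/* Minimal pthreads sketch of coarse-grained parallelism: two independent
 * tasks that an SMP/CMP machine may schedule on different CPUs.
 * The work loops are hypothetical placeholders.
 * Build with: gcc tasks.c -O2 -pthread */
#include <pthread.h>
#include <stdio.h>

static void *task_a(void *arg) {             /* e.g., decode an MP3 in the background */
    (void)arg;
    long sum = 0;
    for (long i = 0; i < 100000000L; i++) sum += i;
    printf("task A done (%ld)\n", sum);
    return NULL;
}

static void *task_b(void *arg) {             /* e.g., a virus-check pass */
    (void)arg;
    long acc = 0;
    for (long i = 0; i < 100000000L; i++) acc ^= i;
    printf("task B done (%ld)\n", acc);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, task_a, NULL);  /* the OS is free to place these   */
    pthread_create(&b, NULL, task_b, NULL);  /* threads on separate CPUs/cores  */
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}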

(Execution) Latency vs. Bandwidth
- Desktop processing: typically you want an application to execute as quickly as possible (minimize latency).
- Server/enterprise processing: often throughput oriented (maximize bandwidth); the latency of an individual task is less important. Ex.: Amazon processing thousands of requests per minute: it's OK if an individual request takes a few seconds more, so long as the total number of requests is processed in time.

Benefit of MP Depends on Workload
- Limited number of parallel tasks to run on a PC; adding more CPUs than tasks provides zero performance benefit.
- Even for parallel code, Amdahl's Law will likely result in sub-linear speedup (a worked example follows below).
[Figure: runtime with 1, 2, 3, and 4 CPUs; only the parallelizable portion shrinks. In practice, the parallelizable portion may not be evenly divisible.]
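A small sketch of the Amdahl's Law arithmetic behind that sub-linear-speedup claim. The formula is the standard one; the parallel fraction p = 0.75 is an illustrative assumption, not a number from the lecture.

/* Amdahl's Law: speedup(N) = 1 / ((1 - p) + p / N), where p is the
 * parallelizable fraction of the program.  p = 0.75 is illustrative only. */
#include <stdio.h>

int main(void) {
    const double p = 0.75;                       /* assumed parallel fraction */
    for (int n = 1; n <= 4; n++) {
        double speedup = 1.0 / ((1.0 - p) + p / n);
        printf("%d CPU(s): speedup = %.2fx\n", n, speedup);
    }
    return 0;                                    /* 4 CPUs yield only ~2.29x  */
}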

Cache Coherency Protocols
- Not covered in this course; you should have seen a bunch of this in CS6290.
- Many different protocols exist, with different numbers of states and different bandwidth/performance/complexity tradeoffs.

- Current protocols are usually referred to by their states, e.g., MESI, MOESI, etc.

Shared Memory Focus
- Most small-to-medium multi-processors (these days) use some sort of shared memory.
- Shared memory doesn't scale as well to a larger number of nodes: communications are broadcast based, so the bus becomes a severe bottleneck, or you have to deal with directory-based implementations.
- Message passing doesn't need a centralized bus: you can arrange the multi-processor like a graph (nodes = CPUs, edges = independent links/routes) and have multiple communications/messages in transit at the same time.

SMP Machines
- SMP = Symmetric Multi-Processing.
- Symmetric = all CPUs are equal.
- Equal = any process can run on any CPU (contrast with older parallel systems with a master CPU and multiple worker CPUs).

[Figure: a four-socket SMP machine (CPU0-CPU3); pictures found from Google Images.]

Hardware Modifications for SMP
- Processor: mainly support for the cache coherence protocol; this touches the caches, write buffers, and LSQ. Control complexity increases, as memory latencies may be substantially more variable.
- Motherboard: multiple sockets (one per CPU); datapaths between the CPUs and the memory controller.
- Other:
  - Case: larger for a bigger mobo, better airflow.
  - Power: bigger power supply for N CPUs.
  - Cooling: need to remove N CPUs' worth of heat.

Chip-Multiprocessing
- Simple SMP on the same chip.

[Figure: Intel Smithfield block diagram and AMD dual-core Athlon FX; pictures found from Google Images.]

Shared Caches
- Resources can be shared between CPUs, e.g., the IBM Power 5:

[Figure: Power 5 diagram: CPU0 and CPU1 share the L2 cache (no need to keep two copies coherent); the L3 cache is also shared (only the tags are on-chip; the data are off-chip).]

Benefits?
- Cost: cheaper than mobo-based SMP, since all/most interface logic is integrated onto the main chip (fewer total chips, a single CPU socket, a single interface to main memory). Less power than mobo-based SMP as well, since on-die communication is more power-efficient than chip-to-chip communication.
- Performance: on-chip communication is faster.
- Efficiency: potentially better use of hardware resources than trying to make a wider/more-OOO single-threaded CPU.

Performance vs. Power
- 2x CPUs is not necessarily equal to 2x performance.

- 2x CPUs means roughly half the power budget for each; maybe a little better than that if resources can be shared.

- Back-of-the-envelope calculation: a 3.8 GHz CPU at 100W; dual-core: 50W per CPU.
  - P ∝ V^3, so V_orig^3 / V_CMP^3 = 100W / 50W, giving V_CMP ≈ 0.8 V_orig.
  - f ∝ V, so f_CMP ≈ 0.8 × 3.8 GHz ≈ 3.0 GHz.
- For a multi-threaded application, two 3.0 GHz cores are probably better than one 3.8 GHz core.
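A sketch reproducing the back-of-the-envelope arithmetic above. It assumes dynamic power scales as P ∝ C·V²·f with f ∝ V (hence P ∝ V³) and ignores leakage and any savings from shared resources; the constants are the slide's numbers.

/* Back-of-the-envelope CMP voltage/frequency scaling.  Assumes dynamic
 * power P ~ C * V^2 * f with f ~ V (so P ~ V^3); leakage and resource
 * sharing are ignored.  Build with: gcc power.c -lm */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double p_orig = 100.0;                 /* single core: 100 W        */
    const double f_orig = 3.8;                   /* ... at 3.8 GHz            */
    const double p_core = 50.0;                  /* dual-core: 50 W per core  */

    double v_ratio = cbrt(p_core / p_orig);      /* V_CMP/V_orig = (50/100)^(1/3)   */
    double f_cmp   = f_orig * v_ratio;           /* frequency scales roughly with V */

    printf("V_CMP ~= %.2f * V_orig, f_CMP ~= %.1f GHz\n", v_ratio, f_cmp);
    return 0;                                    /* ~0.79 * V_orig, ~3.0 GHz  */
}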

Simultaneous Multi-Threading (SMT)
- Uni-processor: 4-6 wide, but you're lucky if you get 1-2 IPC; poor utilization.
- SMP: 2-4 CPUs, but you need independent tasks, else poor utilization as well.
- SMT: the idea is to use a single large uni-processor as a multi-processor.

SMT (2)

[Figure: a regular CPU; a CMP at roughly 2x the hardware cost; an SMT running 4 threads at approximately 1x the hardware cost.]

Overview of SMT Hardware Changes
For an N-way (N-thread) SMT, we need:
- the ability to fetch from N threads
- N sets of registers (including PCs)
- N rename tables (RATs)
- N virtual memory spaces
But we don't need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.).

SMT Fetch
- Duplicated fetch logic: [Figure: one I$ feeding separate fetch units, one per thread PC (PC0, PC1, PC2), all feeding decode, rename, dispatch, and the RS.]
- Cycle-multiplexed fetch logic: [Figure: one I$ and one fetch unit; the thread PC is selected by (cycle % N) from PC0, PC1, PC2, then decode, etc., into the RS.]
- Alternatives: other multiplexed fetch logic; duplicate the I$ as well.
- (Not going to talk about different/better SMT fetch policies here; lots of papers are available.)

SMT Rename
- Thread #1's R12 != Thread #2's R12: the threads have separate name spaces, so we need to disambiguate.
- [Figure: two options. (1) Per-thread RATs (RAT0, RAT1), each indexed by its thread's register number and mapping into a shared PRF. (2) A single RAT indexed by the concatenation of the thread ID and the register number, mapping into the PRF.] A sketch of the concatenated-index option follows below.
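A minimal sketch of that second option: a single rename table indexed by {thread ID, architectural register number}. The table size, field widths, and example mappings are illustrative assumptions, not details from any real design.

/* Sketch of a shared SMT rename table (RAT) indexed by the concatenation
 * of the thread ID and the architectural register number, so Thread 0's
 * R12 and Thread 1's R12 map to different physical registers.
 * Sizes and the example mappings are illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define NUM_THREADS 2
#define ARCH_REGS   32                          /* architectural regs per thread */
#define LOG2_REGS   5                           /* log2(ARCH_REGS)               */

static uint16_t rat[NUM_THREADS * ARCH_REGS];   /* arch reg -> physical reg tag  */

/* RAT index = thread ID concatenated with the architectural register number. */
static unsigned rat_index(unsigned tid, unsigned arch_reg) {
    return (tid << LOG2_REGS) | arch_reg;
}

int main(void) {
    rat[rat_index(0, 12)] = 7;                  /* Thread 0: R12 -> T7  (made up) */
    rat[rat_index(1, 12)] = 23;                 /* Thread 1: R12 -> T23 (made up) */
    printf("Thread 0 R12 -> T%u, Thread 1 R12 -> T%u\n",
           rat[rat_index(0, 12)], rat[rat_index(1, 12)]);
    return 0;
}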

SMT Issue, Exec, Bypass: no change needed.

Before renaming:

  Thread 0:                    Thread 1:
    Add  R1 = R2 + R3            Add  R1 = R2 + R3
    Sub  R4 = R1 - R5            Sub  R4 = R1 - R5
    Xor  R3 = R1 ^ R4            Xor  R3 = R1 ^ R4
    Load R2 = 0[R3]              Load R2 = 0[R3]

After renaming:

  Thread 0:                    Thread 1:
    Add  T12 = T20 + T8          Add  T17 = T29 + T3
    Sub  T19 = T12 - T16         Sub  T5  = T17 - T2
    Xor  T14 = T12 ^ T19         Xor  T31 = T17 ^ T5
    Load T23 = 0[T14]            Load T25 = 0[T31]

Shared RS entries: after renaming, both threads' instructions can occupy the same reservation-station pool:

    Add  T12 = T20 + T8
    Sub  T19 = T12 - T16
    Xor  T14 = T12 ^ T19
    Load T23 = 0[T14]
    Add  T17 = T29 + T3
    Sub  T5  = T17 - T2
    Xor  T31 = T17 ^ T5
    Load T25 = 0[T31]

SMT Cache
- Each process has its own virtual address space.
- The TLB must be thread-aware: translate (thread-id, virtual address) pairs.
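A minimal sketch of a thread-aware TLB lookup under those constraints: the thread ID (or an ASID) is part of the match, so identical virtual addresses from different threads can translate to different physical frames. The entry count, page size, and fully associative search are illustrative assumptions.

/* Sketch of a thread-aware TLB: the match includes the thread ID (or ASID)
 * alongside the virtual page number, since each SMT thread may run in a
 * different virtual address space.  Sizes and the fully associative search
 * are illustrative only. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12                           /* 4 KB pages */

struct tlb_entry {
    bool     valid;
    uint8_t  tid;                                /* owning thread / ASID   */
    uint64_t vpn;                                /* virtual page number    */
    uint64_t pfn;                                /* physical frame number  */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Translate (thread-id, virtual address); returns true on a TLB hit. */
static bool tlb_lookup(uint8_t tid, uint64_t vaddr, uint64_t *paddr) {
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].tid == tid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].pfn << PAGE_SHIFT) | (vaddr & ((1ull << PAGE_SHIFT) - 1));
            return true;                         /* hit */
        }
    }
    return false;                                /* miss: walk the page table */
}

int main(void) {
    tlb[0] = (struct tlb_entry){ .valid = true, .tid = 0, .vpn = 0x1234, .pfn = 0x55 };
    uint64_t pa;
    printf("thread 0: %s\n", tlb_lookup(0, 0x1234000, &pa) ? "hit" : "miss");
    printf("thread 1: %s\n", tlb_lookup(1, 0x1234000, &pa) ? "hit" : "miss");
    return 0;                                    /* same vaddr, different thread: miss */
}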
