Intel Pentium M


  • Intel Pentium M

  • Outline
    - History
    - P6 Pipeline in detail
    - New features
    - Improved Branch Prediction
    - Micro-ops fusion
    - SpeedStep technology
    - Thermal Throttle 2
    - Power and Performance

  • Quick Review of x86
    - 8080: 8-bit
    - 8086/8088: 16-bit (8088 had an 8-bit external data bus); segmented memory model
    - 286: introduction of protected mode, which included segment limit checking, privilege levels, and read-only and execute-only segment options
    - 386: 32-bit; segmented and flat memory model; paging
    - 486: first pipeline (expanded the 386's ID and EX units into a five-stage pipeline); first to include on-chip cache; integrated x87 FPU (previously a coprocessor)
    - Pentium (586): first superscalar; included two pipelines, u and v; virtual-8086 mode extensions (virtual-8086 mode itself dates from the 386); MMX soon after
    - Pentium Pro (686 or P6): three-way superscalar; dynamic execution (out-of-order execution, branch prediction, speculative execution); very successful micro-architecture
    - Pentium 2 and 3: both P6
    - Pentium 4: new NetBurst architecture
    - Pentium M: enhanced P6

  • Pentium Pro Roots
    - NexGen 586 (1994)
      - Decomposes IA-32 instructions into simpler RISC-like operations (R-ops or micro-ops)
      - Decoupled approach
      - NexGen bought by AMD
    - AMD K5 (1995) also used micro-ops
    - Intel Pentium Pro
      - Intel's first use of decoupled architecture

  • Pentium-M Overview
    - Introduced March 12, 2003
    - Initially called Banias
    - Created by Israeli team
    - Missed deadline by less than 5 days
    - Marketed with Intel's Centrino initiative
    - Based on P6 microarchitecture

  • P6 Pipeline in a Nutshell
    - Divided into three clusters (front, middle, back)
      - In-order front-end
      - Out-of-order execution core
      - Retirement
    - Each cluster is independent
      - E.g., if a mispredicted branch is detected in the front-end, the front-end will flush and refetch from the corrected branch target, all while the execution core continues working on previous instructions

  • P6 Pipeline in a Nutshell

  • P6 Front-End
    - Major units: IFU, ID, RAT, Allocator, BTB, BAC
    - Fetching (IFU)
      - Includes I-cache, I-streaming cache, ITLB, ILD
      - No pre-decoding
      - Boundary markings by instruction-length decoder (ILD)
    - Branch prediction
      - Predicted (speculative) instructions are marked
    - Decoding (ID)
      - Conversion of instructions (macro-ops) into micro-ops
    - Allocation of buffer entries: RS, ROB, MOB

  • P6 Execution Core
    - Reservation Station (RS)
      - Waiting micro-ops ready to go
      - Scheduler
    - Out-of-order execution of micro-ops
      - Independent execution units (EU)
      - Must be careful about out-of-order memory access
      - Memory ordering buffer (MOB) interfaces to the memory subsystem
    - Requirements for execution
      - Available operands, EU, and write-back bus
      - Optimal performance

  • P6 Retirement
    - In-order updating of architected machine state
      - Re-order buffer (ROB)
    - Micro-op retirement is all or none
      - Architecturally illegal to retire only part of an IA-32 instruction
    - In-order handling of exceptions
      - Legal to handle mid-execution, but illegal to handle mid-retirement
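The all-or-none rule can be illustrated with a small model. This is only a sketch under simplifying assumptions: the `MicroOp` class and `retire` function are hypothetical names, the ROB is a plain queue, and the real per-cycle retirement width is ignored.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class MicroOp:
    instr: int   # index of the parent IA-32 instruction (hypothetical field)
    done: bool   # True once the micro-op has executed and written back


def retire(rob):
    """Retire whole instructions from the head of the ROB, in program order.

    An instruction retires only if every one of its micro-ops is done;
    otherwise retirement stops, so no instruction is ever partly retired.
    """
    retired = []
    while rob:
        head = rob[0].instr
        # Micro-ops of one instruction sit next to each other in the ROB.
        group = [u for u in rob if u.instr == head]
        if not all(u.done for u in group):
            break  # the oldest instruction is incomplete: nothing younger may retire
        for _ in group:
            rob.popleft()
        retired.append(head)
    return retired


rob = deque([
    MicroOp(0, True), MicroOp(0, True),    # instruction 0: complete
    MicroOp(1, True), MicroOp(1, False),   # instruction 1: one micro-op pending
    MicroOp(2, True),                      # instruction 2: complete but younger
])
first = retire(rob)    # only instruction 0 can leave the ROB
rob[1].done = True     # instruction 1's last micro-op finishes
second = retire(rob)   # instructions 1 and 2 now retire in order
```

Instruction 2 finished early, yet it stays in the ROB until instruction 1 completes; that is the in-order property the slide describes.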

  • PM Changes to P6
    - Most changes made in P6 front-end
      - Added and expanded on P4 branch predictor
      - Micro-ops fusion
      - Addition of dedicated stack engine
    - Pipeline length
      - Longer than P3, shorter than P4
      - Accommodates extra features above

  • PM Changes to P6, cont.
    - Intel has not released the exact length of the pipeline.
    - Known to be somewhere between the P3 (10 stages) and the P4 (20 stages). Rumored to be 12 stages.
    - Trades slightly lower clock frequencies (than the P4) for better performance per clock and smaller branch misprediction penalties.

  • Blue Man Group Commercial Break

  • Banias
    - 1st version
    - 77 million transistors, 23 million more than P4
    - 1 MB on-die Level 2 cache
    - 400 MHz FSB (quad-pumped 100 MHz)
    - 130 nm process
    - Frequencies between 1.3 and 1.7 GHz
    - Thermal Design Point of 24.5 watts

    http://www.intel.com/pressroom/archive/photos/centrino.htm

  • Dothan
    - Launched May 10, 2004
    - 140 million transistors
    - 2 MB Level 2 cache
    - 400 or 533 MHz FSB
    - Frequencies from 1.0 to 2.26 GHz
    - Thermal Design Point of 21 watts (400 MHz FSB) to 27 watts

    http://www.intel.com/pressroom/archive/photos/centrino.htm

  • Dothan cont.
    - 90 nm process technology on 300 mm wafers
      - 300 mm wafers provide twice the capacity of 200 mm wafers, while the smaller process dimensions double the transistor density
    - Gate dimensions are 50 nm, or approximately half the diameter of the influenza virus
    - P and n gate voltages are reduced by enhancing the carrier mobility of the Si lattice by 10-20%
    - Draws less than 1 W average power

  • Bus
    - Utilizes a split-transaction, deferred-reply protocol
    - 64-bit width
    - Delivers up to 3.2 GB/s (Banias) or 4.2 GB/s (Dothan) in and out of the processor
    - Utilizes source-synchronous transfer of addresses and data
    - Data transferred 4 times per bus clock
    - Addresses can be delivered 2 times per bus clock
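The peak bandwidth figures follow from bus width, transfers per clock, and bus clock. The sketch below works through that arithmetic; the function name is hypothetical, and the 133.33 MHz base clock for the 533 MHz FSB is an assumption based on quad pumping. The result comes out in gigabytes (not gigabits) per second.

```python
def peak_bandwidth_gb_per_s(bus_clock_mhz, transfers_per_clock=4, width_bits=64):
    """Peak data bandwidth of a source-synchronous bus, in GB/s (10^9 bytes/s)."""
    bytes_per_transfer = width_bits // 8          # 64-bit bus -> 8 bytes per transfer
    transfers_per_second = bus_clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bytes_per_transfer / 1e9


banias = peak_bandwidth_gb_per_s(100.0)    # 400 MHz FSB = 100 MHz x 4
dothan = peak_bandwidth_gb_per_s(133.33)   # 533 MHz FSB = 133 MHz x 4 (assumed)
print(f"Banias: {banias:.2f} GB/s, Dothan (533 MHz FSB): {dothan:.2f} GB/s")
```

These are theoretical peaks; sustained throughput is lower because of arbitration, address phases, and deferred replies.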

  • Bus update in Dothan

    http://www.intel.com/technology/itj/2005/volume09issue01/art05_perf_power

  • L1 Cache
    - 64 KB total
      - 32 KB instruction
      - 32 KB data (4 times P4M)
    - Write-back, vs. write-through on P4
      - In a write-through cache, data is written to both L1 and main memory simultaneously
      - In a write-back cache, data can be updated in L1 without immediately writing to main memory, increasing speed by reducing the number of slow memory writes
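The difference between the two write policies can be seen by counting memory writes in a toy model. This is a minimal sketch, not a model of the actual Pentium M cache: `TinyCache` is a hypothetical class with no sets, tags, or capacity limit, and eviction is triggered by hand.

```python
class TinyCache:
    """Toy single-level cache that counts writes reaching main memory."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.lines = {}            # address -> (value, dirty flag)
        self.memory = {}           # stands in for main memory
        self.memory_writes = 0

    def write(self, addr, value):
        if self.write_back:
            # Write-back: update the cache only and mark the line dirty.
            self.lines[addr] = (value, True)
        else:
            # Write-through: update the cache and memory together.
            self.lines[addr] = (value, False)
            self._write_memory(addr, value)

    def evict(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:
            self._write_memory(addr, value)   # flush the dirty line once

    def _write_memory(self, addr, value):
        self.memory[addr] = value
        self.memory_writes += 1


write_through = TinyCache(write_back=False)
write_back = TinyCache(write_back=True)
for cache in (write_through, write_back):
    for value in range(10):        # ten stores to the same address
        cache.write(0x1000, value)
    cache.evict(0x1000)
```

After ten stores to one address, the write-through cache has written memory ten times and the write-back cache only once, while both leave the same final value in memory.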

  • L2 cache
    - 1 to 2 MB
    - 8-way set associative
    - Each way is divided into 4 separate power quadrants
    - Each individual power quadrant can be set to a sleep mode, shutting off power to that quadrant
    - Allows for only 1/32 of the cache (1 of 8 ways x 4 quadrants) to be powered at any time
    - Increased latency vs. improved power consumption

  • Prefetch
    - Prefetch logic fetches data to the Level 2 cache before L1 cache requests occur
    - Reduces compulsory misses due to an increase of valid data in cache
    - Reduces bus cycle penalties

  • Schedule
    - P6 Pipeline in detail
      - Front-End
      - Execution Core
      - Back-End
    - Power Issues
      - Intel SpeedStep
    - Testing the Features
      - x86 system registers
      - Performance Testing

  • P6 Front-end: Instruction Fetching
    - IA-32 Memory Management
      - Classic segmented model (cannot be disabled in protected mode)
        - Separation of code, data, and stack into "segments"
      - Optional paging
        - Segments divided into pages (typically 4 KB)
        - Additional protection on top of segment protection
        - E.g., provides read-write protection on a page-by-page basis
    - Stage 11 (stage 1): selection of address for next I-cache access
      - Speculative address chosen from competing sources (e.g., BTB, BAC, loop detector)
      - Calculation of linear address from logical address (segment selector + offset)
        - Segment selector indexes into a table of segment descriptors, which include the base address, size, type, and access rights of the segment
        - Remember: only six segment registers, so only six segments are usable at a time
        - 32-bit code nowadays uses the flat model, so the OS can make do with only a few (typically four) segments
      - IFU chooses the address with the highest priority and sends it to stage two
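The logical-to-linear address calculation can be sketched as follows. This is a simplified illustration: `SegmentDescriptor` and `linear_address` are hypothetical names, the selector is treated as a bare table index (real selectors also carry a table indicator and privilege bits), and type and access-rights checks are omitted.

```python
from dataclasses import dataclass


@dataclass
class SegmentDescriptor:
    base: int    # linear address where the segment starts
    limit: int   # highest valid offset within the segment


def linear_address(table, selector_index, offset):
    """Translate (segment selector, offset) into a 32-bit linear address."""
    descriptor = table[selector_index]
    if offset > descriptor.limit:
        # Stands in for the protection fault raised on a limit violation.
        raise ValueError("offset exceeds segment limit")
    return (descriptor.base + offset) & 0xFFFFFFFF


# Entry 0 models a flat segment (base 0, 4 GB limit); entry 1 a small 64 KB segment.
table = [
    SegmentDescriptor(base=0x00000000, limit=0xFFFFFFFF),
    SegmentDescriptor(base=0x00010000, limit=0x0000FFFF),
]
flat = linear_address(table, 0, 0x1234)     # flat model: linear == offset
small = linear_address(table, 1, 0x0020)    # base + offset
```

With a flat segment the base is zero, so the linear address equals the offset; this is why flat-model code can largely ignore segmentation even though it cannot be disabled.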

  • P6 Front-end: Instruction Fetching
    - Stages 12-13: accessing of caches
      - Accesses instruction caches with the address calculated in stage one
        - Includes standard cache, victim cache, and streaming buffer
        - With paging, consults ITLB to determine physical page number (tag bits)
        - Without paging, linear address from stage one becomes physical address
      - Obtains branch prediction from branch target buffer (BTB)
        - BTB takes two cycles to complete one access
      - Instruction boundary (ILD) and BTB markings
    - Stage 14: completion of instruction cache access
      - Instructions and their marks are sent to the instruction buffer or steered to ID

  • P6 Front-end: Instruction Fetching

  • P6 Front-end: Instruction Decoding
    - Stages 15-16: decoding of IA-32 instructions
      - Alignment of instruction bytes
      - Identification of the ends of up to three instructions
      - Conversion of instructions into micro-ops
    - Stage 17: branch decoding
      - If the ID notices a branch that went unpredicted by the BTB (i.e., the BTB had never seen the branch before), it flushes the in-order pipe and refetches from the branch target
      - Branch target calculated by BAC
      - Early catch saves speculative instructions from being sent through the pipeline
    - Stage 21: register allocation and renaming
      - Coincides with stage 17 (a reminder of independent working units)
      - Allocator used to allocate required entries in ROB, RS, LB, and SB
      - Register Alias Table (RAT) consulted
        - Maps logical sources/destinations to physical entries in the ROB (or sometimes RRF)
    - Stage 22: completion of front-end
      - Marked micro-ops are forwarded to RS and ROB, where they await execution and retirement, respectively

  • P6 Front-end: Instruction Decoding

  • Register Alias Table Introduction
    - Provides register renaming of integer and floating-point registers and flags
    - Maps logical (architected) entries to physical entries, usually in the re-order buffer (ROB)
    - Physical entries are actually allocated by the Allocator
    - The physical entry pointers become a part of the micro-op's overall state as it travels through the pipeline

  • RAT Details
    - P6 is 3-way superscalar, so the RAT must be able to rename up to six logical sources per cycle
    - Any data dependences must be handled
    - Ex:
        op1) ADD EAX, EBX, ECX (dest. = EAX)
        op2) ADD EAX, EAX, EDX
        op3) ADD EDX, EAX, EDX
    - Instead of making op2 wait for op1 to retire, the RAT provides data forwarding
    - Same case for op3, but the RAT must make sure that it gets the result from op2 and not op1
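The three-op example can be traced with a small renaming model. This is a sketch under stated assumptions: `rename` is a hypothetical function, ROB entries are numbered from zero, the `ROBn` and `RRF.reg` labels are invented for illustration, and the loop handles micro-ops one at a time, whereas the hardware resolves dependences among all three in the same cycle.

```python
def rename(uops):
    """Rename (dest, src1, src2) micro-ops the way a register alias table would.

    Each source is mapped to the ROB entry of its most recent producer, or to
    the retired register file (RRF) if no in-flight micro-op writes it.
    """
    rat = {}        # logical register -> ROB entry of the latest in-flight writer
    renamed = []
    for rob_index, (dest, src1, src2) in enumerate(uops):
        # Sources are looked up before the destination mapping is updated,
        # so "ADD EAX, EAX, EDX" reads the previous producer of EAX.
        sources = tuple(
            f"ROB{rat[s]}" if s in rat else f"RRF.{s}" for s in (src1, src2)
        )
        rat[dest] = rob_index      # later readers of dest now see this entry
        renamed.append((f"ROB{rob_index}",) + sources)
    return renamed


uops = [
    ("EAX", "EBX", "ECX"),   # op1
    ("EAX", "EAX", "EDX"),   # op2: reads op1's EAX
    ("EDX", "EAX", "EDX"),   # op3: must read op2's EAX, not op1's
]
for line in rename(uops):
    print(line)
```

Here op2 sources its EAX from op1's ROB entry, and op3 sources EAX from op2's entry because the table always points at the latest writer.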

  • RAT Implementation Difficulties
    - Speculative renaming
      - Since speculative micro-ops flow by, the RAT must be able to undo its mappings in the case of a branch misprediction
    - Partial-width register reads and writes
      - Consider a partial-width write followed by a larger-width read
      - The data required by the read is a combination of multiple previous writes to the register; to guarantee correctness, the RAT must stall the pipeline
    - Retirement overrides
      - Common interaction between RAT and ROB
      - When a micro-op retires, its ROB entry is removed and its result may be latched into an architected destination register
      - If any active micro-ops source the retired op's destination, they must not reference the outdated ROB entry
    - Mismatch stalls
