Intel Pentium M

Outline
- History
- P6 pipeline in detail
- New features
  - Improved branch prediction
  - Micro-ops fusion
  - SpeedStep technology
  - Thermal Throttle 2
- Power and performance

Quick Review of x86
8080 - 8-bit
8086/8088 - 16-bit (8088 had 8-bit external data bus)
 - segmented memory model
286
 - introduction of protected mode, which included segment limit checking, privilege levels, read- and exe-only segment options
386 - 32-bit
 - segmented and flat memory model
 - paging
486 - first pipeline
 - expanded the 386's ID and EX units into five-stage pipeline
 - first to include on-chip cache
 - integrated x87 FPU (before it was a coprocessor)

Pentium (586) - first superscalar
 - included two pipelines, u and v
 - virtual-8086 mode extensions (virtual-8086 mode itself arrived with the 386)
 - MMX soon after

Pentium Pro (686 or P6) - three-way superscalar
 - dynamic execution: out-of-order execution, branch prediction, speculative execution
 - very successful micro-architecture
Pentium 2 and 3 - both P6
Pentium 4 - new NetBurst architecture
Pentium M - enhanced P6

Pentium Pro Roots
- NexGen 586 (1994)
  - Decomposes IA32 instructions into simpler RISC-like operations (R-ops or micro-ops)
  - Decoupled approach
- NexGen bought by AMD
  - AMD K5 (1995) also used micro-ops
- Intel Pentium Pro
  - Intel's first use of decoupled architecture

Pentium-M Overview
- Introduced March 12, 2003
- Initially called Banias
- Created by Israeli team
- Missed deadline by less than 5 days
- Marketed with Intel's Centrino initiative
- Based on P6 microarchitecture

P6 Pipeline in a Nutshell
- Divided into three clusters (front, middle, back)
  - In-order Front-End
  - Out-of-order Execution Core
  - Retirement
- Each cluster is independent
  - E.g. if a mispredicted branch is detected in the front-end, the front-end will flush and re-fetch from the corrected branch target, all while the execution core continues working on previous instructions

P6 Front-End
- Major units: IFU, ID, RAT, Allocator, BTB, BAC
- Fetching (IFU)
  - Includes I-cache, I-streaming cache, ITLB, ILD
  - No pre-decoding
  - Boundary markings by instruction-length decoder (ILD)
- Branch prediction
  - Predicted (speculative) instructions are marked
- Decoding (ID)
  - Conversion of instructions (macro-ops) into micro-ops
- Allocation of buffer entries: RS, ROB, MOB

P6 Execution Core
- Reservation Station (RS)
  - Waiting micro-ops ready to go
  - Scheduler
- Out-of-order execution of micro-ops
  - Independent execution units (EU)
- Must be careful about out-of-order memory access
  - Memory ordering buffer (MOB) interfaces to the memory subsystem
- Requirements for execution
  - Available operands, EU, and write-back bus
  - Optimal performance

P6 Retirement
- In-order updating of architected machine state
  - Re-order buffer (ROB)
- Micro-op retirement: "all or none"
  - Architecturally illegal to retire only part of an IA-32 instruction
- In-order handling of exceptions
  - Legal to handle mid-execution, but illegal to handle mid-retirement

PM Changes to P6
- Most changes made in P6 front-end
- Added and expanded on P4 branch predictor
- Micro-ops fusion
- Addition of dedicated stack engine
- Pipeline length
  - Longer than P3, shorter than P4
  - Accommodates extra features above

PM Changes to P6, cont.
- Intel has not released the exact length of the pipeline.
- Known to be somewhere between the P4 (20 stage) and the P3 (10 stage). Rumored to be 12 stages.
- Trades off slightly lower clock frequencies (than P4) for better performance per clock, smaller branch misprediction penalties, ...

Blue Man Group Commercial Break

Banias
- 1st version
- 77 million transistors, 23 million more than P4
- 1 MB on-die Level 2 cache
- 400 MHz FSB (quad-pumped 100 MHz)
- 130 nm process
- Frequencies between 1.3 and 1.7 GHz
- Thermal Design Point of 24.5 watts
http://www.intel.com/pressroom/archive/photos/centrino.htm


Dothan
- Launched May 10, 2004
- 140 million transistors
- 2 MB Level 2 cache
- 400 or 533 MHz FSB
- Frequencies between 1.0 and 2.26 GHz
- Thermal Design Point of 21 W (400 MHz FSB) to 27 W




Dothan cont.
- 90 nm process technology on 300 mm wafers
  - Provide twice the capacity of the 200 mm wafers, while the process dimensions double the transistor density
- Gate dimensions are 50 nm, or approx. half the diameter of the influenza virus
- P and n gate voltages are reduced by enhancing the carrier mobility of the Si lattice by 10-20%
- Draws less than 1 W average power

Bus
- Utilizes a split-transaction, deferred-reply protocol
- 64-bit width
- Delivers up to 3.2 GB/s (Banias) or 4.2 GB/s (Dothan) in and out of the processor
- Utilizes source-synchronous transfer of addresses and data
  - Data transferred 4 times per bus clock
  - Addresses can be delivered times per bus clock
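The peak figures on the bus slide follow from width x transfers per clock x bus clock. A minimal sketch of that arithmetic, assuming the 533 MHz FSB is a quad-pumped 133.33 MHz clock (the slides only state the 100 MHz base clock for Banias); the function name is illustrative:

```python
def peak_bandwidth_bytes_per_s(bus_clock_hz, transfers_per_clock, width_bits):
    """Peak bandwidth = bus clock x transfers per clock x bytes per transfer."""
    return bus_clock_hz * transfers_per_clock * (width_bits // 8)


# Banias: 100 MHz bus clock, quad-pumped, 64-bit data bus (from the slides).
banias = peak_bandwidth_bytes_per_s(100_000_000, 4, 64)

# Dothan: ASSUMED 133.33 MHz bus clock behind the "533 MHz" FSB figure.
dothan = peak_bandwidth_bytes_per_s(133_333_333, 4, 64)

print(f"Banias: {banias / 1e9:.2f} GB/s")
print(f"Dothan: {dothan / 1e9:.2f} GB/s")
```

Both results are in bytes per second, which is why the 64-bit bus width is divided by 8 before multiplying.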

Bus update in Dothan

http://www.intel.com/technology/itj/2005/volume09issue01/art05_perf_power

L1 Cache
- 64 KB total
  - 32 KB instruction
  - 32 KB data (4 times P4M)
- Write-back vs. write-through on P4
  - In a write-through cache, data is written to both L1 and main memory simultaneously
  - In a write-back cache, data is written to the cache only and reaches main memory later (when the line is evicted), increasing speed by reducing the number of slow memory writes
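The difference between the two write policies can be made concrete by counting the writes that reach main memory. This is a toy sketch, not a model of the Pentium M L1: it assumes a cache holding a single line, and `TinyCache` and `count_memory_writes` are illustrative names:

```python
class TinyCache:
    """A one-line cache that only counts writes reaching main memory."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.line_addr = None     # address currently cached, if any
        self.dirty = False        # write-back only: line differs from memory
        self.memory_writes = 0

    def _evict(self):
        if self.write_back and self.dirty:
            self.memory_writes += 1   # flush the modified line once
        self.dirty = False

    def store(self, addr):
        if addr != self.line_addr:
            self._evict()
            self.line_addr = addr
        if self.write_back:
            self.dirty = True         # defer the memory write
        else:
            self.memory_writes += 1   # write-through: every store goes out

    def flush(self):
        self._evict()


def count_memory_writes(write_back, addresses):
    cache = TinyCache(write_back)
    for addr in addresses:
        cache.store(addr)
    cache.flush()
    return cache.memory_writes


stores = [0x100] * 8 + [0x200] * 8
print("write-through:", count_memory_writes(False, stores))
print("write-back:   ", count_memory_writes(True, stores))
```

Repeated stores to the same line are where write-back pays off: only the final contents of each line are written out.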

L2 cache
- 1 to 2 MB
- 8-way set associative
- Each way is divided into 4 separate power quadrants (8 ways x 4 quadrants = 32)
  - Each individual power quadrant can be set to a sleep mode, shutting off power to that quadrant
  - Allows for only 1/32 of cache to be powered at any time
- Increased latency vs. improved power consumption

Prefetch
- Prefetch logic fetches data to the level 2 cache before L1 cache requests occur
- Reduces compulsory misses due to an increase of valid data in cache
- Reduces bus cycle penalties

Schedule
- P6 pipeline in detail
  - Front-End
  - Execution Core
  - Back-End
- Power issues
  - Intel SpeedStep
- Testing the features
  - x86 system registers
  - Performance testing

IA-32 Memory Management
- Classic segmented model (cannot be disabled in protected mode)
  - Separation of code, data, and stack into "segments"
- Optional paging
  - Segments divided into pages (typically 4 KB)
  - Additional protection to segment-protection
    - I.e. provides read-write protection on a page-by-page basis

P6 Front-end: Instruction Fetching
- Stage 11 (stage 1) - Selection of address for next I-cache access
  - Speculation: address chosen from competing sources (e.g. BTB, BAC, loop detector)
  - Calculation of linear address from logical (segment selector + offset)
    - Segment selector: index into a table of segment descriptors, which include base address, size, type, and access rights of the segment
    - Remember: only six segment registers hold selectors, so only six segments are usable at a time
    - 32-bit code nowadays uses flat model, so OS can make do with only a few (typically four) segments
  - IFU chooses address with highest priority and sends it to stage two
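The Stage 11 calculation of a linear address from a logical one (segment selector + offset) can be sketched as below. This is a simplification: the selector is treated as a plain table index (the real selector's table-indicator and privilege bits are ignored), and `SegmentDescriptor`, `linear_address`, and the table contents are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class SegmentDescriptor:
    base: int    # linear address where the segment starts
    limit: int   # highest valid offset within the segment


def linear_address(table, selector_index, offset):
    """Translate a logical address (selector index + offset) to a linear one."""
    desc = table[selector_index]
    if offset > desc.limit:
        raise ValueError("offset exceeds segment limit")
    return (desc.base + offset) & 0xFFFFFFFF


# Hypothetical descriptor table: entry 1 is a flat 4 GB segment,
# entry 2 is a small segment based at 0x00400000.
table = {
    1: SegmentDescriptor(base=0x00000000, limit=0xFFFFFFFF),
    2: SegmentDescriptor(base=0x00400000, limit=0x0000FFFF),
}

print(hex(linear_address(table, 1, 0x08048000)))  # flat model: linear == offset
print(hex(linear_address(table, 2, 0x1234)))
```

With the flat model (base 0, limit 4 GB) the linear address equals the offset, which is why a flat-model OS needs so few segments.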


- Stage 12-13 - Accessing of caches
  - Accesses instruction caches with address calculated in stage one
    - Includes standard cache, victim cache, and streaming buffer
  - With paging, consults ITLB to determine physical page number (tag bits)
  - Without paging, linear address from stage one becomes physical address
  - Obtains branch prediction from branch target buffer (BTB)
    - BTB takes two cycles to complete one access
  - Instruction boundary (ILD) and BTB markings
- Stage 14 - Completion of instruction cache access
  - Instructions and their marks are sent to instruction buffer or steered to ID


P6 Front-end: Instruction Decoding
- Stage 15-16 - Decoding of IA32 instructions
  - Alignment of instruction bytes
  - Identification of the ends of up to three instructions
  - Conversion of instructions into micro-ops
- Stage 17 - Branch decoding
  - If the ID notices a branch that went unpredicted by the BTB (i.e. if the BTB had never seen the branch before), flushes the in-order pipe and re-fetches from the branch target
  - Branch target calculated by BAC
  - Early catch saves speculative instructions from being sent through the pipeline
- Stage 21 - Register allocation and renaming
  - Synonymous with stage 17 (a reminder of independent working units)
  - Allocator used to allocate required entries in ROB, RS, LB, and SB
  - Register Alias Table (RAT) consulted
    - Maps logical sources/destinations to physical entries in the ROB (or sometimes RRF)
- Stage 22 - Completion of front-end
  - Marked micro-ops are forwarded to RS and ROB, where they await execution and retirement, respectively.

Register Alias Table Introduction
- Provides register renaming of integer and floating-point registers and flags
- Maps logical (architected) entries to physical entries, usually in the re-order buffer (ROB)
- Physical entries are actually allocated by the Allocator
- The physical entry pointers become a part of the micro-op's overall state as it travels through the pipeline

RAT Details
- P6 is 3-way super-scalar, so the RAT must be able to rename up to six logical sources per cycle
- Any data dependences must be handled
  - Ex:
      op1) ADD EAX, EBX, ECX (dest. = EAX)
      op2) ADD EAX, EAX, EDX
      op3) ADD EDX, EAX, EDX
  - Instead of making op2 wait for op1 to retire, the RAT provides data forwarding
  - Same case for op3, but RAT must make sure that it gets the result from op2 and not op1
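The op1/op2/op3 example can be traced with a small renaming sketch. It is an illustration only: the real RAT renames up to three micro-ops in parallel within one cycle, whereas this loop handles them one at a time, and the `ROBn` / `RRF.reg` labels are made up for readability:

```python
def rename(micro_ops):
    """Rename (dest, src1, src2) micro-ops; return (rob_entry, sources) pairs."""
    rat = {}          # logical register -> ROB entry of its newest producer
    renamed = []
    for rob_entry, (dest, src1, src2) in enumerate(micro_ops):
        # Sources are looked up BEFORE the destination mapping is updated,
        # so an op that reads and writes the same register sees the older value.
        sources = tuple(
            f"ROB{rat[s]}" if s in rat else f"RRF.{s}" for s in (src1, src2)
        )
        rat[dest] = rob_entry
        renamed.append((f"ROB{rob_entry}", sources))
    return renamed


ops = [
    ("EAX", "EBX", "ECX"),  # op1
    ("EAX", "EAX", "EDX"),  # op2
    ("EDX", "EAX", "EDX"),  # op3
]
for entry, sources in rename(ops):
    print(entry, "<-", sources)
```

op3's EAX source resolves to op2's ROB entry rather than op1's, because the table always holds the newest producer of each logical register.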

RAT Implementation Difficulties
- Speculative renaming
  - Since speculative micro-ops flow by, the RAT must be able to undo its mappings in the case of a branch misprediction
- Partial-width register reads and writes
  - Consider a partial-width write followed by a larger-width read
    - Data required by the read is an assimilation of multiple previous writes to the register; to get the correct value, the RAT must stall the pipeline
- Retirement overrides
  - Common interaction between RAT and ROB
  - When a micro-op retires, its ROB entry is removed and its result may be latched into an architected destination register
  - If any active micro-ops source the retired op's destination, they must not reference the outdated ROB entry
- Mismatch stalls
  - Associated with flag renaming

The Allocator
- Works in conjunction with RAT to allocate required entries
- In each cycle, assumes three ROB, RS, and LB and two SB entries
  - Once micro-ops arrive, it determines how many entries are really needed
- ROB allocation
  - If three entries aren't available the allocator will stall
- RS allocation
  - A bitmap is used to determine which entries are free
  - If the RS is full, pipeline is stalled
    - RS must make sure valid entries are not overwritten
- MOB allocation
  - Allocation of LB and SB entries also done by allocator
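The bitmap-based RS allocation can be sketched as follows. The 20-entry size comes from the Reservation Station slide; the lowest-free-entry-first policy and the choice to stall whenever fewer than the requested entries are free are assumptions of this sketch, not documented behavior:

```python
RS_ENTRIES = 20  # RS size given on the Reservation Station slide


def allocate(free_bitmap, needed):
    """Return (new_bitmap, entries), or None if the allocator must stall.

    Bit i of free_bitmap is 1 when RS entry i is free.
    """
    entries = [i for i in range(RS_ENTRIES) if (free_bitmap >> i) & 1][:needed]
    if len(entries) < needed:
        return None                   # not enough free entries: stall
    for i in entries:
        free_bitmap &= ~(1 << i)      # mark entry as occupied
    return free_bitmap, entries


all_free = (1 << RS_ENTRIES) - 1
print(allocate(all_free, 3))
print(allocate(0b10, 2))              # only one entry free: stall
```

Clearing a bit only after the entry is handed out is what keeps valid entries from being overwritten.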

PM Changes to P6 Front-End
- Micro-op fusion
- Dedicated stack engine
- Enhanced branch prediction
- Additional stages
  - Intel's secret
  - Most likely required for extra functionality above

Micro-ops Fusion
- Fusion of multiple micro-ops into one micro-op
  - Less contention for buffer entries
  - Similarity to SIMD data packing
- Two examples of fusion from Intel documentation:
  - IA32 load-and-operate and store instructions
  - Not known for certain whether these are the only cases of fusion
- Possibly inspired by MacroOps used in K7 (Athlon)

Dedicated Stack Engine
- Traditional out-of-order implementations update the Stack Pointer Register (ESP) by sending a µop to update the ESP register with every stack-related instruction
- Pentium M implementation
  - A delta register (ESP_D) is maintained in the front end
  - A historic ESP (ESP_O) is then kept in the out-of-order execution core
  - Dedicated logic was added to update the ESP by adding the ESP_O with the ESP_D

Improvements
- The ESP_O value kept in the out-of-order machine is not changed during a sequence of stack operations; this allows for more parallelism opportunities to be realized
- Since ESP_D updates are now done by a dedicated adder, the execution unit is now free to work on other µops and the ALUs are freed to work on more complex operations
- Decreased power consumption since large adders are not used for small operations and the eliminated µops do not toggle through the machine
- Approximately 5% of the µops have been eliminated

Complications
- Since the new adder lives in the front end, all of its calculations are speculative. This necessitates the addition of a recovery table for all values of ESP_O and ESP_D
- If the architectural value of ESP is needed inside of the out-of-order machine, the decode logic then needs to insert a µop that will carry out the ESP calculation
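The ESP_O / ESP_D split can be sketched as a small state machine. This is a behavioral illustration under stated assumptions: 4-byte pushes and pops, no modeling of the recovery table, and hypothetical names (`StackEngine`, `read_esp`, `sync_uops`):

```python
class StackEngine:
    """Tracks ESP as a core value (ESP_O) plus a front-end delta (ESP_D)."""

    def __init__(self, esp):
        self.esp_o = esp     # value held in the out-of-order core
        self.esp_d = 0       # delta accumulated by the front-end adder
        self.sync_uops = 0   # synchronization uops inserted so far

    def push(self, size=4):
        self.esp_d -= size   # dedicated adder; no uop sent to the core

    def pop(self, size=4):
        self.esp_d += size

    def read_esp(self):
        """An instruction needs the architectural ESP inside the core."""
        if self.esp_d != 0:
            # Decode inserts one uop that folds the delta into ESP_O.
            self.esp_o = (self.esp_o + self.esp_d) & 0xFFFFFFFF
            self.esp_d = 0
            self.sync_uops += 1
        return self.esp_o


engine = StackEngine(0x1000)
engine.push()
engine.push()
engine.pop()
print(hex(engine.esp_o), engine.esp_d)   # ESP_O untouched by the sequence
print(hex(engine.read_esp()), engine.sync_uops)
```

Three stack operations cost a single synchronization uop here, instead of one ESP-update uop each.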

Branch Prediction
- Longer pipelines mean higher penalties for mispredicted branches
- Improvements result in added performance and hence less energy spent per instruction retired

Branch Prediction in Pentium M
- Enhanced version of Pentium 4 predictor
- Two branch predictors added that run in tandem with P4 predictor:
  - Loop detector
  - Indirect branch detector
- 20% lower misprediction rate than PIII, resulting in up to 7% gain in real performance

Branch Prediction

Based on diagram found here: http://www.cpuid.org/reviews/PentiumM/index.php

Loop Detector
- A predictor that always branches in a loop will always incorrectly branch on the last iteration
- Detector analyzes branches for loop behavior
- Benefits a wide variety of program types

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
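The loop-exit misprediction described above can be illustrated by comparing an always-taken predictor with a detector that learns the trip count. This is a sketch of the general idea only; the actual Pentium M counter and table organization is not described in the slides, and both classes are hypothetical:

```python
class AlwaysTaken:
    """Baseline: predicts every loop branch as taken."""

    def predict(self):
        return True

    def update(self, taken):
        pass


class LoopDetector:
    """Learns a loop branch's trip count and predicts the exit iteration."""

    def __init__(self):
        self.trip_count = None   # taken-count seen before the last exit
        self.current = 0         # taken-count in the current run

    def predict(self):
        if self.trip_count is not None and self.current == self.trip_count:
            return False         # expect the loop to exit now
        return True

    def update(self, taken):
        if taken:
            self.current += 1
        else:
            self.trip_count = self.current
            self.current = 0


def mispredictions(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch taken 9 times then not taken, with the loop run 5 times.
outcomes = ([True] * 9 + [False]) * 5
print("always taken: ", mispredictions(AlwaysTaken(), outcomes))
print("loop detector:", mispredictions(LoopDetector(), outcomes))
```

The detector still misses the first exit, since it has not yet seen the trip count, but predicts every later exit correctly as long as the count stays the same.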




Indirect Branch Predictor
- Picks targets based on global flow control history
- Benefits programs compiled to branch to calculated addresses





Reservation Station
- Used as a store for µops to wait for their operands and execution units to become available
- Consists of 20 entries
- Control portion of the entry can be written to from one of three ports
- Data portion can be written to from one of 6 available ports
  - 3 for ROB
  - 3 for EU write-backs
- Scheduler then uses this to schedule up to 5 µops at a time
- During pipeline stage 31, entries that are ready for dispatch are then sent to stage 32

Cancellation
- Reservation Station assumes that all cache accesses will be hits
- In the case of a cache miss, micro-ops that are dependent on the write-back data need to be cancelled and rescheduled at a later time
- Can also occur due to a future resource conflict

Retirement
- Takes 2 clock cycles to complete
- Utilizes reorder buffer (ROB) to control retirement or completion of µops
- ROB is a multi-ported register file with separate ports for
  - Allocation-time writes of µop fields needed at retirement
  - Execution Unit write-backs
  - ROB reads of sources for the Reservation Station
  - Retirement logic reads of speculative result data
- Consists of 40 entries with each entry 157 bits wide
- The ROB participates in
  - Speculative execution
  - Register renaming
  - Out-of-order execution

Speculative Execution
- Buffers results of the execution unit before commit
- Allows maximum rate for fetch and execute by assuming that branch prediction is perfect and no exceptions have occurred
- If a misprediction occurs:
  - Speculative results stored in the ROB are immediately discarded
  - Microengine will restart by examining the committed state in the ROB

Register Renaming
- Entries in the ROB that will hold the results of speculative µops are allocated during stage 21 of the pipeline
- In stage 22 the sources for the µops are delivered based upon the allocation in stage 21.
- Data is written to the ROB by the Execution Unit into the renamed register during stage 83

Out-of-order Execution
- Allows µops to complete and write back their results without concern for other µops executing simultaneously
- The ROB reorders the completed µops into the original sequence and updates the architectural state
- Entries in ROB are treated as FIFO during retirement
  - µops are originally allocated in sequential order so the retirement will also follow the original program order
- Happens during pipeline stage 92 and 93
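The FIFO retirement rule can be sketched as follows: µops may finish in any order, but only the oldest ROB entry may retire, and only once it has completed. The sketch ignores the per-cycle retirement width and the 40-entry capacity, and `retire_in_order` is an illustrative name:

```python
from collections import deque


def retire_in_order(program_order, completion_order):
    """Return, for each completion event, the uops retired at that step."""
    rob = deque(program_order)   # allocated sequentially: FIFO = program order
    done = set()
    steps = []
    for uop in completion_order:
        done.add(uop)
        retired = []
        # Only the oldest entry may retire, and only once it has completed.
        while rob and rob[0] in done:
            retired.append(rob.popleft())
        steps.append(retired)
    return steps


steps = retire_in_order(["u0", "u1", "u2", "u3"], ["u2", "u0", "u3", "u1"])
for completed, retired in zip(["u2", "u0", "u3", "u1"], steps):
    print(f"{completed} completes -> retires {retired}")
```

u2 and u3 finish early but wait in the ROB until u1 completes, so the architectural state is still updated in program order.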

Exception Handling
- Events are sent to the ROB by the EU during stage 83
- Results sent to the ROB from the Execution Unit are speculative results, therefore any exceptions encountered may not be real
- If the ROB determines that branch prediction was incorrect, it inserts a clear signal at the point just before the retirement of this operation and then flushes all the speculative operations from the machine
- If speculation is correct, the ROB will invoke the correct microcode exception handler
- All event records are saved to allow the handler to repair the result or invoke the correct macro handler
- Pointers for the macro and micro instructions are also needed to allow the program to resume after completion by the event handler
- If the ROB retires an operation that faults, both the in-order and out-of-order sections are cleared. This happens during pipeline stages 93 and 94

Memory Subsystem
Memory Ordering Buffer (MOB)
 - Execution is out-of-order, but memory accesses cannot just be done in any order
 - Contains mainly the load buffer (LB) and the store buffer (SB)
Speculative loads and stores
 - Not all loads can be speculative
 - E.g. a memory-mapped I/O load could have unrecoverable side effects
 - Stores are never speculative (can't get back overwritten bits)
 - But to improve performance, stores are queued in the store buffer (SB) to allow pending loads to proceed
 - Similar to a write-back cache
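A minimal sketch of the store-buffer idea, assuming a dictionary stands in for memory. The `StoreBuffer` class and its methods are hypothetical, and letting a load read a queued value (store-to-load forwarding) is an assumption added for illustration, not something the slide states. Stores wait in a queue and memory is only overwritten when the queue drains in program order.

```python
class StoreBuffer:
    def __init__(self, memory):
        self.memory = memory   # dict: address -> value (committed state)
        self.pending = []      # queued (address, value) stores, oldest first

    def store(self, addr, value):
        # Queue the store instead of overwriting memory right away.
        self.pending.append((addr, value))

    def load(self, addr):
        # Assumed forwarding: the newest queued store to this address wins,
        # otherwise the load reads committed memory.
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain(self):
        # Commit queued stores to memory in program order.
        for a, v in self.pending:
            self.memory[a] = v
        self.pending.clear()


mem = {0x10: 1}
sb = StoreBuffer(mem)
sb.store(0x10, 7)
print(mem[0x10], sb.load(0x10))  # memory still holds the old value
sb.drain()
print(mem[0x10])
```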





Power Issues
Power use = α * C * V² * F
 - α = activity factor
 - C = effective capacitance
 - V = voltage
 - F = operating frequency
Power use can be reduced linearly by lowering frequency and capacitance and quadratically by scaling voltage
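The linear and quadratic effects can be checked with a few lines of arithmetic. This is a sketch using made-up normalized values (α = 0.5, C = V = F = 1), not measurements. `dynamic_power` is a hypothetical helper that evaluates the formula above.

```python
def dynamic_power(alpha, c, v, f):
    """Dynamic power: alpha * C * V^2 * F."""
    return alpha * c * v ** 2 * f


base = dynamic_power(alpha=0.5, c=1.0, v=1.0, f=1.0)
half_freq = dynamic_power(alpha=0.5, c=1.0, v=1.0, f=0.5)    # frequency halved
lower_volt = dynamic_power(alpha=0.5, c=1.0, v=0.8, f=1.0)   # voltage cut by 20%

print(f"half frequency: {half_freq / base:.2f} of baseline")
print(f"80% voltage:    {lower_volt / base:.2f} of baseline")
```

Halving the frequency halves the power, while a 20% voltage reduction alone already brings it down to about 64%.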

Mobile Use
Mobile use is bursty: full power is only necessary for brief periods
Intel developed SpeedStep technology to take advantage of this fact and reduce power consumption during periods of inactivity

http://www.intel.com/technology/itj/2003/volume07issue02/art05_power/p05_thermal.htm

SpeedStep I and II
SpeedStep I and II used in previous generations
Only two states:
 - High performance (High frequency mode)
 - Lower power use (Low frequency mode)
Problems
 - Slow transition times
 - Limited opportunity for optimization

Pentium M Goals
Optimize for performance when plugged in
Optimize for long battery life when unplugged

Model | Frequency (max / min) | Vcore (max / min)
Pentium M 1.6GHz | 1.6GHz / 600MHz | 1.484V / 0.956V
Pentium M 1.1GHz Low Voltage | 1.1GHz / 600MHz | 1.180V / 0.956V
Pentium M 900MHz Ultra Low Voltage | 900MHz / 600MHz | 1.004V / 0.844V

SpeedStep III
Optimized to fix limitations of previous generations
Three innovations:
 - Voltage-Frequency switching separation
 - Clock partitioning and recovery
 - Event blocking

Freq. | Volt.
1.6GHz | 1.484V
1.4GHz | 1.42V
1.2GHz | 1.276V
1GHz | 1.164V
800MHz | 1.036V
600MHz | 0.956V
The 6 states of the Pentium M 1.6GHz
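To get a feel for what the six states buy, the sketch below compares each one with the top state using the α * C * V² * F relation. It assumes α and C are the same at every operating point and ignores leakage, so the numbers are only rough estimates of relative dynamic power. `STATES` and `relative_power` are names made up for this example.

```python
# (frequency in MHz, core voltage in volts) for the Pentium M 1.6GHz states
STATES = [
    (1600, 1.484),
    (1400, 1.420),
    (1200, 1.276),
    (1000, 1.164),
    (800, 1.036),
    (600, 0.956),
]


def relative_power(freq_mhz, volts, ref=STATES[0]):
    """Dynamic power relative to the reference state (same alpha and C assumed)."""
    ref_freq, ref_volts = ref
    return (volts / ref_volts) ** 2 * (freq_mhz / ref_freq)


for freq, volts in STATES:
    print(f"{freq:>5} MHz  {volts:.3f} V  {relative_power(freq, volts):6.1%}")
```

Under these assumptions the 600MHz state runs at 37.5% of the top frequency but draws only about 16% of the top state's dynamic power, which is the payoff of scaling voltage together with frequency.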

Voltage-Frequency switching separation

Voltage scaling is stepped up and down incrementally
This prevents clock noise and allows the processor to remain responsive during transition
Once voltage target is reached, frequency is throttled
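The sequencing can be sketched as a small planner that lists the points visited during a transition. Several things here are assumptions rather than details from the slide: the 16 mV step size, the function name `plan_transition`, and the ordering used for downward transitions (frequency is dropped first so the core never runs fast at a low voltage).

```python
def plan_transition(cur, target, step_mv=16):
    """Return the (freq_mhz, millivolts) points visited going from cur to target."""
    freq, mv = cur
    t_freq, t_mv = target
    points = [(freq, mv)]
    if t_mv < mv and freq != t_freq:
        # Going down (assumed order): switch frequency before lowering voltage.
        freq = t_freq
        points.append((freq, mv))
    while mv != t_mv:
        # Voltage moves in small increments, never in one jump.
        delta = min(step_mv, abs(t_mv - mv))
        mv += delta if t_mv > mv else -delta
        points.append((freq, mv))
    if freq != t_freq:
        # Going up: the voltage target is reached, now switch the frequency.
        freq = t_freq
        points.append((freq, mv))
    return points


for point in plan_transition((600, 956), (800, 1036)):
    print(point)
```

For the upward move from 600MHz to 800MHz the frequency stays at 600MHz while the voltage climbs, and the frequency changes only in the final step.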

http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm

Clock partitioning and recovery

During transition, only the core clock and phase-locked loop are stopped
This keeps the remaining logic active even while the core clock is stopped


Event blocking
To prevent loss of events during frequency and voltage scaling when the core clock is stopped, interrupts, pin events, and snoop requests are sampled and saved
These events are retransmitted once the core clock becomes available
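In software terms, event blocking behaves like a queue placed in front of the core. The sketch below is an analogy only: `EventBlocker`, its methods and the first-in-first-out replay order are assumptions made for illustration, not a description of the actual hardware.

```python
from collections import deque


class EventBlocker:
    def __init__(self, deliver):
        self.deliver = deliver     # callback that hands an event to the core
        self.clock_running = True
        self.saved = deque()       # events sampled while the clock is stopped

    def stop_clock(self):
        self.clock_running = False

    def signal(self, event):
        if self.clock_running:
            self.deliver(event)
        else:
            self.saved.append(event)   # sample and save instead of losing it

    def start_clock(self):
        self.clock_running = True
        while self.saved:
            # Retransmit saved events, oldest first (assumed order).
            self.deliver(self.saved.popleft())


seen = []
blocker = EventBlocker(seen.append)
blocker.signal("interrupt")
blocker.stop_clock()
blocker.signal("snoop request")
blocker.signal("pin event")
blocker.start_clock()
print(seen)
```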


Leakage
Transistors in off state still draw current
As transistors shrink and clock speed increases, transistors leak more current causing higher temperatures and more power use

Strained Silicon

http://www.research.ibm.com/resources/press/strainedsilicon/

Benefits of Strained Silicon
Electrons flow up to 70% faster due to reduced resistance
This leads to chips which are up to 35% faster, without a decrease in chip size
Intel's "uni-axial" strained silicon process reduces leakage by at least five times without reducing performance; the 65nm process will realize another reduction of at least four times

High-K Transistor Gate Dielectric (coming soon)
The dielectric used since the 1960s, silicon dioxide, is so thin now that leakage is a significant problem
A high-k (high dielectric constant) material has been developed by Intel to replace silicon dioxide
This high-k material reduces leakage by a factor of 100 below silicon dioxide

More Advances to Expect
Continued lowering of capacitance has helped reduce power consumption
Tri-gate transistors decrease leakage by increasing the amount of surface area for electrons to flow through





x86 System Registers
EFLAGS
 - Various system flags
CPUID
 - Exposes type and available features of processor
Model Specific Registers (MSRs)
 - rdmsr and wrmsr
 - Examples
 - Enabling/Disabling SpeedStep
 - Determining and changing voltage/frequency points
 - More
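As an illustration of determining a voltage/frequency point, the sketch below decodes a performance-state value of the kind `rdmsr` would return from `IA32_PERF_STATUS` (0x198) and builds one that `wrmsr` could write to `IA32_PERF_CTL` (0x199). The bit layout is an assumption for early Pentium M parts: bits 15:8 are taken as a frequency multiplier on a 100MHz bus clock, and bits 7:0 as a voltage ID in 16 mV steps above 700 mV. That layout happens to reproduce the six operating points listed earlier for the Pentium M 1.6GHz, but it should be checked against the processor documentation before use. The helper names are hypothetical and no MSR is actually accessed.

```python
IA32_PERF_STATUS = 0x198   # current operating point (read with rdmsr)
IA32_PERF_CTL = 0x199      # requested operating point (written with wrmsr)

BUS_MHZ = 100       # assumed bus clock
VID_BASE_MV = 700   # assumed voltage at VID 0
VID_STEP_MV = 16    # assumed voltage step per VID increment


def decode_perf(value):
    """Split the low 16 bits into (freq_mhz, millivolts)."""
    fid = (value >> 8) & 0xFF
    vid = value & 0xFF
    return fid * BUS_MHZ, VID_BASE_MV + vid * VID_STEP_MV


def encode_perf(freq_mhz, millivolts):
    """Build the 16-bit value for a given operating point."""
    fid = freq_mhz // BUS_MHZ
    vid = (millivolts - VID_BASE_MV) // VID_STEP_MV
    return (fid << 8) | vid


print(decode_perf(0x1031))
print(hex(encode_perf(600, 956)))
```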

Performance Testing
P4 2.2GHz vs. PM 1.6GHz
Specifications | Asus L3C | Pentium-M Notebook
Display Size | 15.1" | 14.1"
Display Resolution | 1400x1050 | 1024x768
CPU | P4-M 2.2GHz | Pentium-M 1.6GHz
Memory Type | PC2100 DDR SDRAM | PC2100 DDR SDRAM
Amount of Memory | 256 MB | 256 MB
Chipset Northbridge | 845MP | "Odem" 855PM
Chipset Southbridge | ICH3-M | ICH4-M
Graphics Controller | ATI Mobility Radeon 7500 (LW)/M7 32MB DDR | NVIDIA GeForce4 440 Go 64MB DDR
CD/DVD ROM | Toshiba SDR2102 (ATA-2) 8x/8x8x24x DVD/CDRW Combo | XX-XXXX (ATA-2) 8x/8x8x24x DVD/CDRW Combo
Hard Disk | IBM Travelstar IC25N020ATCS05-0 ATA-5 20GB/5400rpm/8MB | IBM Travelstar IC25N020ATCS05-0 ATA-5 20GB/5400rpm/8MB
Hard Drive Bay | 2.5", 12.5 mm height | 2.5", 12.5 mm height
Ethernet | Realtek RTL8139 (10/100 Mbit) | 3Com 3C920 (10/100 Mbit)
Modem | HSP 56MR | LT56 ATW
Audio | Intel AC97 | Crystal AC97
Battery Capacity | 59 Wh | 49 Wh

Benchmark

Battery Life

Pentium M vs AMD Turion
Specifications | Ferrari 4005 | TravelMate 8104
Processor | AMD Turion 64 Mobile ML-37 (2.0 GHz, 1MB L2 Cache) | Intel Pentium M Processor 760 (2.0 GHz, 2MB L2 Cache)
FSB/HTT | 1600MHz | 533 MHz
Chipset | ATI Radeon Xpress 200M | Intel 915 PM Express
Wireless | Broadcom 802.11b/g with SpeedBooster, Bluetooth Wireless, IrDA | Intel PRO/Wireless 2915ABG (802.11a/b/g), Bluetooth Wireless, IrDA
LCD | 15.4" WSXGA+ TFT LCD (1680x1050) | 15.4" WSXGA+ TFT LCD (1680x1050)
Hard Drive | 100GB Seagate Momentus 5400RPM 8MB Cache (ST9100823A) | 100GB Seagate Momentus 5400RPM 8MB Cache (ST9100823A)
Memory | 1GB DDR400 SDRAM (2 x 512MB) on Single-Channel Mode, 2.5-3-3-7 | 1GB DDR2-533 SDRAM (2 x 512MB) on Dual-Channel Mode, 4-4-4-12
Graphics | ATI Mobility Radeon X700 128MB PCI-E (358 core/345 mem), Driver version 6.14.10.6546 | ATI Mobility Radeon X700 128MB PCI-E (358 core/345 mem), Driver version 6.14.10.6546
Graphics Interface | S-Video/TV-out/DVI-D | S-Video/TV-out/DVI-D
Optical Drive | Slot-Load DVD-RW Super-Multi Double Layer | Tray-Load DVD-RW Super-Multi Double Layer
Audio | Realtek AC'97 | Realtek High Definition
Audio Interface | Microphone, two stereo speakers, headphone/line-out with SPDIF support | Microphone, two stereo speakers, headphone/line-out with SPDIF support
Weight | 6.3 lbs. with 8-cell battery | 6.3 lbs. with 8-cell battery
Size (W x D x H) | 14.3" x 10.5" x 1.2"-1.4" | 14.3" x 10.5" x 1.2"-1.4"
Operating System | Windows XP Professional w/SP2 | Windows XP Professional w/SP2
Battery | 4,800 mAh | 4,800 mAh

Gaming

Battery Life

Future Processors
Yonah
 - Dual-core processor
 - Manufactured on a 65 nm process
 - Starting at 2.16GHz with a 667 MHz FSB (166MHz quad-pumped)
 - Shared 2MB L2 cache
 - Increased floating point performance with SSE3 instructions
Merom
 - Based on EM64T ISA
 - Consumes ~0.5 W of power, half of what the Dothan consumes
 - Possibility of laptops with 10 hours of battery life