
Intel Pentium® M Processor (Lihu Rappoport, 12/2004; MAMAS, Computer Architecture)


  • Slide 1
  • MAMAS Computer Architecture: Pentium M Processor
      • Lihu Rappoport, 12/2004
      • Based on "The Intel Pentium M Processor: Microarchitecture and Performance", Intel Technology Journal, Q2/2003, http://developer.intel.com/technology/itj/
      • Dr. Lihu Rappoport
  • Slide 2
  • Intel Centrino Mobile Technology
      • Comprised of
          • Pentium M processor
          • Mobile chipset
          • Wireless network connection
      • Enables
          • Integrated wireless LAN capability
          • Highest mobile performance
          • Extended battery life
          • Thinner, lighter designs
      • [Figure labels: Intel Pro/Wireless 2100 Network Connection; ICH4-M; Intel 855 Chipset Family; Intel Pentium M Processor]
  • Slide 3
  • The Intel Pentium M processor
      • Intel's first microprocessor designed specifically for mobility
          • Achieve the best performance at given power and thermal constraints
          • Different power/performance tradeoffs than a traditional high-performance processor
          • Achieve the longest battery life
      • Power dissipation
          • Power generates heat
          • Transistors must be kept within their allowed operating temperature range
          • Heat has to be dissipated in a cost-effective manner
          • This limits the processor's peak power consumption
          • Applies to both desktop and mobile computers
          • The smaller form factor and lighter weight of mobile computers decrease the mobile processor's power budget
      • Battery life
          • Batteries are designed to supply a certain number of watt-hours
          • Higher average power means shorter battery life
          • This limits the processor's average power consumption
          • A crucial factor for mobile computers, but less relevant for desktop computers
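The battery-life point above is simple arithmetic: runtime is the battery's watt-hour capacity divided by the average power drawn. The following is a minimal sketch, assuming a constant average draw for the whole system; the capacity and power figures are hypothetical and are not taken from the slides.

```python
def battery_life_hours(capacity_wh: float, average_power_w: float) -> float:
    """Hours of runtime for a battery of capacity_wh at a constant average draw."""
    if average_power_w <= 0:
        raise ValueError("average power must be positive")
    return capacity_wh / average_power_w


# Hypothetical numbers, chosen only for illustration.
CAPACITY_WH = 48.0
life_at_12w = battery_life_hours(CAPACITY_WH, 12.0)
life_at_16w = battery_life_hours(CAPACITY_WH, 16.0)
print(life_at_12w, life_at_16w)
```

With these assumed numbers, raising the average draw from 12 W to 16 W cuts runtime from 4 hours to 3, which is why average power, rather than peak power, determines battery life.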
  • Slide 4
  • Pentium M: Banias vs. Dothan
                      Banias     Dothan
      Transistors     77M        140M
      Process         130 nm     90 nm
      Die size        84 mm²     85 mm²
      Peak power      24.5 W     21 W
      Frequency       1.7 GHz    2.1 GHz
      L1 cache        32KB I$ + 32KB D$ (both)
      L2 cache        1MB        2MB
  • Slide 5
  • Dothan Die
      • [Figure: die photo, dimensions labeled 6.6 mm and 12.5 mm]
  • Slide 6
  • Higher Performance vs. Longer Battery Life
      • Processor average power is
  • Slide 19
  • Uop Fusion
      • Out-of-order IA32 implementations break instructions into uops
          • A conventional uop consists of a single operation operating on two sources
          • The Instruction Decoder breaks an instruction into multiple uops whenever the instruction operates on more than two sources, or when the nature of the operation requires a sequence of operations
      • Splitting an instruction into multiple uops also takes its toll
          • The increased number of uops creates pressure on resources with limited bandwidth (rename, retire) or limited capacity (ROB, RS)
          • Instructions that are decoded into more than one uop can only be decoded by decoder 0
          • Delivering more uops through the system increases the energy required to complete a given instruction sequence
      • Pentium M features uop fusion
          • The Instruction Decoder fuses two uops into one uop
          • The fused uop is seen as one uop in allocation, renaming, and retirement
          • Fused uops are executed as non-fused operations, which keeps the benefits of non-fused behavior
          • Reduces the performance and energy cost while maintaining the out-of-order execution benefit
          • Provides effectively wider decode, allocation, and retirement
  • Slide 20
  • Uop Fusion (cont.)
      • The uop is fused in some domains and un-fused in others
          • The instruction is decoded into a single fused uop by the decoder
          • The fused uop is allocated, renamed, and issued into a single entry in the ROB and RS
          • Each RS entry can accommodate up to three source operands
      • When dispatching to the execution units
          • The dispatcher controls the execution of each portion of the fused uop according to the readiness of its sources
          • Each portion is treated as if it occupied the whole entry for itself
          • Each portion is executed in the same way as a non-fused uop
      • [Figure labels: Decode; Alloc / RAT; RS; ROB; Exe. Units; Fused uops domain; Un-Fused uops domain]
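The two domains can be illustrated by counting how many entries an instruction stream occupies with and without fusion. This is a sketch under stated assumptions, not a model of the real decoder: the instruction kinds, the `UNFUSED_PARTS` table, and the sample program are all hypothetical.

```python
from dataclasses import dataclass

# Operations each (hypothetical) instruction kind needs when not fused.
UNFUSED_PARTS = {
    "alu": ("op",),
    "load": ("load",),
    "load_op": ("load", "op"),
    "store": ("store_address", "store_data"),
}


@dataclass(frozen=True)
class Uop:
    parts: tuple  # operations carried by this uop; more than one means fused


def decode(kinds, fusion):
    """Decode instruction kinds into uops, fusing multi-part ones if enabled."""
    uops = []
    for kind in kinds:
        parts = UNFUSED_PARTS[kind]
        if fusion:
            uops.append(Uop(parts))                   # one entry per instruction
        else:
            uops.extend(Uop((p,)) for p in parts)     # one entry per operation
    return uops


def count(kinds, fusion):
    """Return (entries in the fused domain, operations dispatched to execution)."""
    uops = decode(kinds, fusion)
    entries = len(uops)                               # allocation, ROB/RS, retirement
    dispatched = sum(len(u.parts) for u in uops)      # un-fused domain
    return entries, dispatched


program = ["load_op", "alu", "store", "alu", "load", "store"]
unfused = count(program, fusion=False)
fused = count(program, fusion=True)
print(unfused, fused)
```

For this sample stream, fusion lowers the number of allocated entries from 9 to 6 while the number of operations sent to the execution units stays at 9, matching the slide's point that execution is unchanged.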
  • Slide 21
  • Fused Store
      • A store instruction is decoded as two independent uops
          • store-address: calculates the address of the store
          • store-data: stores the data into the Store Data buffer
          • The actual write to memory is done when the store retires
      • Separating store-data and store-address is important for memory disambiguation
          • Allows store-address to dispatch earlier, even before the stored data is known
          • Address conflicts are resolved earlier, which opens the memory pipeline for other loads
      • store-data and store-address can be issued to the execution units in parallel
          • store-address is dispatched to the AGU when its sources (base and index registers) are ready
          • store-data is dispatched to the store data buffer unit independently, when its source operand is available
      • A fused store can retire only after both operations complete
      • [Figure: flow from "Decoded and renamed fused store uop" to "Dispatch Store Address" and "Dispatch Store Data" (each with "Save faults in Register File"), then "Retire values when both operations completed"]
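The independent dispatch of the two portions of a fused store can be sketched as follows. This is an illustrative toy model that assumes each portion completes in the cycle it dispatches; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FusedStore:
    address_ready_cycle: int   # base and index registers become available
    data_ready_cycle: int      # value to be stored becomes available
    address_done: bool = False
    data_done: bool = False

    def tick(self, cycle):
        """Dispatch whichever portions have their sources ready this cycle."""
        dispatched = []
        if not self.address_done and cycle >= self.address_ready_cycle:
            self.address_done = True
            dispatched.append("store-address")
        if not self.data_done and cycle >= self.data_ready_cycle:
            self.data_done = True
            dispatched.append("store-data")
        return dispatched

    def can_retire(self):
        # The fused uop retires only after both portions have completed.
        return self.address_done and self.data_done


def run(store, max_cycles=100):
    """Step the store cycle by cycle and log (cycle, event) pairs."""
    log = []
    for cycle in range(max_cycles):
        for name in store.tick(cycle):
            log.append((cycle, name))
        if store.can_retire():
            log.append((cycle, "retire-eligible"))
            break
    return log


log = run(FusedStore(address_ready_cycle=2, data_ready_cycle=7))
print(log)
```

In this run the address portion dispatches at cycle 2, five cycles before the data is known, so later loads could already be checked against the store's address; the fused uop becomes eligible to retire only at cycle 7.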
  • Slide 22
  • Fused Load-Op
      • A load-op (read-modify) instruction consists of two uops
          • Load: reads the operand from an address in memory
          • Op: calculates the result based on the first operand and a register operand, and writes the result to a register
      • A load-op instruction may have up to three register operands, so it must be implemented by two uops
      • The two operations are inherently serial
          • The op cannot start until the load completes
      • The load and the op are issued serially to the relevant execution units
          • The load is dispatched when its sources (base and index registers) are ready
          • The op can be dispatched only after the load completes and the other operand is ready
      • A fused load-op uop can retire only after both operations complete
      • [Figure: flow from "Decode and rename load-op instruction into fused uop" to "Dispatch Load" (Save faults in Register File), then "Dispatch Op" (Save faults in Register File), then "Retire values when both operations completed"]
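The serial dependence inside a fused load-op can be written as a small timing calculation. This is a sketch only: the function name is hypothetical, and the latencies (`load_latency=3`, `op_latency=1`) are assumed values, not Pentium M figures.

```python
def schedule_load_op(address_ready, register_ready, load_latency=3, op_latency=1):
    """Return dispatch cycles for both portions and the retire-eligible cycle.

    load_latency and op_latency are assumed values for illustration.
    """
    load_dispatch = address_ready                      # base/index registers ready
    load_complete = load_dispatch + load_latency
    # The op needs both the loaded value and the register operand.
    op_dispatch = max(load_complete, register_ready)
    op_complete = op_dispatch + op_latency
    return {
        "load_dispatch": load_dispatch,
        "op_dispatch": op_dispatch,
        "retire_eligible": op_complete,
    }


early_register = schedule_load_op(address_ready=2, register_ready=1)
late_register = schedule_load_op(address_ready=0, register_ready=10)
print(early_register)
print(late_register)
```

In the first case the op waits on the load; in the second the load finishes early and the op waits on the register operand. Either way the op never dispatches before the load completes.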
  • Slide 23
  • Uop Fusion: Best of all Worlds
      • [Figure labels: Decoder; add eax, dword ptr data; Scheduler; Cache; ALU; LD; OP]
  • Slide 24
  • Uop Fusion: Best of all Worlds
      • Micro-op fusion enables effective machine utilization
      • Independent uop out-of-order/superscalar execution
      • Achieves a micro-op reduction of more than 10%
      • [Figure labels: Decoder; add eax, dword ptr data; Scheduler; Cache; ALU; LD + OP; LD; OP]
  • Slide 25
  • Uop Fusion Performance
      • Uop fusion reduces the number of uops handled by the out-of-order logic by more than 10%
      • Increases performance by effectively widening issue, rename, and retire
          • The biggest boost is obtained during bursts of memory operations
          • All decoders can decode these instructions (instead of only decoder 0)
          • Practically widens the processor decode, allocation, and retirement bandwidth by a factor of three
      • The typical performance increase from uop fusion
          • Integer code: 5%, most of it from store fusion
          • FP code: 9%, equally from the two types of fused uops
      • Delivering fewer uops through the processor decreases the energy required to complete a given instruction sequence
          • The same task is accomplished by processing fewer uops
      • The net power effect is positive
          • More power is saved than is added by the uop fusion logic
  • Slide 26
  • Idle Periods Prediction
      • Predict idle periods and instruct units to reduce power
          • Either by shutting off their clocks or by disabling parts of their logic
          • Resume operation seamlessly, with no performance penalty
      • Power predictor example: the Allocate stall predictor
          • Whenever the ROB is full, the Allocator stalls the pipeline
          • The Allocator cannot tell whether the ROB will remain full on the next cycle
          • It needs to re-evaluate the stall condition every cycle
          • It turns out that in many cases when the ROB is full, it stays full for very long periods
      • The predictor collects information from the ROB and other units
          • To predict the nature of the next cycle
          • It instructs the Allocator to continue stalling and shut off its clocks
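The Allocate stall predictor idea can be sketched with a toy trace. The slides do not say which signals the real predictor uses, so the rule below (the ROB stays full if it is full now and its oldest uop has not finished) and the trace itself are assumptions made for illustration.

```python
def simulate(trace):
    """Count allocator cycles spent evaluating versus clock-gated.

    trace: list of (rob_full, oldest_uop_done) tuples, one per cycle.
    Returns (evaluated_cycles, gated_cycles).
    """
    evaluated = 0
    gated = 0
    gate_next = False
    for rob_full, oldest_uop_done in trace:
        if gate_next:
            gated += 1        # clocks off: the stall was predicted in advance
        else:
            evaluated += 1    # allocator re-evaluates the stall condition
        # Assumed prediction rule: nothing can retire at the end of this
        # cycle, so a full ROB will still be full on the next cycle.
        gate_next = rob_full and not oldest_uop_done
    return evaluated, gated


# Hypothetical trace: the ROB fills at cycle 1 and drains after cycle 4.
trace = [
    (False, False),
    (True, False),
    (True, False),
    (True, False),
    (True, True),
    (False, False),
]
evaluated_cycles, gated_cycles = simulate(trace)
print(evaluated_cycles, gated_cycles)
```

In this trace the allocator is gated for 3 of 6 cycles. Because the assumed rule only gates when the stall is certain to persist, no allocation opportunity is lost, which corresponds to the "no performance penalty" requirement.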
  • Slide 27
  • Execution Units Stacking
      • Identify and activate only the parts of the processor needed for a specific operation
      • EUs attached to an execution port share the same source bus wires
          • Drive only the wires that belong to the target EU
      • EUs are divided into a few segments (stacks)
          • Special logic controls the data flow to each stack according to its actual destination
  • Slide 28
  • Early Identification of EU Width
      • IA32 processors operate on data types with different widths
          • Integer operations, operating on 32 bits (the most common)
          • Floating-point operations, operating on 80 bits
          • Multimedia operations, operating on 64 bits or 128 bits
      • Toggling a wider bus and reading from a bigger register file consumes more power than is actually required
      • Integer operations are identified in advance
          • Narrower buses are used to and from the EU during dispatch and write-back
          • Renaming logic that is unused for integer operations is not activated
      • Effectively transforms the processor into a 32-bit machine
          • Only the resources needed for integer operations are used while operating on integers
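A rough way to see the benefit of identifying integer operations early is to count the bus bits driven per operation. This is a crude sketch: bits driven is only a proxy for power, the operation mix is hypothetical, and the model narrows only integer operations, as the slide describes.

```python
INT_BITS = 32     # integer operand width from the slide
FULL_BITS = 128   # widest data type mentioned on the slide


def bus_bits_driven(ops, width_aware):
    """Total bus bits driven for a sequence of operation kinds.

    With width_aware=True, operations tagged "int" drive only INT_BITS;
    every other operation drives the full-width bus.
    """
    total = 0
    for op in ops:
        if width_aware and op == "int":
            total += INT_BITS
        else:
            total += FULL_BITS
    return total


# Hypothetical mix: mostly integer work with one FP and one multimedia op.
ops = ["int"] * 8 + ["fp", "mm128"]
always_full = bus_bits_driven(ops, width_aware=False)
with_early_id = bus_bits_driven(ops, width_aware=True)
print(always_full, with_early_id)
```

For this assumed mix the driven bits drop from 1280 to 512, since the eight integer operations no longer toggle the upper 96 bits of the bus.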
  • Slide 29
  • Backup
  • Slide 30
  • Performance/Power Tradeoff Zones
      • [Figure: chart with regions labeled Performance loss | Performance gain and Power Gain | Power Loss]
