Intel® Pentium® M Processor
Lihu Rappoport, 12/2004

MAMAS – Computer Architecture

Pentium® M Processor

Based on "The Intel® Pentium® M Processor: Microarchitecture and Performance"

Intel Technology Journal, Q2/2003, http://developer.intel.com/technology/itj/

Dr. Lihu Rappoport


Intel® Centrino™ Mobile Technology

Comprised of
– Pentium® M processor
– Mobile chipset
– Wireless network connection

Enables
– Integrated wireless LAN capability
– Highest mobile performance
– Extended battery life
– Thinner, lighter designs

[Figure: platform block diagram showing the Intel® Pentium® M Processor, the Intel® 855 Chipset Family, the ICH4-M, and the Intel® PRO/Wireless 2100 Network Connection]


The Intel Pentium® M processor

Intel’s first microprocessor designed specifically for mobility
– Achieve best performance at given power and thermal constraints
  Different power/performance tradeoffs than a traditional high-performance processor
– Achieve longest battery life

Power dissipation
– Power generates heat
– Transistors must be kept within their allowed operating temperature range
  Heat has to be dissipated in a cost-effective manner
– Limits the processor’s peak power consumption
  Applies to both desktop and mobile computers
  A mobile computer’s smaller form factor and lighter weight reduce the mobile processor’s power budget

Battery life
– Batteries are designed to supply a certain number of watt-hours
– Higher average power → shorter battery life
– Limits the processor’s average power consumption
– Crucial factor for mobile computers, but less relevant for desktop computers


Pentium® M

                Banias                Dothan
Transistors     77M                   140M
Process         130 nm                90 nm
Die size        84 mm²                85 mm²
Peak power      24.5 W                21 W
Frequency       1.7 GHz               2.1 GHz
L1 cache        32KB I$ + 32KB D$     32KB I$ + 32KB D$
L2 cache        1MB                   2MB


Dothan Die

[Figure: Dothan die, 6.6 mm × 12.5 mm]


Higher Performance vs. Longer Battery Life

Processor average power is <10% of the platform
– The majority of platform power is consumed by other components: LCD, hard disk, memory, and others
– The processor reduces power in periods of low processor activity
– The processor enters lower power states in idle periods

Even an ideal processor can extend battery life by 11% at most!

Decision:
– Optimize for performance when active
– Optimize for battery life when idle

Platform power breakdown (chart):
Display (panel + inverter)  33%
CPU                         10%
Power supply                10%
Intel® MCH                   9%
Misc.                        8%
GFX                          8%
HDD                          8%
CLK                          5%
Intel® ICH                   3%
DVD                          2%
LAN                          2%
Fan                          2%

Source: 2004 Extended Battery Life Technologies, Don J. Nguyen, Intel Developer Forum, Spring 2003
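The "11% at most" figure follows from the 10% CPU share of platform power. The sketch below reproduces that arithmetic, assuming battery life is simply battery energy divided by average platform power; the function name is illustrative and not taken from the slides.

```python
def max_battery_life_gain(cpu_share: float) -> float:
    """Upper bound on the relative battery-life gain if the CPU drew no power.

    Battery life = battery energy / average platform power, so removing a
    component that accounts for `cpu_share` of platform power scales
    battery life by 1 / (1 - cpu_share).
    """
    if not 0.0 <= cpu_share < 1.0:
        raise ValueError("cpu_share must be in [0, 1)")
    return 1.0 / (1.0 - cpu_share) - 1.0


# CPU share of 10%, as in the platform power breakdown.
gain = max_battery_life_gain(0.10)
print(f"{gain:.1%}")  # 11.1%
```

Because the bound is so small, the slide's decision to optimize the active processor for performance rather than average power is consistent with this estimate.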


Static Power

The power consumed by a processor consists of
– Active power: used to switch transistors
– Static power: leakage of transistors under voltage

Static power is a function of
– Number of transistors and their type
– Operating voltage
– Die temperature

Leakage is growing dramatically
– Reaching 20% in current process technology, and growing

Pentium® M reduces static power consumption
– The L2 cache is built with low-leakage transistors
  L2 is 2/3 of the die transistors
  Low-leakage transistors are slower, increasing cache access latency
  The significant power saved justifies the small performance loss
– Enhanced SpeedStep® technology
  Reduces voltage (and temperature), hence leakage, when processor activity is low


Active Power

Power is consumed when capacitance is charged or discharged
– A signal changing 0→1 or 1→0
– The capacitance can be on transistor gates and on wires

Power = αCV²f
– α: activity, C: capacitance, V: voltage, f: frequency
– Measured in watts

Higher power → higher current and higher temperature
– Peak power cannot exceed the thermal constraints

Power density
– Measured in watts/cm²
– Denser power is harder to cool
– Increases with every process technology generation: higher power at a smaller die size
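As a small illustration of Power = αCV²f, the sketch below evaluates the formula at two voltages with everything else held fixed. The activity factor, capacitance, voltage, and frequency values are made-up placeholders chosen only to show the quadratic dependence on voltage; they are not Pentium M data.

```python
def active_power(alpha: float, capacitance_f: float,
                 voltage_v: float, freq_hz: float) -> float:
    """Active (switching) power in watts: alpha * C * V^2 * f."""
    return alpha * capacitance_f * voltage_v ** 2 * freq_hz


# Hypothetical values: 20% activity, 40 nF switched capacitance, 1.6 GHz.
ALPHA, CAP, FREQ = 0.2, 40e-9, 1.6e9

base = active_power(ALPHA, CAP, 1.40, FREQ)
lower_v = active_power(ALPHA, CAP, 1.26, FREQ)  # 10% lower voltage

# A 10% voltage reduction at the same frequency leaves 0.9^2 of the power.
print(f"{lower_v / base:.2f}")  # 0.81
```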


Energy & Average Power

Energy = total of all switching energy and leakage waste
– Measured in either joules or watt-hours

Average power = total energy / total time
– Including low-activity and idle time

Typical figures (leading-edge processors)
– Average power: 1W-3W
– Peak power: 20W-100W


Optimize for Performance

Goal: maximize performance at given thermal constraints
– Approximated by: maximizing performance at a given power budget

Processor power at a given voltage V₀ and frequency f₀:

P₀ = α·C·V₀²·f₀

Frequency is approximated as linearly proportional to voltage:

f₀ = Kf × V₀

This leads to a cubic dependency of power on voltage:

P₀ = α·C·Kf·V₀³

The test:
“A micro-architectural feature that gains performance or saves power should be better than simply using voltage/frequency scaling”

It can be shown that this gives the right performance/power tradeoff:
1% more performance for less than 3% more power is a gain!
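The 1%-for-3% rule can be checked numerically. The sketch below assumes, as the slide does, that performance scales with frequency and frequency scales linearly with voltage, so power grows with the cube of the performance ratio; both function names are illustrative.

```python
def power_cost_of_scaling(perf_gain: float) -> float:
    """Relative power increase needed to buy `perf_gain` with V/f scaling.

    Assumes performance ~ f and f ~ V, hence P ~ V^3 ~ performance^3.
    """
    return (1.0 + perf_gain) ** 3 - 1.0


def beats_voltage_scaling(perf_gain: float, power_cost: float) -> bool:
    """True if a feature is cheaper in power than V/f scaling for the same gain."""
    return power_cost < power_cost_of_scaling(perf_gain)


print(f"{power_cost_of_scaling(0.01):.2%}")  # 3.03%
print(beats_voltage_scaling(0.01, 0.02))     # True: 1% perf for 2% power
print(beats_voltage_scaling(0.01, 0.04))     # False: 1% perf for 4% power
```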


“Less is More”

Fewer instructions per task
– Advanced branch prediction reduces the number of wrong-path instructions executed
  Branch predictor logic consumes power, but the net gain is still positive
– SSE instructions reduce the number of instructions architecturally

Fewer uops per instruction
– Uop fusion
– Dedicated stack engine

Fewer transistor switches per micro-op
– Efficient bus
– Various lower-level optimizations

Less energy per transistor switch
– Enhanced SpeedStep® technology

Power-awareness top to bottom


Loop predictor

Pentium® M employs best-in-class branch prediction
– Bimodal predictor, global predictor, loop detector
– Indirect branch predictor

Loop predictor: analyzes branches for loop behavior
– Moving in one direction (taken or not taken) a fixed number of times
– Ending with a single movement in the opposite direction

When such a branch is detected
– A set of counters is allocated
– The loop is predicted completely accurately
– This also works for iteration counts larger than those captured by global or local predictors

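[Figure: loop predictor entry with Count, Limit, and Prediction fields; Count is incremented (+1), reset to 0, and compared (=) with Limit]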
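The loop-predictor behavior described above can be sketched with a single count/limit pair per branch. This is a simplified model under stated assumptions, not the actual hardware design: it assumes the loop direction is "taken", learns the limit at the first exit, and ignores allocation and confidence logic; the class and function names are hypothetical.

```python
class LoopPredictorEntry:
    """One loop-predictor entry: a learned Limit and a running Count."""

    def __init__(self) -> None:
        self.limit = None  # taken outcomes seen before the last loop exit
        self.count = 0     # taken outcomes seen in the current loop instance

    def predict(self) -> bool:
        """Return True for 'taken' (stay in the loop), False for the exit."""
        if self.limit is None:
            return True  # untrained: assume the loop continues
        return self.count != self.limit

    def update(self, taken: bool) -> None:
        if taken:
            self.count += 1
        else:
            self.limit = self.count  # learn the trip count at the exit
            self.count = 0


def count_mispredictions(trip_count: int, instances: int) -> int:
    """Run a loop with a fixed trip count several times; count wrong predictions."""
    entry = LoopPredictorEntry()
    misses = 0
    for _ in range(instances):
        for taken in [True] * trip_count + [False]:
            if entry.predict() != taken:
                misses += 1
            entry.update(taken)
    return misses


# Only the very first loop exit is mispredicted; later instances are exact.
print(count_mispredictions(trip_count=5, instances=3))  # 1
```

The same result holds for a trip count of 100, which is the case the slide highlights: counts too long for a global or local history to capture.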


Indirect Branch Predictor

The target of an indirect branch is data dependent
– Some indirect branches still have a single target at run time
– Some have many targets
  E.g., a case statement in a Java byte-code interpreter

Indirect branches are heavily used in object-oriented code (C++, Java) and have become a growing source of branch mispredictions

An indirect branch is resolved at execution → high misprediction penalty

A dedicated indirect branch target predictor (iTA)
– Chooses targets based on a global history
– Similar to the global conditional branch predictor

Initially, an indirect branch is allocated only in the target array (TA)

If the target of an indirect branch is mispredicted by the TA
– Allocate an entry in the iTA corresponding to the global history leading to this instance of the indirect branch
– Monotonic indirect branches are still predicted by the TA
– Data-dependent indirect branches allocate as many targets as needed


Indirect Branch Predictor (cont.)

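Prediction from the iTA is used if
– The TA indicates an indirect branch
– The iTA hits for the current global history

An iTA hit by itself does not qualify a branch as indirect

[Figure: the branch IP indexes the Target Array, and the branch IP together with the global history indexes the Indirect Target Predictor; a MUX selects the Predicted Target, taking the iTA target when the TA hit indicates an indirect branch and the iTA hits]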
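The TA/iTA selection and allocation rules can be sketched as two lookup tables. This is a toy model under loud assumptions: the tables are unbounded dictionaries, the TA keeps the first target it saw, and the global history is a hypothetical 4-bit hash of recent targets rather than the real history register; none of these details come from the slides.

```python
class IndirectPredictor:
    """Toy TA + iTA model for indirect branch targets."""

    def __init__(self, history_bits: int = 4) -> None:
        self.ta = {}    # branch IP -> target (allocated first)
        self.ita = {}   # (branch IP, global history) -> target
        self.history = 0
        self.mask = (1 << history_bits) - 1

    def predict(self, ip: int):
        if ip not in self.ta:
            return None  # TA miss: branch not known as indirect
        # An iTA hit is used only because the TA identified the branch.
        return self.ita.get((ip, self.history), self.ta[ip])

    def update(self, ip: int, target: int) -> bool:
        """Train on the resolved target; return True if the prediction was right."""
        predicted = self.predict(ip)
        if ip not in self.ta:
            self.ta[ip] = target                    # allocate in the TA only
        elif predicted != target:
            self.ita[(ip, self.history)] = target   # allocate on a mispredict
        # Hypothetical history update: fold in two low bits of the target.
        self.history = ((self.history << 2) ^ (target & 0x3)) & self.mask
        return predicted == target


# A data-dependent branch alternating between two targets.
predictor = IndirectPredictor()
results = [predictor.update(0x300, t) for t in [0x401, 0x402] * 10]
print(results[-10:].count(True))  # 10: fully predicted after warm-up
```

A monotonic branch, by contrast, never mispredicts after its first execution in this model, so it never allocates an iTA entry, matching the slide's point that such branches stay in the TA.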


Dedicated Stack Engine

IA32 has HW-assisted stack management instructions
– Push: ESP -= src_size; MEM[ESP] ← src;
– Pop: dst ← MEM[ESP]; ESP += dst_size;
– Call: ESP -= 4; MEM[ESP] ← EIP; EIP ← addr;
– Ret: EIP ← MEM[ESP]; ESP += 4;

Sequences of such instructions are quite common
– E.g., PUSHing a set of operands and then CALLing a function

An additional uop updates the ESP register
– This uop adds an immediate value to, or subtracts it from, the ESP register


Dedicated Stack Engine

Pentium® M uses dedicated logic near the decoders to update ESP

The programmer’s view of ESP (ESPP) is represented by
– ESPO: a historic ESP living in the out-of-order execution core
– ESPD: a delta maintained in the front end

ESPP := ESPO + ESPD

When a sequence of PUSHes and POPs is encountered
– The accumulated delta value is passed across the decoders and updates ESPD
– The ESPD value is patched into the address syllable of stack-referencing uops, so the AGU can calculate the proper memory location referenced by ESPP


Dedicated Stack Engine

ESPD lives in the front end, so its calculations are speculative
– Need to be able to recover the ESPD and ESPO values in case of a flush
– A dedicated table saves the ESPD value for every instruction
– ESPO is maintained by the OOO core like any other general-purpose register
– ESPP can be recovered for all instructions
  Either pre- or post-execution
  This allows handling faults or traps as defined in IA32

The architectural value of ESP may be needed in the OOO core
– E.g., when ESP is used in an address syllable, or in “XOR ESP,3”
– Decode logic inserts a sync uop that carries out the ESPP calculation
– Following a sync uop, ESPD is cleared
  The architectural value is now coherent
– A sync is not generated when the ESPD register is zero
  Continued usage of ESP as a general-purpose register has no ill effects
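The ESPO/ESPD bookkeeping can be sketched as a small front-end model. This is a simplification under stated assumptions: it tracks only the two values and the uops emitted, applies the sync immediately instead of executing it in the out-of-order core, and omits the recovery table; the class name and uop tuples are hypothetical.

```python
class StackEngine:
    """Toy model of the front-end stack engine (ESP_P = ESP_O + ESP_D)."""

    def __init__(self, esp: int) -> None:
        self.esp_o = esp   # ESP value held in the out-of-order core
        self.esp_d = 0     # delta accumulated near the decoders
        self.uops = []     # uops sent to the out-of-order machine

    @property
    def esp_p(self) -> int:
        """The programmer-visible ESP."""
        return self.esp_o + self.esp_d

    def push(self, size: int = 4) -> None:
        self.esp_d -= size
        # The delta is patched into the address; no ESP-update uop is emitted.
        self.uops.append(("store", f"[ESP_O{self.esp_d:+d}]"))

    def pop(self, size: int = 4) -> None:
        self.uops.append(("load", f"[ESP_O{self.esp_d:+d}]"))
        self.esp_d += size

    def use_esp(self, op: str) -> None:
        """An instruction that needs the architectural ESP in the core."""
        if self.esp_d != 0:
            self.uops.append(("sync", f"ESP_O += {self.esp_d}"))
            self.esp_o += self.esp_d
            self.esp_d = 0
        self.uops.append((op, "ESP_O"))


engine = StackEngine(0x1000)
for _ in range(3):
    engine.push()
engine.use_esp("xor")   # delta is -12, so one sync uop is inserted
engine.use_esp("xor")   # delta is already zero, so no sync is needed
print([kind for kind, _ in engine.uops])
# ['store', 'store', 'store', 'sync', 'xor', 'xor']
```

Three pushes produce three store uops that all address relative to the same unchanged ESPO, which is why they carry no dependency on one another through ESP.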


Dedicated Stack Engine Benefits

Dependencies on ESP are removed
– The ESPO value used for scheduling in the out-of-order machine is not changed during a sequence of stack operations
– The stack operations can be executed in parallel

ESPD updates are done using a small dedicated adder
– Frees the general execution units to work on other uops
  Effectively increases execution bandwidth
– Saves power: dedicated adders take less power than execution units

ESP update uops are eliminated from the out-of-order machine
– Typically eliminates 5% of the uops (including the ESP sync uops)
  Effectively increases decode bandwidth: this is the major performance gain
  Effectively increases ROB and RS size
– Saves power: eliminated uops don’t toggle bits throughout the machine
  Energy per instruction decreases


Uop Fusion

Out-of-order IA32 implementations break instructions into uops
– A conventional uop consists of a single operation operating on two sources

The instruction decoder breaks an instruction into multiple uops
– whenever the instruction operates on more than two sources, or
– when the nature of the operation requires a sequence of operations

Splitting the instruction into multiple uops also takes its toll
– The increased number of uops creates pressure on resources with limited bandwidth (rename, retire) or limited capacity (ROB, RS)
– Instructions that are decoded into >1 uop can only be decoded by decoder 0
– Delivering more uops through the system increases the energy required to complete a given instruction sequence

Pentium® M features uop fusion
– The instruction decoder fuses two uops into one uop
– The fused uop is seen as 1 uop in allocation, dispatch, and retirement
– Fused uops are executed as non-fused operations
  Maintains the benefits of non-fused behavior
– Reduces performance and energy cost while maintaining the OOOE benefit
  Provides effectively wider decode, allocation, and retirement


Uop Fusion (cont.)

The uop is fused in some domains and un-fused in others
– The instruction is decoded into a single fused uop by the decoder
– The fused uop is allocated, renamed, and issued into a single entry in the ROB and RS
  Each RS entry can accommodate up to three source operands

When dispatching to the execution units
– The dispatcher controls the execution of each portion of the fused uop according to the readiness of its sources
– Each portion is treated as if it occupied the whole entry by itself
  Executed in the same way as a non-fused uop

[Figure: pipeline domains. Fused uops domain: Decode, Alloc / RAT, RS, ROB. Un-fused uops domain: execution units]


Fused Store

A store instruction is decoded as two independent uops
– store-address: calculates the address of the store
– store-data: stores the data into the Store Data buffer

The actual write to memory is done when the store retires

Separating store-data and store-address is important for memory disambiguation
– Allows store-address to dispatch earlier, even before the stored data is known
– Address conflicts are resolved earlier, which opens the memory pipeline for other loads

store-data and store-address can be issued to the execution units in parallel
– Store-address is dispatched to the AGU when its sources (base and index registers) are ready
– Store-data is dispatched to the store data buffer unit independently, when its source operand is available

A fused store can retire only after both operations complete

[Figure: flow of a fused store uop: decoded and renamed as a fused store uop; store-address dispatched (faults saved in the register file); store-data dispatched (faults saved in the register file); values retired when both operations have completed]


Fused Load-Op

A load-op (read-modify) instruction consists of two uops
– Load: reads the operand from an address in memory
– Op: calculates the result from the loaded operand and a register operand (and writes the result to a register)

A load-op instruction may have up to 3 register operands
– so it must be implemented by two uops

The two operations are inherently serial
– The op cannot start until the load completes

The load and the op are issued serially to the relevant execution units
– The load is dispatched when its sources (base and index registers) are ready
– The op can be dispatched only after the load completes and the other operand is ready

A fused load-op uop can retire only after both operations complete

[Figure: flow of a fused load-op uop: load-op instruction decoded and renamed into a fused uop; load dispatched (faults saved in the register file); op dispatched (faults saved in the register file); values retired when both operations have completed]


Uop Fusion – Best of all Worlds

[Figure, without fusion: the decoder turns "add eax, dword ptr data" into separate LD and OP uops; the scheduler holds both and dispatches LD to the cache and OP to the ALU]

[Figure, with fusion: the decoder emits a single fused LD + OP uop that occupies one scheduler entry; LD is still dispatched to the cache and OP to the ALU as independent uops (out-of-order, superscalar execution)]

Micro-op fusion enables effective machine utilization

Achieving >10% micro-op reduction


Uop Fusion Performance

Uop fusion reduces the number of uops handled by the OOO logic by >10%
– Increases performance by effectively widening issue, rename, and retire

The biggest boost is obtained during bursts of memory operations
– All decoders can decode instructions (instead of only decoder 0)
– Practically widens the processor decode, allocation, and retirement bandwidth by a factor of three

The typical performance increase from uop fusion
– Integer code: 5%, most of it from store fusion
– FP code: 9%, equally from the two types of fused uops

Delivering fewer uops through the processor decreases the energy required to complete a given instruction sequence
– The same task is accomplished by processing fewer uops

The net power effect is positive
– More power is saved than is added by the uop fusion logic
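The "factor of three" during bursts of memory operations can be illustrated with a toy in-order decode model. It assumes three decoders, with only decoder 0 accepting instructions that produce more than one uop, and ignores every other front-end limit; the function name and the burst of nine stores are illustrative assumptions.

```python
def decode_cycles(uops_per_instr, n_decoders: int = 3) -> int:
    """Cycles needed to decode an instruction sequence in program order.

    Decoder 0 accepts any instruction; the remaining decoders accept only
    instructions that decode into a single uop.
    """
    cycles = 0
    i = 0
    n = len(uops_per_instr)
    while i < n:
        cycles += 1
        i += 1  # decoder 0 takes the next instruction, whatever its uop count
        for _ in range(n_decoders - 1):
            if i < n and uops_per_instr[i] == 1:
                i += 1  # a simple decoder takes a single-uop instruction
            else:
                break   # a multi-uop instruction must wait for decoder 0
    return cycles


burst = 9  # a burst of nine store instructions
print(decode_cycles([2] * burst))  # 9: un-fused stores (2 uops each)
print(decode_cycles([1] * burst))  # 3: fused stores (1 uop each)
```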


Idle Periods Prediction

Predict idle periods and instruct units to reduce power
– Either by shutting off their clocks or by disabling parts of their logic
– Resume operation seamlessly with no performance penalty

Power predictor example: the Allocate stall predictor
– Whenever the ROB is full, the Allocator stalls the pipeline
– The Allocator cannot tell if the ROB will remain full on the next cycle
  It needs to re-evaluate the stall condition every cycle
– It turns out that in many cases when the ROB is full, it stays so for very long periods
– The predictor collects information from the ROB and other units
  To predict the nature of the next cycle
  It instructs the Allocator to continue stalling and shut off its clocks


Execution Units Stacking

Identify and activate only the parts of the processor needed for a specific operation
– EUs attached to an execution port share the same source bus wires
– Drive only the wires that belong to the target EU
– EUs are divided into a few segments (stacks)
  Special logic controls the data flow to each stack according to its actual destination


Early identification of EU width

IA32 processors operate on data types with different widths
– Integer operations, operating on 32 bits (the most common)
– Floating-point operations, operating on 80 bits
– Multimedia operations, operating on 64 bits or 128 bits

Toggling a wider bus and reading from a bigger register file consumes more power than is actually required

Integer operations are identified in advance
– Narrower buses are used to and from the EU during dispatch and write-back
– Renaming logic unused by integer operations is not activated

Effectively transforms the processor into a 32-bit machine
– Utilizes only the resources needed for integer operations while operating on integers


Backup


Performance/Power Tradeoff Zones

[Figure: horizontal axis, performance loss | performance gain (-60% to 100%); vertical axis, power gain | power loss (-60% to 100%). The Energy Breakeven line and the Constrained-Performance Breakeven line divide the plane into zones: Energy Loss / Constrained Perf Loss (wrong trade-off zone), Energy Loss / Constrained Perf Gain, Energy Gain / Constrained Perf Loss, and a zone labeled Gain]


The Pentium® M Bus

Power saving is achieved by protocol and circuit methods

The bus supports a 100MHz bus clock with a data rate of 400M transfers/sec

It is a latched bus with an in-order queue of 8 pipelined transactions

The bus is optimized for a mobile-processor environment
– Supports only uni-processor systems
  A mobile system’s power budget cannot support dual processors anyway
– Only 32 address bits, which cover 4GB of physical address space

The bus saves power aggressively when idle
– It controls its input buffers’ sense amplifiers, which sample the activity on the bus
– When the bus is idle, the sense amplifiers are disabled and do not consume any power
– When the bus is active and address and data are driven on the bus, the input buffers are enabled in advance to ensure all information is captured with no delay

Data Bus Power Control Signal (DPWR#)
– Driven by the 855PM chipset whenever data are transferred to the processor
– DPWR# is used to dynamically enable the processor’s 64-bit data bus input sense amplifiers and their related controls (~80 signals) only when data are transferred to the bus

BPRI Control
– This is a method to achieve the DPWR# functionality for the address bus
– BPRI# is asserted whenever the 855PM chipset attempts to drive the bus
  Used to dynamically enable the 32-bit address bus input sense amplifiers and their related controls (~40 signals) only when a transaction is issued to the bus


The Pentium® M Bus

Low Vtt:
– The processor’s I/O buffers work at a low voltage of 1.05V (Vtt)
– The low Vtt is an essential element in reducing the bus power
– Operating at low Vtt introduces a new set of problems
  The I/O buffer is working at the low linear point, which affects the buffer’s characteristics
– The bus includes a special Resistor Compensation (RCOMP) method to adjust the buffer strength dynamically during run time
– Accommodates the impacts of temperature, voltage drift, and bus topology
– At any thermal and power state the bus has full impedance termination
– It has split power planes that allow setting the I/O operating voltage to a fixed value of 1.05V even though the core may be operating at a higher Enhanced Intel SpeedStep™ technology operating point

PSI: Power Status Indicator
– Driven by the processor to control the current consumption of the voltage regulator when the processor operates in a low power state
– Reduces the overall platform power (not just the processor power!)


Enhanced SpeedStep™ Technology

The “Basic” SpeedStep™ Technology had
– 2 operating points
– Non-transparent switch

The “Enhanced” version provides
– Multiple voltage/frequency operating points
  The Pentium M processor 1.6GHz operating range:
  From 600MHz @ 0.956V
  To 1.6GHz @ 1.484V
– Transparent switch
– Frequent switches

Benefits
– Higher power efficiency
  2.7X lower frequency → 2X performance loss → >2X energy gain
– Outstanding battery life
– Excellent thermal management

[Figure: Voltage, Frequency, Power. Frequency (GHz) and typical power (watts) plotted against voltage (0.8 V to 1.6 V). Across the operating range frequency changes by 2.7X and power by 6.1X; efficiency ratio = 2.3]
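The chart's ratios can be compared against the Power = αCV²f model using the two operating points quoted for the 1.6GHz part. The sketch below is a rough check only: it assumes α and C are the same at both points and ignores leakage, so the modeled power ratio is not expected to match the measured 6.1X exactly; the names are illustrative.

```python
# Operating points quoted for the 1.6GHz Pentium M.
LOW = {"freq_ghz": 0.6, "volt": 0.956}
HIGH = {"freq_ghz": 1.6, "volt": 1.484}


def scaling_ratios(low: dict, high: dict):
    """Return (frequency, power, energy-per-cycle) ratios between two points.

    Power is modeled as alpha*C*V^2*f with alpha and C held constant, so the
    power ratio is (V ratio)^2 * (f ratio), and energy per cycle (power / f)
    scales with (V ratio)^2 alone.
    """
    freq_ratio = high["freq_ghz"] / low["freq_ghz"]
    volt_sq_ratio = (high["volt"] / low["volt"]) ** 2
    power_ratio = volt_sq_ratio * freq_ratio
    energy_per_cycle_ratio = power_ratio / freq_ratio
    return freq_ratio, power_ratio, energy_per_cycle_ratio


freq_ratio, power_ratio, energy_ratio = scaling_ratios(LOW, HIGH)
print(f"frequency ratio:        {freq_ratio:.1f}X")    # 2.7X
print(f"modeled power ratio:    {power_ratio:.1f}X")   # 6.4X
print(f"energy per cycle ratio: {energy_ratio:.1f}X")  # 2.4X
print(f"chart efficiency ratio: {6.1 / 2.7:.1f}")      # 2.3
```

The modeled energy-per-cycle ratio of about 2.4X is in line with the ">2X energy gain" claimed for running at the lowest operating point, and the chart's own efficiency ratio is simply its power ratio divided by its frequency ratio.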


Voltage, Power, Frequency

Transistors switch faster at higher voltage → higher voltage enables higher frequency

Maximum frequency grows about linearly with voltage
…within a given voltage range Vmin-Vmax
– V < Vmin: transistors won’t switch
– V > Vmax: the device may burn

“The cube law”: P ≈ kV³ (or ~1% V = 3% P)

Implications
– Can save energy/power when performance is not a factor

[Figure: XScale processor frequency (MHz) and power (mW) vs. voltage (0.5 V to 1.9 V). Source: Intel Corp. (http://developer.intel.com)]
