Intel Pentium 4 Processor

  • Intel Pentium 4 Processor

    Presented by Michele Co

    (much slide content courtesy of Zhijian Lu and Steve Kelley)

  • Outline

    Introduction (Zhijian)
    Willamette (11/2000): Instruction Set Architecture (Zhijian); Instruction Stream (Steve); Data Stream (Zhijian); What went wrong (Steve)
    Pentium 4 revisions: Northwood (1/2002); Xeon (Prestonia, ~2002); Prescott (2/2004); Dual Core


  • Introduction

    Intel Pentium 4 processor:
    Latest IA-32 processor, equipped with a full set of IA-32 SIMD operations
    First implementation of Intel's new NetBurst micro-architecture (11/2000)

  • IA-32

    Intel Architecture, 32-bit (IA-32)
    80386 instruction set (1985)
    CISC, 32-bit addresses
    Flat memory model
    Registers: eight 32-bit general-purpose registers; eight FP stack registers; six segment registers

  • IA-32 (contd)

    Addressing modes:
    Register indirect (mem[reg])
    Base + displacement (mem[reg + const])
    Base + scaled index (mem[reg + (2^scale x index)])
    Base + scaled index + displacement (mem[reg + (2^scale x index) + displacement])

    SIMD instruction sets:
    MMX (Pentium MMX, Pentium II): eight 64-bit MMX registers, integer ops only
    SSE (Streaming SIMD Extensions, Pentium III): eight 128-bit registers
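
The four addressing modes above are all special cases of one formula, base + (2^scale x index) + displacement. A minimal sketch of that calculation follows; the function name, the register dictionary, and the register values are made up for illustration and are not part of the original slides.

```python
# Toy model of the IA-32 addressing modes listed on the slide.
# Register names and values are illustrative assumptions.

def effective_address(regs, base, index=None, scale=0, displacement=0):
    """Return base + (2**scale * index) + displacement, wrapped to 32 bits.

    scale is the 2-bit scale field (0..3), so the index register is
    multiplied by 1, 2, 4, or 8.
    """
    if not 0 <= scale <= 3:
        raise ValueError("scale must be in 0..3")
    addr = regs[base] + displacement
    if index is not None:
        addr += (1 << scale) * regs[index]
    return addr & 0xFFFFFFFF  # 32-bit addresses wrap around

regs = {"ebx": 0x1000, "esi": 5}

register_indirect = effective_address(regs, "ebx")
base_disp = effective_address(regs, "ebx", displacement=8)
base_scaled = effective_address(regs, "ebx", index="esi", scale=2)
base_scaled_disp = effective_address(regs, "ebx", index="esi", scale=2, displacement=8)
```

With scale=2 the index is multiplied by 4, which is the usual way to step through an array of 32-bit elements.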

  • Pentium III vs. Pentium 4 Pipeline

  • Comparison Between Pentium3 and Pentium4

  • Execution on MPEG4 Benchmarks @ 1 GHz

  • Instruction Set Architecture

    Pentium 4 ISA = Pentium III ISA + SSE2 (Streaming SIMD Extensions 2)
    SSE2 is an architectural enhancement to the IA-32 architecture

  • SSE2

    Extends MMX and the SSE extensions with 144 new instructions:
    128-bit SIMD integer arithmetic operations
    128-bit SIMD double-precision floating-point operations
    Enhanced cache and memory management operations

  • Comparison Between SSE and SSE2

    Both support operations on 128-bit XMM registers
    SSE only supports 4 packed single-precision floating-point values
    SSE2 supports more: 2 packed double-precision floating-point values; 16 packed byte integers; 8 packed word integers; 4 packed doubleword integers; 2 packed quadword integers; 1 double quadword
    (Word = 2 bytes)

  • Packing

    128 bits (word = 2 bytes)
    2 quad words (64 bits each)
    4 double words (32 bits each)
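
The element counts on the SSE/SSE2 comparison slide follow directly from dividing the 128-bit register width by the element width. The sketch below checks that arithmetic and uses Python's `struct` module to view the same 16 bytes as different packed types; it models only the bit layout (little-endian, as on IA-32), not actual SSE2 execution, and the variable names are illustrative.

```python
import struct

XMM_BITS = 128

# Element widths in bits (word = 2 bytes, as on the slide).
ELEMENT_BITS = {
    "byte": 8,
    "word": 16,
    "doubleword": 32,
    "quadword": 64,
    "double quadword": 128,
    "single-precision float": 32,
    "double-precision float": 64,
}

# Number of elements of each type that fit in one XMM register.
lane_counts = {name: XMM_BITS // bits for name, bits in ELEMENT_BITS.items()}

# The same 16 bytes can be interpreted as different packed types.
raw = struct.pack("<2d", 1.0, 2.0)          # 2 packed double-precision values
as_doublewords = struct.unpack("<4I", raw)  # viewed as 4 packed doublewords
as_bytes = struct.unpack("<16B", raw)       # viewed as 16 packed bytes
```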

  • Hardware Support for SSE2

    Adder and multiplier units in the SSE2 engine are 128 bits wide, twice the width of those in the Pentium III
    Increased load/store bandwidth for floating-point values: loads and stores are 128 bits wide; one load plus one store can be completed between an XMM register and the L1 cache in one clock cycle

  • SSE2 Instructions (1)

    Data movement: move data between XMM registers and between XMM registers and memory
    Double-precision floating-point operations: arithmetic instructions on both scalar and packed values
    Logical instructions: perform logical operations on packed double-precision floating-point values

  • SSE2 Instructions (2)

    Compare instructions: compare packed and scalar double-precision floating-point values
    Shuffle and unpack instructions: shuffle or interleave double-precision floating-point values in packed double-precision floating-point operands
    Conversion instructions: convert between doubleword integers and double-precision floating-point values, or between single-precision and double-precision floating-point values

  • SSE2 Instructions (3)

    Packed single-precision floating-point instructions: convert between single-precision floating-point and doubleword integer operands
    128-bit SIMD integer instructions: operations on integers contained in XMM registers
    Cacheability control and instruction ordering: more operations for caching of data when storing from XMM registers to memory, and additional control of instruction ordering on store operations

  • Conclusion

    Pentium 4 is equipped with the full set of IA-32 SIMD technology; all existing software can run correctly on it
    AMD has decided to embrace and implement SSE and SSE2 in future CPUs

  • Instruction Stream

  • Instruction Stream

    What's new? Added Trace Cache; improved branch predictor
    Terminology:
    µop: micro-op, an already decoded RISC-like instruction
    Front end: instruction fetch and issue

  • Front End

    Prefetches instructions that are likely to be executed
    Fetches instructions that haven't been prefetched
    Decodes instructions into µops
    Generates µops for complex instructions or special-purpose code
    Predicts branches

  • Prefetch

    Three methods of prefetching:
    Instructions only: hardware
    Data only: software
    Code or data: hardware

  • Decoder

    Single decoder that can operate at a maximum of 1 instruction per cycle
    Receives instructions from the L2 cache 64 bits at a time
    Some complex instructions must enlist the help of the microcode ROM

  • Trace Cache

    Primary instruction cache in the NetBurst architecture
    Stores decoded µops
    ~12K µops capacity
    On a Trace Cache miss, instructions are fetched and decoded from the L2 cache

  • What is a Trace Cache?

    Example code: I1; I2; br r2, L1; I3; I4; I5; L1: I6; I7
    [Figure: layout of this code in a traditional instruction cache vs. a trace cache]
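
To make the contrast concrete: a traditional instruction cache holds instructions in static address order, while a trace cache holds them in predicted execution order, so the instructions after a taken branch sit next to the branch. The sketch below is a toy model built on the slide's example program; the `build_trace` function and its `max_len` limit are illustrative assumptions, and it works on whole instructions rather than µops.

```python
# The slide's example program, in static (address) order. This is also
# the order a traditional instruction cache would hold it in.
PROGRAM = ["I1", "I2", "br r2, L1", "I3", "I4", "I5", "L1: I6", "I7"]
BRANCH_AT = 2  # position of the branch
TARGET_AT = 6  # position of label L1

def build_trace(predict_taken, max_len=6):
    """Follow the predicted path and record instructions in that order."""
    trace, pc = [], 0
    while pc < len(PROGRAM) and len(trace) < max_len:
        trace.append(PROGRAM[pc])
        if pc == BRANCH_AT and predict_taken:
            pc = TARGET_AT  # continue the trace at the branch target
        else:
            pc += 1
    return trace

taken_trace = build_trace(predict_taken=True)
# ['I1', 'I2', 'br r2, L1', 'L1: I6', 'I7']
not_taken_trace = build_trace(predict_taken=False)
# ['I1', 'I2', 'br r2, L1', 'I3', 'I4', 'I5']
```

In the taken case the trace skips I3 to I5 entirely, so fetching along the predicted path never crosses a gap in the cache line.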


  • Pentium 4 Trace Cache

    Has its own branch predictor that directs where instruction fetching needs to go next in the Trace Cache
    Removes: decoding costs on frequently decoded instructions; extra latency to decode instructions upon branch mispredictions

  • Microcode ROM

    Used for complex IA-32 instructions (> 4 µops), such as string move, and for fault and interrupt handling
    When a complex instruction is encountered, the Trace Cache jumps into the microcode ROM, which then issues the µops
    After the microcode ROM finishes, the front end of the machine resumes fetching µops from the Trace Cache

  • Branch Prediction

    Predicts ALL near branches: includes conditional branches, unconditional calls and returns, and indirect branches
    Does not predict far transfers: includes far calls, irets, and software interrupts

  • Branch Prediction

    Dynamically predicts the direction and target of branches based on the PC, using the BTB
    If no dynamic prediction is available, predicts statically: taken for backward (looping) branches; not taken for forward branches
    Traces are built across predicted branches to avoid branch penalties
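
The fallback rule above is simple enough to write down directly. In this sketch the BTB is reduced to a hypothetical dictionary from branch address to a predicted direction; the real structure also stores targets and history, which are left out here.

```python
def predict_taken(branch_pc, target_pc, btb):
    """Return True if the branch is predicted taken.

    btb is a stand-in for the dynamic predictor: a dict mapping a
    branch address to its predicted direction (an assumption made
    for this sketch).
    """
    if branch_pc in btb:
        return btb[branch_pc]     # dynamic prediction is available
    return target_pc < branch_pc  # static: backward taken, forward not taken

btb = {0x4010: False}

loop_branch = predict_taken(0x4020, 0x4000, btb)     # backward, not in BTB
forward_branch = predict_taken(0x4030, 0x4080, btb)  # forward, not in BTB
known_branch = predict_taken(0x4010, 0x4000, btb)    # BTB overrides static rule
```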

  • Branch Target Buffer

    Uses a branch history table and a branch target buffer to predict
    Updating occurs when the branch is retired

  • Return Address Stack

    16 entries
    Predicts return addresses for procedure calls
    Allows branches and their targets to coexist in a single cache line: increases parallelism since decode bandwidth is not wasted
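
A return address stack can be sketched as a small fixed-size stack: each call pushes the address of the instruction after it, and each return pops the most recent entry as its predicted target. The class below is a toy model; the overflow policy (dropping the oldest entry once 16 calls are nested) is an assumption of this sketch, not something the slides state.

```python
from collections import deque

class ReturnAddressStack:
    def __init__(self, entries=16):
        # With maxlen set, appending to a full deque drops the oldest entry.
        self._stack = deque(maxlen=entries)

    def on_call(self, return_address):
        """Record where the matching return should go."""
        self._stack.append(return_address)

    def predict_return(self):
        """Return the predicted return target, or None if the stack is empty."""
        return self._stack.pop() if self._stack else None

ras = ReturnAddressStack()
ras.on_call(0x401005)         # outer call
ras.on_call(0x402010)         # nested call
inner = ras.predict_return()  # 0x402010
outer = ras.predict_return()  # 0x401005
```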

  • Branch Hints

    The P4 permits software to provide hints to the branch prediction and trace formation hardware to enhance performance
    Hints take the form of prefixes to conditional branch instructions
    Hints are used only at trace build time and have no effect on already built traces

  • Out-of-Order Execution

    Designed to optimize performance by handling the most common operations in the most common context as fast as possible
    Up to 126 µops can be in flight at once, including up to 48 loads / 24 stores

  • Issue

    Instructions are fetched and decoded by the translation engine
    The translation engine builds instructions into sequences of µops
    Stores µops to the trace cache
    The trace cache can issue 3 µops per cycle

  • Execution

    Can dispatch up to 6 µops per cycle
    Exceeds trace cache and retirement µop bandwidth: allows for greater flexibility in issuing µops to different execution units

  • Execution Units


  • Double-pumped ALUs

    The ALU executes an operation on both the rising and falling edges of the clock cycle

  • Retirement

    Can retire 3 µops per cycle
    Precise exceptions
    Reorder buffer to organize completed µops
    Also keeps track of branches and sends updated branch information to the BTB

  • Execution Pipeline

  • Data Stream of Pentium 4 Processor

  • Register Renaming

  • Register Renaming (2)

    8-entry architectural register file
    128-entry physical register file
    Two register alias tables (RATs): front-end RAT and retirement RAT
    Data does not need to be copied between register files when the instruction retires
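
The point about not copying data at retirement can be illustrated with a toy renamer: both RATs hold pointers into the one physical register file, so retiring a µop only moves a pointer in the retirement RAT. The class, method names, and free-list policy below are assumptions made for this sketch, not a description of the actual hardware.

```python
class Renamer:
    def __init__(self, arch_regs, num_physical=128):
        # Both tables map architectural register -> physical register index.
        self.frontend_rat = {r: i for i, r in enumerate(arch_regs)}
        self.retirement_rat = dict(self.frontend_rat)
        self.free_list = list(range(len(arch_regs), num_physical))

    def rename_dest(self, arch_reg):
        """Allocate a fresh physical register for a µop that writes arch_reg.

        Raises IndexError if no register is free; real hardware would
        stall allocation instead.
        """
        phys = self.free_list.pop(0)
        self.frontend_rat[arch_reg] = phys
        return phys

    def rename_src(self, arch_reg):
        """Look up the newest (speculative) mapping for a source operand."""
        return self.frontend_rat[arch_reg]

    def retire(self, arch_reg, phys):
        """Commit a write: only the retirement RAT pointer changes."""
        old = self.retirement_rat[arch_reg]
        self.retirement_rat[arch_reg] = phys
        self.free_list.append(old)  # previous committed register is reusable

ARCH = ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"]
r = Renamer(ARCH)
p1 = r.rename_dest("eax")              # speculative write to eax
speculative = r.rename_src("eax")      # later µops read the new mapping
committed = r.retirement_rat["eax"]    # architectural state is unchanged so far
r.retire("eax", p1)                    # now the committed mapping moves
```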

  • On-chip Caches

    L1 instruction cache (Trace Cache)
    L1 data cache
    L2 unified cache
    Parameters:
    The caches are not inclusive, and a pseudo-LRU replacement algorithm is used

  • L1 Instruction Cache

    Execution Trace Cache stores decoded instructions
    Removes decoder latency from main execution loops
    Integrates the path of program execution flow into a single line

  • L1 Data Cache

    Nonblocking: supports up to 4 outstanding load misses
    Load latency: 2 clocks for integer, 6 clocks for floating-point
    1 load and 1 store per clock
    Speculative loads: assume the access will hit the cache; replay the dependent instructions when a miss happens

  • L2 Cache

    Load latency: net load access latency of 7 cycles
    Nonblocking
    Bandwidth: one load and one store in one cycle; a new cache operation can begin every 2 cycles; 256-bit wide bus between L1 and L2; 48 GB/s at 1.5 GHz
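
The 48 GB/s figure can be reproduced from the bus width and clock rate on the slide. The calculation below assumes one full 256-bit transfer per core clock and decimal gigabytes (10^9 bytes); the variable names are illustrative.

```python
BUS_WIDTH_BITS = 256
CLOCK_HZ = 1_500_000_000  # 1.5 GHz

bytes_per_transfer = BUS_WIDTH_BITS // 8                 # 32 bytes
peak_bytes_per_second = bytes_per_transfer * CLOCK_HZ    # 48,000,000,000
peak_gb_per_second = peak_bytes_per_second / 1e9         # 48.0
```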

  • Data Prefetcher in L2 Cache

    Hardware prefetcher monitors the reference patterns
    Brings in cache lines automatically
    Attempts to stay 256 bytes ahead of the current data access location
    Prefetches for up to 8 simultaneous independent streams

  • Store and Load

    Out-of-order store and load operations
    Stores are always in program order
    48 loads and 24 stores can be in flight
    Store buffers and load buffers are allocated at the allocation stage: 24 store buffers and 48 load buffers in total
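
Because the buffers are a fixed resource claimed at allocation, the buffer counts cap how many memory operations can be in flight. The sketch below models that as two counters; the class and its stall-by-returning-False behaviour are assumptions for illustration only.

```python
class MemoryBuffers:
    def __init__(self, load_entries=48, store_entries=24):
        self.free = {"load": load_entries, "store": store_entries}

    def allocate(self, kind):
        """Claim a buffer for a load or store µop.

        Returns False when none is free, meaning allocation must wait.
        """
        if self.free[kind] == 0:
            return False
        self.free[kind] -= 1
        return True

    def release(self, kind):
        """Return a buffer once its load or store has completed."""
        self.free[kind] += 1

bufs = MemoryBuffers()
store_results = [bufs.allocate("store") for _ in range(25)]
# The first 24 stores get a buffer; the 25th has to wait.
load_ok = bufs.allocate("load")  # loads use their own pool, so this succeeds
```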

  • Store

    Store operations are divided into two parts: store data and store address
    Store data is dispatched
