
Intel Pentium 4




  • Intel Pentium 4
    ENCM 515 - 2002
    Jonathan Bienert
    Tyson Marchuk

  • Overview:
    Product review
    Specialized architectural features (NetBurst)
    SIMD instruction capabilities (MMX, SSE2)
    SHARC 2106x comparison

  • Intel Pentium 4
    Reworked micro-architecture for high-bandwidth applications
    Internet audio and streaming video, image processing, video content creation, speech, 3D, CAD, games, multimedia, and multitasking user environments
    These are DSP-intensive applications!
    What about uses other than in a PC?

  • Hardware Features: (NetBurst micro-architecture)
    Hyper pipelined technology
    Advanced dynamic execution
    Cache (data, L1, L2)
    Rapid ALU execution engines
    400 MHz bus
    Out-of-order execution (OOE)
    Microcode ROM

  • Hyper Pipeline
    20-stage pipeline!
    Breaks down complex CISC instructions
    Sub-stages mimic RISC
    Faster execution

  • Filling the pipeline...
    Reviews the next 126 instructions to be executed

    Branch prediction
    On a mispredict, the 20-stage pipeline must be flushed!
    Branch target buffer (BTB)
    4K branch history table (BHT)
    Assembly instruction hints

  • Cache
    8 KB data cache
    L1 Execution Trace Cache
      12K of previous micro-instructions stored
      Saves having to translate
    L2 Advanced Transfer Cache
      256 KB for data
      256-bit transfer every cycle
      Allows 77 GB/s data transfer at 2.4 GHz
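
The 77 GB/s figure on the cache slide follows from the 256-bit transfer width and the 2.4 GHz clock. A minimal sketch of that arithmetic, assuming one full-width transfer per core clock and 1 GB = 10^9 bytes (the function name is illustrative, not from the slides):

```python
def l2_bandwidth_gb_per_s(bus_width_bits: int, clock_hz: float) -> float:
    """Peak L2 transfer rate in GB/s, assuming one transfer per clock cycle."""
    bytes_per_cycle = bus_width_bits // 8   # 256 bits -> 32 bytes
    return bytes_per_cycle * clock_hz / 1e9


# Figures from the slide: 256-bit transfers on a 2.4 GHz part.
peak = l2_bandwidth_gb_per_s(256, 2.4e9)
print(f"{peak:.1f} GB/s")
```

This is a peak rate; it says nothing about sustained throughput once misses to system memory are involved.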

  • Rapid ALU Execution Engines
    2 ALUs
    Allow parallel operations
    Many arithmetic operations take 1/2 cycle
    Each 2X ALU can perform 2 operations per cycle

  • Software Features:
    Multimedia Extensions (MMX)
      8 MMX registers
    Streaming SIMD Extensions (SSE2)
      8 SSE/SSE2 registers
    Standard x86 registers
      EAX, EBX, ECX, EDX, ESI, etc.
    Register renaming to over 100

  • MMX (Multimedia Extensions)
    Accelerated performance through SIMD
    Multimedia, communication, internet applications
    64-bit packed INTEGER data
    Signed/unsigned

  • SSE2 (Streaming SIMD Extensions 2)
    Accelerates a broad range of applications
    Video, speech, image and photo processing, encryption, financial, engineering, and scientific applications
    128-bit SIMD instruction formats
      4 single-precision FP values
      2 double-precision FP values
      16 byte values
      8 word values
      4 double-word values
      2 quad-word values
      1 128-bit integer value
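
Each lane count in the SSE2 format list is just 128 bits divided by the element width. The sketch below reproduces those counts and shows that the same 16 bytes can be viewed under two of the formats. It is a plain Python model, not SSE2 code; the names `XMM_BITS`, `ELEMENT_BITS`, and `lanes_per_register` are made up for illustration.

```python
import struct

XMM_BITS = 128  # width of one SSE2 register

# Element width in bits for each data format listed on the slide.
ELEMENT_BITS = {
    "single precision FP": 32,
    "double precision FP": 64,
    "byte": 8,
    "word": 16,
    "double word": 32,
    "quad word": 64,
    "128-bit integer": 128,
}


def lanes_per_register(element_bits: int) -> int:
    """Number of elements of the given width that fit in one register."""
    return XMM_BITS // element_bits


lane_counts = {name: lanes_per_register(bits) for name, bits in ELEMENT_BITS.items()}

# The same 16 bytes viewed as 4 single-precision floats or as 8 16-bit words.
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
as_words = struct.unpack("<8H", raw)

for name, count in lane_counts.items():
    print(f"{count:2d} x {name}")
```

The register itself carries no type; the instruction chosen decides how the 128 bits are split into lanes.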

  • SIMD Example (16-tap FIR filter - real numbers)
    Applications for real FIR filters
    General-purpose filters in image processing, audio, and communication algorithms

    Will utilize the SSE2 SIMD instruction set

  • Thinking about SIMD
    SSE2 instruction format is 128 bits
    128-bit SSE2 registers
    Many data formats!
    What precision do we want?

    Let's use 32-bit floating point for coefficients, input, output
    4 data values x 32 bits = 128 bits

  • Parallelizing
    Requires many single multiplications (coefficients x inputs), then adding the results for the output!
    Multiplications

    Then need to perform additions...

  • Using SSE2 format
    Can hold 4 elements of an array (of 32-bit data) in each 128-bit register
    4 single-precision floating-point ops per cycle (32-bit)

  • Additions...
    In both registers, now have 4 32-bit results
    First add the results into an accumulator register
    4 single-precision floating-point ops per cycle (32-bit)

  • Additions...
    In a register, now have 4 32-bit results
    However, NO SSE2 instruction to add these 4!
    But can use other instructions
    Some BIT INTERTWINING
    Then add
    This will give results for several output values!
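
The multiply, accumulate, and final-add steps from the FIR slides can be modelled in plain Python, with a 4-element list standing in for one 128-bit register. This is a simplified sketch under stated assumptions: it produces a single output sample and folds the four lanes with a pairwise add, whereas the slides describe rearranging data so that several outputs come out together. Python floats are 64-bit, so the 32-bit rounding is not modelled. All function names are hypothetical.

```python
from typing import List, Sequence

LANES = 4  # four 32-bit floats per 128-bit register


def mul4(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Lane-wise multiply of two 4-element 'registers'."""
    return [x * y for x, y in zip(a, b)]


def add4(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Lane-wise add of two 4-element 'registers'."""
    return [x + y for x, y in zip(a, b)]


def fir_output_simd(coeffs: Sequence[float], window: Sequence[float]) -> float:
    """One FIR output; window[k] is the input sample paired with coeffs[k]."""
    assert len(coeffs) == len(window) and len(coeffs) % LANES == 0
    acc = [0.0] * LANES
    for i in range(0, len(coeffs), LANES):
        # 4 multiplies, then 4 adds into the accumulator register.
        acc = add4(acc, mul4(coeffs[i:i + LANES], window[i:i + LANES]))
    # No single instruction sums the 4 lanes: rearrange, then add twice.
    pair = [acc[0] + acc[2], acc[1] + acc[3]]
    return pair[0] + pair[1]


def fir_output_scalar(coeffs: Sequence[float], window: Sequence[float]) -> float:
    """Reference result: one multiply-add per tap."""
    return sum(c * x for c, x in zip(coeffs, window))


# 16-tap example with small integer-valued data so both paths agree exactly.
coeffs = [float(k + 1) for k in range(16)]
window = [float(16 - k) for k in range(16)]
y_simd = fir_output_simd(coeffs, window)
y_ref = fir_output_scalar(coeffs, window)
print(y_simd, y_ref)
```

For 16 taps the loop runs 4 times, so the lane-wise work is 4 multiplies and 4 adds of whole registers, followed by the one-off lane fold.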

  • ADI SHARC 21k vs. P4: Disadvantages
    Slower clock speed (40 MHz vs 2400 MHz)
    Fewer opportunities for parallelism (5 vs 11)
    Much less memory (cache and system)
    Limited algorithm applicability
    Limited applications
    Older (less compiler support)
    1994 vs 2001

  • ADI SHARC 21k vs. P4: Advantages
    Hardware loops
    Easier to program for optimal speed
    Cheaper
    Lower power consumption
    Runs cooler

  • FIR Performance
    Hard to obtain P4 performance numbers
    Can estimate based on 2 FP multiplies per clock, the clock rate, and the assumption that the pipeline can be kept full
    2 * 2.4 GHz ~ 4.8 billion multiplies per second
    If ~4 multiplies per element & 44000 samples/s
    FIR length > ~25k taps
    SHARC => ~200 taps (Lab 4)
    Factor of ~125x
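
The FIR estimate above can be reproduced step by step. The sketch assumes that "~4 multiplies per element" means roughly 4 multiplies of overhead-inclusive work per tap per output sample, which is one reading of the slide rather than something it states; the function name is illustrative.

```python
def max_fir_taps(mults_per_clock: float, clock_hz: float,
                 mults_per_tap: float, sample_rate_hz: float) -> float:
    """Longest real-time FIR, assuming the pipeline is always kept full."""
    mults_per_second = mults_per_clock * clock_hz      # 2 * 2.4 GHz = 4.8e9
    return mults_per_second / (mults_per_tap * sample_rate_hz)


# Slide figures: 2 FP multiplies/clock, 2.4 GHz, ~4 multiplies/tap, 44000 samples/s.
p4_taps = max_fir_taps(2, 2.4e9, 4, 44_000)

# The slide rounds the P4 result down to ~25k taps and compares with ~200 on the SHARC.
ratio = 25_000 / 200

print(f"P4 upper bound: {p4_taps:.0f} taps, ratio vs SHARC: {ratio:.0f}x")
```

The computed bound is a little above 27k taps, which is consistent with the slide's "> ~25k"; the 125x factor comes from the rounded 25k figure.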

  • IIR Performance
    Hard to obtain P4 performance numbers
    No hardware circular buffers
    Does have BTB, BHT, etc.
    Prefetches ~256 bytes ahead of the current position in code

  • FFT Performance
    Hard to obtain P4 performance numbers
    Prime95 uses FFT to calculate the Lucas-Lehmer test for Mersenne primes
    Involves FFT, squaring, iFFT, etc.
    256k points on P4 2.3 GHz ~ 10.517 ms
    Compare to SHARC 2048-point FFT ~ 0.37 ms
    If SHARC could do 256k, 46.25 ms (But)
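
The 46.25 ms SHARC figure matches a straight linear scaling of the 2048-point time, taking 256k as 256,000 points. The sketch below reproduces that number and, as an added assumption not made on the slide, also shows what N log N scaling would give, since FFT work grows faster than linearly. Function names are illustrative.

```python
import math

SHARC_2048_MS = 0.37   # SHARC 2048-point FFT time from the slide
P4_256K_MS = 10.517    # P4 2.3 GHz time from the slide (FFT, squaring, iFFT)


def scale_linear(time_ms: float, n_from: int, n_to: int) -> float:
    """Scale a timing in proportion to the number of points."""
    return time_ms * n_to / n_from


def scale_nlogn(time_ms: float, n_from: int, n_to: int) -> float:
    """Scale a timing in proportion to N * log2(N) (assumed FFT cost model)."""
    return time_ms * (n_to * math.log2(n_to)) / (n_from * math.log2(n_from))


linear_ms = scale_linear(SHARC_2048_MS, 2048, 256_000)
nlogn_ms = scale_nlogn(SHARC_2048_MS, 2048, 256_000)

print(f"linear: {linear_ms:.2f} ms, N log N: {nlogn_ms:.1f} ms, "
      f"P4: {P4_256K_MS} ms")
```

Either way the comparison is rough: the P4 number covers more than a single FFT, and neither scaling accounts for whether 256k points would fit in SHARC memory.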

  • Optimization Example
    Hard to optimize Pentium 4 assembly
    Example of multiplying by a constant, 10
    Taken mainly from:

  • Multiplying by 10
    Slowest way: IMUL EAX, 10
    Usually optimal way (Visual C++ 6.0)
      LEA EAX, [EAX+EAX*4]
      SHL EAX, 1
    Shift Add Shift
    On most x86 processors takes 2 cycles
    Pentium MMX and before: 3 cycles
    On Pentium 4 takes 6 cycles!

  • Multiplying by 10
    Optimal for Pentium 4
      LEA ECX, [EAX+EAX]
      LEA EAX, [ECX+EAX*8]
    On most x86 still takes 2 cycles
    On Pentium 4 takes ~3 cycles (OOE - Ops)
    But on older processors (Pentium MMX and before) this now takes 4 cycles!

  • Multiplying by 10
    Best generic case
      LEA EAX, [EAX+EAX*4]
      ADD EAX, EAX
    On most x86 still takes 2 cycles
    On older processors (Pentium MMX and before) this takes 3 cycles again
    On Pentium 4 this takes 4 cycles
    Obviously really hard to optimize
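
All three instruction sequences above compute the same value as IMUL EAX, 10; they differ only in timing. A small Python model can confirm the arithmetic, with 32-bit wraparound applied after each step. The helper `lea` models only the base + index*scale address computation, and all function names are made up for this sketch; it says nothing about cycle counts.

```python
MASK32 = 0xFFFFFFFF  # keep results to 32 bits, like a 32-bit register


def lea(base: int, index: int, scale: int) -> int:
    """Model of LEA dst, [base + index*scale] on 32-bit registers."""
    return (base + index * scale) & MASK32


def mul10_imul(eax: int) -> int:
    """IMUL EAX, 10 (low 32 bits of the product)."""
    return (eax * 10) & MASK32


def mul10_lea_shl(eax: int) -> int:
    eax = lea(eax, eax, 4)          # LEA EAX, [EAX+EAX*4]  -> 5x
    return (eax << 1) & MASK32      # SHL EAX, 1            -> 10x


def mul10_two_lea(eax: int) -> int:
    ecx = lea(eax, eax, 1)          # LEA ECX, [EAX+EAX]    -> 2x
    return lea(ecx, eax, 8)         # LEA EAX, [ECX+EAX*8]  -> 2x + 8x


def mul10_lea_add(eax: int) -> int:
    eax = lea(eax, eax, 4)          # LEA EAX, [EAX+EAX*4]  -> 5x
    return (eax + eax) & MASK32     # ADD EAX, EAX          -> 10x


samples = [0, 1, 7, 12345, 0x80000000, 0xFFFFFFFF]
all_agree = all(
    mul10_imul(x) == mul10_lea_shl(x) == mul10_two_lea(x) == mul10_lea_add(x)
    for x in samples
)
print(all_agree)
```

The two-LEA form is the only one whose second instruction does not depend solely on the first result, which is one plausible reason it suits an out-of-order core; the slides do not spell this out.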

  • REFERENCES
    Intel application note: AP 809 - Real and Complex Filter Using Streaming SIMD Extensions
    Graphics from:
