Intel Pentium 4
ENCM 515 - 2002
Jonathan Bienert
Tyson Marchuk

Overview:
  Product review
  Specialized architectural features (NetBurst)
  SIMD instruction capabilities (MMX, SSE2)
  SHARC 2106x comparison

Intel Pentium 4
  Reworked micro-architecture for high-bandwidth applications
    Internet audio and streaming video, image processing, video content creation, speech, 3D, CAD, games, multimedia, and multitasking user environments
  These are DSP-intensive applications!
  What about uses other than in a PC?

Hardware Features: (NetBurst micro-architecture)
  Hyper pipelined technology
  Advanced dynamic execution
  Cache (data, L1, L2)
  Rapid ALU execution engines
  400 MHz bus
  OOE (out-of-order execution)
  Microcode ROM

Hyper Pipeline
  20-stage pipeline!!!
  breaks down complex CISC instructions
  sub-stages mimic RISC
  faster execution

Filling the pipeline...
  Review of next 126 instructions to be executed

Branch prediction
  if a branch is mispredicted, must flush the 20-stage pipeline!!!
  branch target buffer (BTB)
  4K branch history table (BHT)
  assembly instruction hints

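To make the idea of a branch history table concrete, here is a minimal sketch of a generic textbook predictor: one 2-bit saturating counter per table entry. This is an assumption for illustration only, not the Pentium 4's actual prediction scheme; the names `Bht`, `bht_predict`, `bht_update` and the low-address-bits indexing are all hypothetical. Only the 4K table size comes from the slide.

```c
#include <stdint.h>

#define BHT_ENTRIES 4096u   /* 4K entries, the table size quoted on the slide */

/* One 2-bit saturating counter per entry:
 * 0, 1 = predict not taken; 2, 3 = predict taken. */
typedef struct {
    uint8_t counter[BHT_ENTRIES];
} Bht;

static void bht_init(Bht *t) {
    for (unsigned i = 0; i < BHT_ENTRIES; i++)
        t->counter[i] = 1;              /* start at "weakly not taken" */
}

/* Hypothetical indexing: use the low bits of the branch address. */
static unsigned bht_index(uint32_t branch_addr) {
    return branch_addr & (BHT_ENTRIES - 1u);
}

/* Returns 1 if the branch is predicted taken, 0 otherwise. */
static int bht_predict(const Bht *t, uint32_t branch_addr) {
    return t->counter[bht_index(branch_addr)] >= 2;
}

/* Move the counter one step toward the actual outcome, saturating at 0 and 3. */
static void bht_update(Bht *t, uint32_t branch_addr, int taken) {
    uint8_t *c = &t->counter[bht_index(branch_addr)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```

The two-bit counter means a single unusual outcome (for example, the exit of a long loop) does not immediately flip the prediction, which matters when every mispredict costs a 20-stage flush.
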
Cache
  8KB Data Cache
  L1 Execution Trace Cache
    12K of previous micro-instructions stored
    saves having to translate
  L2 Advanced Transfer Cache
    256K for data
    256-bit transfer every cycle
    allows 77GB/s data transfer on 2.4GHz

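The 77GB/s figure follows from the transfer width and the clock rate: 256 bits is 32 bytes, and 32 bytes x 2.4 GHz is 76.8 GB/s. A small sketch of that arithmetic, where the helper name `peak_bandwidth_bytes_per_s` is made up for this example:

```c
/* Peak bandwidth in bytes per second for an interface that moves
 * bits_per_cycle bits on every clock cycle. */
static double peak_bandwidth_bytes_per_s(unsigned bits_per_cycle, double clock_hz) {
    return (bits_per_cycle / 8.0) * clock_hz;
}
```

Calling `peak_bandwidth_bytes_per_s(256, 2.4e9)` gives 7.68e10 bytes/s, which rounds to the 77GB/s on the slide. This is a peak number; it assumes a transfer really does happen on every cycle.
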
Rapid ALU Execution Engines
  2 ALUs
    allow parallel operations
  Many arithmetic operations take 1/2 cycle
    each 2X ALU can have 2 operations per cycle

Software Features:
  Multimedia Extensions (MMX)
    8 MMX registers
  Streaming SIMD Extensions (SSE2)
    8 SSE/SSE2 registers
  Standard x86 Registers
    EAX, EBX, ECX, EDX, ESI, etc.
  Register rename to over 100

MMX (Multimedia Extensions)
  Accelerated performance through SIMD
    multimedia, communication, internet applications
  64-bit packed INTEGER data
    signed/unsigned

SSE2 (Streaming SIMD Extensions)
  Accelerate a broad range of applications
    video, speech, and image, photo processing, encryption, financial, engineering, and scientific applications
  128-bit SIMD instruction formats
    4 single precision FP values
    2 double precision FP values
    16 byte values
    8 word values
    4 double word values
    2 quad word values
    1 128-bit integer value

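One way to picture these formats is as different views of the same 16 bytes. The union below is only an illustration of the layouts listed above; the type name `Xmm128` is hypothetical, and real code would normally use the compiler's own vector types rather than a union like this.

```c
#include <stdint.h>

/* Different ways of slicing one 128-bit register's worth of data. */
typedef union {
    float    f32[4];   /* 4 single precision FP values */
    double   f64[2];   /* 2 double precision FP values */
    uint8_t  u8[16];   /* 16 byte values               */
    uint16_t u16[8];   /* 8 word values                */
    uint32_t u32[4];   /* 4 double word values         */
    uint64_t u64[2];   /* 2 quad word values           */
} Xmm128;
```

Every member occupies the full 16 bytes, so `sizeof(Xmm128)` is 16. A single instruction chooses which view applies: the same bits are treated as four floats by one instruction and as sixteen bytes by another.
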
SIMD Example (16-tap FIR filter - Real numbers)
  Applications for real FIR filters
    general purpose filters in image processing, audio, and communication algorithms
  Will utilize SSE2 SIMD instruction set

Thinking about SIMD
  SSE2 instruction format is 128 bits
  128-bit SSE2 registers
  Many data formats!
  What precision do we want?
  Let's use 32-bit floating point for coefficients, input, output
    4 data elements x 32 bits = 128 bits

Parallelizing
  Require many single multiplications (coefficients x inputs), then add the results for output!
  Multiplications
  then need to perform additions...

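Before parallelizing, it helps to have the plain scalar version in view: for each output, 16 multiplications followed by 15 additions. The sketch below is a reference implementation under two assumptions that are not on the slides: the coefficients `h` are stored in the order they are applied to the input window (that is, already time-reversed), and the input `x` holds `n_out + TAPS - 1` samples. The function name `fir16_scalar` is made up for this example.

```c
#include <stddef.h>

#define TAPS 16

/* Scalar 16-tap FIR: y[i] = sum over k of h[k] * x[i + k].
 * x must hold n_out + TAPS - 1 samples. */
static void fir16_scalar(const float *x, const float *h, float *y, size_t n_out) {
    for (size_t i = 0; i < n_out; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < TAPS; k++)
            acc += h[k] * x[i + k];     /* one multiply, one add per tap */
        y[i] = acc;
    }
}
```

The inner loop is the part SIMD targets: the 16 multiplications are independent of each other, so four of them fit in one 128-bit operation.
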
Using SSE2 format
  Can hold 4 elements of an array (of 32-bit data) in each 128-bit register
  4 single precision floating point ops per cycle (32-bit)

Additions...
  In both registers, now have 4 32-bit results
  First add the results into an accumulator register
  4 single precision floating point ops per cycle (32-bit)

Additions...
  In a register, now have 4 32-bit results
  however, NO SSE2 instruction to add these 4!
  But can use other instructions
    Some BIT INTERTWINING
    then add
  This will give results for several output values!

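The steps above can be sketched with compiler intrinsics. This is one plausible reading of the slides, not the code from the Intel application note: four accumulator registers are built (one per output), then interleaved with `_MM_TRANSPOSE4_PS` so that a few vertical adds produce four finished outputs at once. The function names are hypothetical, `h` is assumed to be stored in the order it is applied to the input window, `n_out` is assumed to be a multiple of 4, and a scalar fallback is included for compilers that do not target SSE2.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 header; also provides the packed-float intrinsics */
#endif

#define TAPS 16

#if defined(__SSE2__)
/* Multiply 16 inputs by 16 coefficients, 4 lanes at a time, and
 * accumulate. Returns 4 partial sums that still need adding together. */
static __m128 dot16_partial(const float *x, const float *h) {
    __m128 acc = _mm_setzero_ps();
    for (size_t k = 0; k < TAPS; k += 4) {
        __m128 xv = _mm_loadu_ps(x + k);            /* 4 inputs       */
        __m128 hv = _mm_loadu_ps(h + k);            /* 4 coefficients */
        acc = _mm_add_ps(acc, _mm_mul_ps(xv, hv));  /* 4 mul + 4 add  */
    }
    return acc;
}
#endif

/* 16-tap FIR producing 4 outputs per iteration.
 * n_out must be a multiple of 4; x must hold n_out + TAPS - 1 samples. */
static void fir16_simd(const float *x, const float *h, float *y, size_t n_out) {
#if defined(__SSE2__)
    for (size_t i = 0; i < n_out; i += 4) {
        __m128 a0 = dot16_partial(x + i,     h);
        __m128 a1 = dot16_partial(x + i + 1, h);
        __m128 a2 = dot16_partial(x + i + 2, h);
        __m128 a3 = dot16_partial(x + i + 3, h);
        /* Interleave: afterwards lane j of each register holds one
         * partial sum that belongs to output i + j. */
        _MM_TRANSPOSE4_PS(a0, a1, a2, a3);
        __m128 sum = _mm_add_ps(_mm_add_ps(a0, a1), _mm_add_ps(a2, a3));
        _mm_storeu_ps(y + i, sum);                  /* 4 outputs at once */
    }
#else
    /* Scalar fallback with the same results, for non-SSE2 targets. */
    for (size_t i = 0; i < n_out; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < TAPS; k++)
            acc += h[k] * x[i + k];
        y[i] = acc;
    }
#endif
}
```

The transpose is what sidesteps the missing "add the 4 lanes" instruction: instead of summing across one register, the partial sums are rearranged so that ordinary lane-by-lane adds finish four outputs together. Unaligned loads are used here for simplicity; aligned data would allow `_mm_load_ps` instead.
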
ADI SHARC 21k vs. P4
  Disadvantages
    Slower clock speed (40MHz vs 2400MHz)
    Fewer opportunities for parallelism (5 vs 11)
    Much less memory (Cache and System)
    Limited algorithm applicability
    Limited applications
    Older (less compiler support)
      1994 vs 2001

ADI SHARC 21k vs. P4
  Advantages
    Hardware loops
    Easier to program for optimal speed
    Cheaper
    Lower power consumption
    Runs cooler

FIR Performance
  Hard to obtain P4 performance numbers
  Can estimate based on 2 FP multiplies per clock, the clock rate, and the assumption that the pipeline can be kept full
    2 * 2.4GHz ~ 4.8 billion multiplies per second
    If ~4 multiplies per element & 44000 samples/s
    FIR length > ~25k taps
  SHARC => ~ 200 taps (Lab 4)
  Factor of ~125x

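The estimate above is a single division, sketched here so the inputs are easy to vary. The function and parameter names are made up for this example, and the result is only as good as the slide's assumption that the pipeline stays full.

```c
/* Upper bound on real-time FIR length:
 * multiplies available per second / multiplies needed per tap per second. */
static double max_fir_taps(double mults_per_clock, double clock_hz,
                           double mults_per_element, double sample_rate_hz) {
    double mults_per_second = mults_per_clock * clock_hz;
    return mults_per_second / (mults_per_element * sample_rate_hz);
}
```

With the slide's numbers, `max_fir_taps(2, 2.4e9, 4, 44000)` is about 27,000 taps, consistent with "> ~25k". Dividing the rounded 25k by the SHARC's ~200 taps gives the ~125x factor.
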
IIR Performance
  Hard to obtain P4 performance numbers
  No hardware circular buffers
  Does have BTB, BHT, etc.
  Prefetches ~256 bytes ahead of current position in code

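Without hardware circular buffers, the wrap-around of a filter's delay line has to be done in software. A common approach, sketched below as an assumption rather than anything from the slides, is to keep the length a power of two and wrap the index with a bit mask instead of a compare and branch. The names `DelayLine`, `delay_push` and `delay_get` are hypothetical.

```c
#include <stddef.h>

#define DELAY_LEN 16u   /* must be a power of two for the mask to work */

typedef struct {
    float  data[DELAY_LEN];
    size_t head;            /* index of the newest sample */
} DelayLine;

/* Store a new sample, overwriting the oldest one. */
static void delay_push(DelayLine *d, float sample) {
    d->head = (d->head + 1u) & (DELAY_LEN - 1u);   /* wrap without a branch */
    d->data[d->head] = sample;
}

/* age 0 = newest sample, age 1 = the one before it, and so on. */
static float delay_get(const DelayLine *d, size_t age) {
    return d->data[(d->head - age) & (DELAY_LEN - 1u)];
}
```

Each access costs an extra AND compared with a DSP that wraps the address in hardware, but it avoids adding a branch that could mispredict.
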
FFT Performance
  Hard to obtain P4 performance numbers
  Prime95 uses FFTs in the Lucas-Lehmer test for Mersenne primes
    Involves FFT, squaring and iFFT, etc.
    256k points on P4 2.3GHz ~ 10.517ms
  Compare to SHARC 2048 point FFT ~0.37ms
    If SHARC could do 256k, 46.25ms (But)

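The 46.25ms extrapolation appears to scale the SHARC time linearly with the number of points (0.37ms x 125, treating 256k as 256,000). Since FFT work grows roughly as N log2 N, a second estimate is worth having alongside it. The sketch below computes both; the function names are made up, and the N log2 N figure is a rough model, not a measurement.

```c
/* Integer log2 for powers of two (floor of log2 otherwise). */
static unsigned ilog2_u(unsigned long n) {
    unsigned r = 0;
    while (n > 1) { n >>= 1; r++; }
    return r;
}

/* Scale a measured time assuming cost proportional to N. */
static double fft_time_linear(double t_ms, unsigned long n_from, unsigned long n_to) {
    return t_ms * ((double)n_to / (double)n_from);
}

/* Scale a measured time assuming cost proportional to N * log2(N). */
static double fft_time_nlogn(double t_ms, unsigned long n_from, unsigned long n_to) {
    double work_from = (double)n_from * ilog2_u(n_from);
    double work_to   = (double)n_to   * ilog2_u(n_to);
    return t_ms * work_to / work_from;
}
```

`fft_time_linear(0.37, 2048, 256000)` reproduces the slide's 46.25ms, while `fft_time_nlogn(0.37, 2048, 262144)` comes out near 77.5ms. Either way the comparison ignores whether a 256k-point transform would fit in the SHARC's memory at all.
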
Optimization Example
  Hard to optimize Pentium 4 assembly
  Example of multiplying by a constant, 10
  Taken mainly from: www.emulators.com/docs/pentium_1.htm

Multiplying by 10
  Slowest way:
    IMUL EAX, 10
  Usually optimal way (Visual C++ 6.0)
    LEA EAX, [EAX+EAX*4]
    SHL EAX, 1
  Shift Add Shift
  On most x86 processors takes 2 cycles
  Pentium MMX and before 3 cycles
  On Pentium 4 takes 6 cycles!

Multiplying by 10
  Optimal for Pentium 4
    LEA ECX, [EAX + EAX]
    LEA EAX, [ECX+EAX*8]
  On most x86 still takes 2 cycles
  On Pentium 4 takes ~ 3 cycles (OOE - Ops)
  But on older processors Pentium MMX and before this now takes 4 cycles!

Multiplying by 10
  Best generic case
    LEA EAX, [EAX + EAX*4]
    ADD EAX, EAX
  On most x86 still takes 2 cycles
  On older processors Pentium MMX and before this now takes 3 cycles again
  On Pentium 4 this takes 4 cycles
  Obviously really hard to optimize

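All three instruction sequences compute the same value; they differ only in how 10x is decomposed. The C functions below mirror each decomposition so the arithmetic can be checked. The function names are made up for this example, and a compiler is free to emit different instructions for them, so this checks the identities rather than the cycle counts.

```c
#include <stdint.h>

/* LEA EAX,[EAX+EAX*4] ; SHL EAX,1   ->  (x + 4x) shifted left by 1 */
static uint32_t mul10_lea_shl(uint32_t x) {
    uint32_t t = x + x * 4u;   /* 5x  */
    return t << 1;             /* 10x */
}

/* LEA ECX,[EAX+EAX] ; LEA EAX,[ECX+EAX*8]   ->  2x + 8x */
static uint32_t mul10_lea_lea(uint32_t x) {
    uint32_t c = x + x;        /* 2x  */
    return c + x * 8u;         /* 10x */
}

/* LEA EAX,[EAX+EAX*4] ; ADD EAX,EAX   ->  5x + 5x */
static uint32_t mul10_lea_add(uint32_t x) {
    uint32_t t = x + x * 4u;   /* 5x  */
    return t + t;              /* 10x */
}
```

Using `uint32_t` gives the same wrap-around on overflow as a 32-bit register, so the three functions agree with `x * 10u` for every input.
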
REFERENCES
  Intel application note: AP 809 - Real and Complex Filter Using Streaming SIMD Extensions
  graphics from: http://www6.tomshardware.com/cpu/00q4/001120/p4-01.html