Intel Pentium 4

ENCM 515 - 2002

Jonathan Bienert

Tyson Marchuk

Overview:

• Product review

• Specialized architectural features (NetBurst)

• SIMD instructional capabilities (MMX, SSE2)

• SHARC 2106x comparison

Intel Pentium 4

• Reworked micro-architecture for high-bandwidth applications

• Internet audio and streaming video, image processing, video content creation, speech, 3D, CAD, games, multi-media, and multi-tasking user environments

• These are DSP-intensive applications!
  – What about uses other than in a PC?

Hardware Features: (NetBurst micro-architecture)

• Hyper pipelined technology

• Advanced dynamic execution

• Cache (data, L1, L2)

• Rapid ALU execution engines

• 400 MHz bus

• Out-of-order execution (OOE)

• Microcode ROM

Hyper Pipeline

• 20-stage pipeline!!!

• Breaks down complex CISC instructions
  – sub-stages mimic RISC
  – faster execution

Filling the pipeline...

• Review of next 126 instructions to be executed

• Branch prediction
  – on a mispredict, must flush the 20-stage pipeline!!!
  – branch target buffer (BTB)
  – 4K branch history table (BHT)
  – assembly instruction hints

Cache

• 8KB Data Cache

• L1 Execution Trace Cache
  – stores 12K previously translated micro-instructions
  – saves having to translate them again

• L2 Advanced Transfer Cache
  – 256K for data
  – 256-bit transfer every cycle
  – allows ~77 GB/s data transfer at 2.4 GHz
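The 77 GB/s figure follows from the bus width and the clock. A minimal sketch of the arithmetic, assuming 1 GB = 10^9 bytes and one full 256-bit transfer per core clock (the function name is hypothetical):

```c
/* Peak transfer rate in GB/s (1 GB = 10^9 bytes, an assumption)
   for a path that moves bits_per_cycle bits every clock. */
double peak_bandwidth_gbs(unsigned bits_per_cycle, double clock_ghz)
{
    double bytes_per_cycle = bits_per_cycle / 8.0;  /* 256 bits -> 32 bytes */
    return bytes_per_cycle * clock_ghz;             /* bytes/cycle * 10^9 cycles/s */
}
```

With 256 bits per cycle at 2.4 GHz this gives 32 x 2.4 = 76.8 GB/s, which the slide rounds to 77 GB/s.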

Rapid ALU Execution Engines

• 2 ALUs
  – allow parallel operations

• Many arithmetic operations take 1/2 cycle
  – each 2X ALU can perform 2 operations per cycle

Software Features:

• Multimedia Extensions (MMX)
  – 8 MMX registers

• Streaming SIMD Extensions 2 (SSE2)
  – 8 SSE/SSE2 registers

• Standard x86 Registers
  – EAX, EBX, ECX, EDX, ESI, etc.
  – Register renaming to over 100 registers

MMX (Multimedia Extensions)

• Accelerated performance through SIMD
  – multimedia, communication, internet applications

• 64-bit packed INTEGER data
  – signed/unsigned

SSE2 (Streaming SIMD Extensions 2)

• Accelerates a broad range of applications
  – video, speech, image and photo processing, encryption, financial, engineering, and scientific applications

• 128-bit SIMD instruction formats
  – 4 single precision FP values
  – 2 double precision FP values
  – 16 byte values
  – 8 word values
  – 4 double word values
  – 2 quad word values
  – 1 128-bit integer value
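One way to picture these formats is as different views of the same 16 bytes. The sketch below is illustrative only: the `xmm_view` type and `xmm_lanes` helper are hypothetical names, not part of any Intel header, and the single 128-bit integer view is omitted because standard C has no 128-bit integer type.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: one 128-bit register seen in each packed format. */
typedef union {
    float    f32[4];   /* 4 single precision FP values */
    double   f64[2];   /* 2 double precision FP values */
    uint8_t  u8[16];   /* 16 byte values */
    uint16_t u16[8];   /* 8 word values */
    uint32_t u32[4];   /* 4 double word values */
    uint64_t u64[2];   /* 2 quad word values */
} xmm_view;

/* Number of elements of a given size that fit in one register. */
size_t xmm_lanes(size_t element_bytes)
{
    return sizeof(xmm_view) / element_bytes;
}
```

For example, `xmm_lanes(sizeof(float))` is 4 and `xmm_lanes(sizeof(uint8_t))` is 16, matching the list above.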

SIMD Example (16-tap FIR filter - Real numbers)

• Applications for real FIR filters
  – general purpose filters in image processing, audio, and communication algorithms

• Will use the SSE2 SIMD instruction set

Thinking about SIMD

• SSE2 operands are 128 bits wide

• 128-bit SSE2 registers

• Many data formats!

• What precision do we want?

• Let's use 32-bit floating point for coefficients, input, output
  – 4 data elements x 32 bits = 128 bits

Parallelizing

• Requires many single multiplications (coefficients x inputs), then adding the results for the output!

• Multiplications…

• then need to perform additions...
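As a scalar baseline for the multiply-then-add structure, here is a minimal sketch of one FIR output sample, y[n] = sum of h[k] * x[n-k]. The function name and argument layout are assumptions made for illustration.

```c
#include <stddef.h>

/* One output sample of a real FIR filter.
   h: coefficients, x: input samples, taps: filter length.
   The caller must guarantee n >= taps - 1 so x[n - k] stays in range. */
float fir_sample(const float *h, const float *x, size_t taps, size_t n)
{
    float acc = 0.0f;
    for (size_t k = 0; k < taps; ++k)
        acc += h[k] * x[n - k];   /* one multiplication, one addition per tap */
    return acc;
}
```

Every tap costs one multiplication and one addition, and the multiplications are independent of each other, which is what makes them candidates for SIMD.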

Using SSE2 format

• Can hold 4 elements of an array (of 32-bit data) in each 128-bit register

• 4 single precision floating point ops per cycle (32-bit)
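A minimal sketch of one packed multiply using compiler intrinsics. The `mul4` name is hypothetical; the intrinsics come from `<xmmintrin.h>`, where the packed single precision operations are declared. A scalar fallback is included so the sketch also builds where `__SSE__` is not defined.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Multiply four coefficient/input pairs at once: out[i] = a[i] * b[i]. */
void mul4(const float *a, const float *b, float *out)
{
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);             /* load 4 floats into one register */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));  /* 4 multiplies in one instruction */
#else
    for (size_t i = 0; i < 4; ++i)           /* portable fallback */
        out[i] = a[i] * b[i];
#endif
}
```

The unaligned load and store forms are used here only so the sketch works with any array; aligned data would normally be preferred.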

Additions...

• In both registers, now have four 32-bit results

– First add the results into an accumulator register

• 4 single precision floating point ops per cycle (32-bit)

Additions...

• In a register, now have four 32-bit results
  – however, NO SSE2 instruction to add these 4!
  – But can use other instructions

• Some BIT INTERTWINING…then add

– This will give results for several output values!
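The accumulate and final add steps can be sketched as below. This is a simplified single-output variant written for illustration, not the application note's routine (which, per the slide, produces several outputs at once). The `fir16_dot` name is hypothetical, and `x` is assumed to be already ordered so that `x[i]` pairs with `h[i]`.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* 16-term dot product: multiply 4 pairs at a time, accumulate in a
   register, then combine the 4 partial sums with shuffles. */
float fir16_dot(const float *h, const float *x)
{
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < 16; i += 4) {
        __m128 prod = _mm_mul_ps(_mm_loadu_ps(h + i), _mm_loadu_ps(x + i));
        acc = _mm_add_ps(acc, prod);            /* 4 running partial sums */
    }
    /* No single instruction adds the 4 lanes, so rearrange and add. */
    __m128 hi  = _mm_movehl_ps(acc, acc);       /* lanes 2,3 moved to 0,1 */
    __m128 sum = _mm_add_ps(acc, hi);           /* lane0 = a0+a2, lane1 = a1+a3 */
    __m128 odd = _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum, odd)); /* (a0+a2) + (a1+a3) */
#else
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};   /* portable fallback */
    for (size_t i = 0; i < 16; ++i)
        lane[i % 4] += h[i] * x[i];
    return (lane[0] + lane[2]) + (lane[1] + lane[3]);
#endif
}
```

The fallback adds the lanes in the same order as the SIMD path, so both give the same floating point result.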

ADI SHARC 21k vs. P4

Disadvantages

• Slower clock speed (40MHz vs 2400MHz)

• Fewer opportunities for parallelism (5 vs 11)

• Much less memory (Cache and System)
  – Limited algorithm applicability
  – Limited applications

• Older (less compiler support)
  – 1994 vs 2001

ADI SHARC 21k vs. P4

Advantages

• Hardware loops

• Easier to program for optimal speed

• Cheaper

• Lower power consumption

• Runs cooler

FIR Performance

• Hard to obtain P4 performance numbers

• Can estimate based on 2 FP multiplies per clock, the clock rate, and the assumption that the pipeline can be kept full
  – 2 * 2.4GHz ~ 4.8 billion multiplies per second
  – If ~4 multiplies per element & 44000 samples/s
  – FIR length > ~25k taps

• SHARC => ~200 taps (Lab 4)

• Factor of ~125x
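The estimate above can be reproduced with a few lines of arithmetic. The function and parameter names are hypothetical, and the result is only as good as the assumption that the pipeline stays full.

```c
/* Longest real-time FIR filter sustainable for a given multiply
   throughput, cost per tap, and sample rate. */
double max_fir_taps(double mults_per_sec, double mults_per_tap,
                    double sample_rate_hz)
{
    return mults_per_sec / (mults_per_tap * sample_rate_hz);
}
```

With 4.8e9 multiplies per second, 4 multiplies per element and 44000 samples/s, this gives about 27,000 taps, consistent with the "> ~25k" figure; 25,000 / 200 gives the ~125x factor over the SHARC.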

IIR Performance

• Hard to obtain P4 performance numbers

• No hardware circular buffers

• Does have BTB, BHT, etc.

• Prefetches ~256 bytes ahead of the current position in code.

FFT Performance

• Hard to obtain P4 performance numbers

• Prime95 uses FFTs to run the Lucas-Lehmer test for Mersenne primes
  – Involves FFT, squaring and iFFT, etc.

• 256k points on P4 2.3GHz ~ 10.517ms

• Compare to SHARC 2048 point FFT ~0.37ms

• If the SHARC could do 256k points: ~46.25ms by linear scaling, 0.37ms x 125 (But…)
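The 46.25 ms figure corresponds to scaling the SHARC time linearly with the number of points, taking 256k as 256,000. A minimal sketch of that scaling (the function name is hypothetical):

```c
/* Scale a measured time linearly with problem size. */
double scale_linear_ms(double t_ms, double n_from, double n_to)
{
    return t_ms * (n_to / n_from);
}
```

`scale_linear_ms(0.37, 2048, 256000)` gives 0.37 x 125 = 46.25 ms. Since FFT work grows as N log N rather than N, linear scaling is likely an optimistic bound for the SHARC.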

Optimization Example

• Hard to optimize Pentium 4 assembly

• Example of multiplying by a constant, 10

• Taken mainly from: www.emulators.com/docs/pentium_1.htm

Multiplying by 10

• Slowest way:
  – IMUL EAX, 10

• Usually optimal way (Visual C++ 6.0)
  – LEA EAX, [EAX+EAX*4]
  – SHL EAX, 1
  – Shift, add, shift
  – On most x86 processors takes 2 cycles
  – On Pentium MMX and before takes 3 cycles
  – On Pentium 4 takes 6 cycles!

Multiplying by 10

• Optimal for Pentium 4
  – LEA ECX, [EAX + EAX]
  – LEA EAX, [ECX+EAX*8]
  – On most x86 still takes 2 cycles
  – On Pentium 4 takes ~3 cycles (OOE - Ops)
  – But on older processors (Pentium MMX and before) this now takes 4 cycles!

Multiplying by 10

• Best generic case
  – LEA EAX, [EAX + EAX*4]
  – ADD EAX, EAX
  – On most x86 still takes 2 cycles
  – On older processors (Pentium MMX and before) this now takes 3 cycles again
  – On Pentium 4 this takes 4 cycles

• Obviously really hard to optimize
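The four instruction sequences compute the same value and differ only in timing. A sketch of their arithmetic in C, with hypothetical function names and 32-bit unsigned wraparound standing in for EAX:

```c
#include <stdint.h>

/* IMUL EAX, 10 */
uint32_t mul10_imul(uint32_t x) { return x * 10u; }

/* LEA EAX, [EAX+EAX*4] ; SHL EAX, 1   ->  (x + 4x) << 1 */
uint32_t mul10_lea_shl(uint32_t x) { uint32_t t = x + (x << 2); return t << 1; }

/* LEA ECX, [EAX+EAX] ; LEA EAX, [ECX+EAX*8]   ->  2x + 8x */
uint32_t mul10_lea_lea(uint32_t x) { uint32_t c = x + x; return c + (x << 3); }

/* LEA EAX, [EAX+EAX*4] ; ADD EAX, EAX   ->  5x + 5x */
uint32_t mul10_lea_add(uint32_t x) { uint32_t t = x + (x << 2); return t + t; }
```

A compiler is free to pick any of these forms for `x * 10`, so the C source does not determine which sequence runs; the cycle counts on the slides apply to the assembly, not to this sketch.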

REFERENCES

• Intel application note: AP 809 - Real and Complex Filter Using Streaming SIMD Extensions

• graphics from: http://www6.tomshardware.com/cpu/00q4/001120/p4-01.html
