The NetBurst™ micro-architecture delivers a number of new and innovative features, including:
– Hyper-Threading Technology,
– Hyper Pipelined Technology,
– a 533-MHz or 400-MHz system bus,
– an Execution Trace Cache, and
– a Rapid Execution Engine,
as well as a number of enhanced features:
– Advanced Transfer Cache,
– Advanced Dynamic Execution,
– an Enhanced Floating-Point and Multimedia Unit, and
– Streaming SIMD Extensions 2 (SSE2).
• Hyper-Threading Technology supports
multiple software threads on each processor in a system.
• Hyper-Threading Technology, which was pioneered
on Intel's advanced server processors, helps your PC work more efficiently by maximizing processor resources and enabling a single processor to run two separate threads of software simultaneously.
• The result is greater performance and system responsiveness when running multiple applications at once.
• So you can multitask like never before.
• To carry the Intel® Pentium® 4 Processor with HT Technology logo, systems must have:
– an Intel Pentium 4 processor at 3.06 GHz or higher,
– an Intel® E7205 chipset that supports HT Technology,
– a system BIOS that supports HT Technology and has it enabled, and
– an operating system that includes optimizations for HT Technology:
• Windows XP Professional Edition
• Windows XP Home Edition
• Hyper-Threading Technology enables multi-
threaded software applications to execute threads in parallel.
• In the past, to improve performance, threading was enabled in software by splitting instructions into multiple streams so that multiple processors could act upon them.
• Today with Hyper-Threading Technology, processor-level threading can be utilized which offers more efficient use of processor resources for greater parallelism and improved performance on today's multi-threaded software.
• Hyper-Threading Technology is a form of simultaneous multi-threading (SMT), in which multiple software threads run simultaneously on one processor.
• This is achieved by duplicating the architectural state on each processor, while sharing one set of processor execution resources.
• Hyper-Threading Technology also delivers faster response times for multi-tasking workload environments.
• By allowing the processor to use on-die resources that would otherwise have been idle, Hyper-Threading Technology provides a performance boost on multi-threading and multi-tasking operations for the Intel® NetBurst™ microarchitecture.
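As a rough illustration of the software model Hyper-Threading exposes (the OS simply sees two logical processors and schedules independent threads onto them), here is a minimal Python sketch of two software threads running concurrently. The thread names and workloads are invented, and Python threads illustrate the programming model only, not hardware SMT:

```python
import threading

# Two software threads, the model Hyper-Threading exposes to the OS.
# (Names and workloads are invented for illustration.)
results = {}

def worker(name, n):
    # each thread runs its own independent instruction stream
    results[name] = sum(range(n))

t1 = threading.Thread(target=worker, args=("thread_a", 1000))
t2 = threading.Thread(target=worker, args=("thread_b", 2000))
t1.start(); t2.start()
t1.join(); t2.join()
print(results["thread_a"], results["thread_b"])  # 499500 1999000
```

On an HT-enabled processor the OS can dispatch the two threads to the two logical processors, which then share one set of execution resources.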
Hyper Pipelined Technology
• The hyper-pipelined technology of the NetBurst™
micro-architecture doubles the pipeline depth compared to the P6 micro-architecture used on today's Pentium® III processors.
• One of the key pipelines, the branch prediction / recovery pipeline, is implemented in 20 stages in the NetBurst™ micro-architecture, compared to 10 stages in the P6 micro-architecture.
• This technology significantly increases the performance, frequency, and scalability of the processor.
533-MHz System Bus
• The Pentium 4 processor's 533-MHz system bus supports Intel's highest-performance desktop processor by delivering 4.2 GB of data per second into and out of the processor.
This is accomplished through a physical signaling scheme of quad pumping the data transfers over a 133-MHz clocked system bus and a buffering scheme allowing for sustained 533-MHz data transfers.
This compares to 1.06 GB/s delivered on the Pentium III processor's 133-MHz system bus.
400 MHz System Bus
• The Intel® Pentium® 4 processor supports Intel's
highest performance desktop system bus by delivering 3.2 GB of data per second into and out of the processor.
• This is accomplished through a physical signaling scheme of quad pumping the data transfers over a 100-MHz clocked system bus and a buffering scheme allowing for sustained 400-MHz data transfers.
• This compares to 1.06 GB/s delivered on the Pentium® III processor's 133-MHz system bus.
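The bandwidth figures quoted above follow directly from the bus clock, the quad-pumped transfer scheme, and the 64-bit (8-byte) width of the front-side bus; a small sketch of the arithmetic (the function name is ours):

```python
def bus_bandwidth_gb_per_s(clock_mhz, transfers_per_clock=4, width_bytes=8):
    # quad-pumped 64-bit front-side bus:
    # bytes/s = clock rate * transfers per clock * bus width
    return clock_mhz * 1e6 * transfers_per_clock * width_bytes / 1e9

print(round(bus_bandwidth_gb_per_s(133), 2))  # 4.26  ("4.2 GB/s", 533-MHz bus)
print(round(bus_bandwidth_gb_per_s(100), 2))  # 3.2   (400-MHz bus)
# Pentium III: 133-MHz bus, one transfer per clock
print(round(bus_bandwidth_gb_per_s(133, transfers_per_clock=1), 2))  # 1.06
```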
Level 1 Execution Trace Cache
• In addition to the 8-KB data cache, the Intel® Pentium® 4 processor includes an Execution Trace Cache that stores up to 12 K decoded micro-ops in the order of program execution.
• This increases performance by removing the decoder from the main execution loop, and it makes more efficient use of the cache storage space, since instructions that are branched around are not stored.
• The result is a means to deliver a high volume of instructions to the processor's execution units and a reduction in the overall time required to recover from mispredicted branches.
Rapid Execution Engine
• Two Arithmetic Logic Units (ALUs) on the Intel®
Pentium® 4 processor are clocked at twice the core processor frequency.
• This allows basic integer instructions such as Add, Subtract, Logical AND, Logical OR, etc. to execute in ½ a clock cycle.
• For example, the Rapid Execution Engine on a 2.20 GHz Intel® Pentium® 4 processor runs at 4.4 GHz.
512-KB or 256-KB Level 2 Advanced Transfer Cache
• The 512-KB L2 Advanced Transfer Cache (ATC) is available at speeds of 2.20 GHz and above, and
• the 256-KB L2 ATC is available at speeds of 1.30 GHz to 2 GHz.
• The Level 2 ATC delivers a much higher data throughput channel between the Level 2 cache and the processor core.
• The Advanced Transfer Cache consists of a 256-bit (32-byte) interface that transfers data on each core clock.
• As a result, the Intel® Pentium® 4 processor at 2.20 GHz can deliver a data transfer rate of 70 GB/s. This compares to a transfer rate of 16 GB/s on the Pentium® III processor at 1 GHz.
• Features of the ATC include:
Non-Blocking, full speed, on-die level 2 cache
8-way set associativity
256-bit data bus to the level 2 cache
Data clocked into and out of the cache every clock cycle
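The 70-GB/s and 16-GB/s figures above can be reproduced from the interface width and the core clock; a quick sketch (the 16-byte Pentium III L2 width is inferred from the document's own numbers, and the function name is ours):

```python
def l2_bandwidth_gb_per_s(core_ghz, width_bytes=32):
    # a 256-bit (32-byte) interface transferring data on every core clock:
    # GB/s = core clock in GHz * bytes per clock
    return core_ghz * width_bytes

print(round(l2_bandwidth_gb_per_s(2.20), 1))                  # 70.4 ("70 GB/s")
print(round(l2_bandwidth_gb_per_s(1.0, width_bytes=16), 1))   # 16.0 (Pentium III at 1 GHz)
```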
Advanced Dynamic Execution
• The Advanced Dynamic Execution engine is a very deep, out-of-order, speculative execution engine that keeps the execution units busy executing instructions.
• The Intel® Pentium® 4 processor can keep up to 126 instructions in flight and handle up to 48 loads and 24 stores in the pipeline.
• It also includes an enhanced branch prediction algorithm that has the net effect of reducing the number of branch mispredictions by about 33% over the P6 generation processor's branch prediction capability.
• It does this by implementing a 4 KB branch target buffer that stores more detail on the history of past branches, as well as by implementing a more advanced branch prediction algorithm.
Enhanced Floating-point and Multi-media Unit
• The Intel® Pentium® 4 processor expands the floating-point registers to a full 128 bits and adds an additional register for data movement, which improves performance on both floating-point and multimedia applications.
Internet Streaming SIMD Extensions 2 (SSE2)
• With the introduction of SSE2, the NetBurst™ micro-architecture
now extends the SIMD capabilities that MMX technology and SSE technology delivered by adding 144 new instructions.
• These instructions include 128-bit SIMD integer arithmetic and 128-bit SIMD double-precision floating-point operations.
• These new instructions reduce the overall number of instructions required to execute a particular program task and as a result can contribute to an overall performance increase.
• They accelerate a broad range of applications, including video, speech, image and photo processing, encryption, financial, engineering, and scientific applications.
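To make the packed-operation idea concrete, here is a pure-Python sketch of two SSE2-style operations: ADDPD (packed add on two double-precision lanes) and PADDD (packed add on four 32-bit integer lanes, wrapping modulo 2^32). Real SSE2 of course operates on 128-bit hardware registers, not Python lists:

```python
def addpd(a, b):
    # ADDPD: packed double-precision add, two 64-bit lanes per 128-bit register
    return [x + y for x, y in zip(a, b)]

def paddd(a, b):
    # PADDD: packed 32-bit integer add, four lanes, wrapping modulo 2**32
    return [(x + y) & 0xFFFFFFFF for x, y in zip(a, b)]

print(addpd([1.5, 2.5], [0.5, 0.5]))               # [2.0, 3.0]
print(paddd([1, 2, 3, 0xFFFFFFFF], [1, 1, 1, 1]))  # [2, 3, 4, 0]
```

One instruction thus does the work of two (or four) scalar adds, which is where the reduction in instruction count comes from.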
Data Prefetch Logic
• Functionality that anticipates the data needed by an
application and pre-loads it into the Advanced Transfer Cache, further increasing processor and application performance.
Features Used for Testing and Performance / Thermal Monitoring
• Built-in Self Test (BIST) provides single stuck-at fault coverage of
the microcode and large logic arrays, as well as testing of the instruction cache, data cache, Translation Lookaside Buffers (TLBs), and ROMs.
• IEEE 1149.1 Standard Test Access Port and Boundary Scan mechanism enables testing of the Pentium 4 processor and system connections through a standard interface.
• Internal performance counters for performance monitoring and event counting.
• Includes a new Thermal Monitor feature that allows motherboards to be cost effectively designed to expected application power usages rather than theoretical maximums.
• In a pipelined architecture, instruction execution
overlaps. So even though it might take five clock cycles to execute each instruction, there can be five instructions in various stages of execution simultaneously. That way it looks like one instruction completes every clock cycle.
Levels of Cache
• To give you the big picture, here's what a typical caching system looks like:
– L1 cache: accessed at full microprocessor speed (about 10 nanoseconds; 4 KB to 16 KB in size)
– L2 cache: SRAM (around 20 to 30 nanoseconds; 128 KB to 512 KB in size)
– Main memory: DRAM (around 60 nanoseconds; 32 MB to 128 MB in size)
– Hard disk: mechanical, slow (around 12 milliseconds; 1 GB to 10 GB in size)
– Internet: incredibly slow (between 1 second and 3 days; unlimited size)
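The hierarchy above can be turned into an average-memory-access-time estimate. The latencies below come from the list above; the hit rates are illustrative assumptions, not figures from the text:

```python
def amat(levels):
    # levels: list of (hit_rate, access_time_ns) from fastest to slowest;
    # the last level is assumed to always hit.
    time = 0.0
    reach = 1.0  # probability an access gets this far down the hierarchy
    for hit_rate, t in levels:
        time += reach * t          # every access reaching this level pays its latency
        reach *= (1.0 - hit_rate)  # only misses continue to the next level
    return time

# assumed hit rates: 90% in L1, 95% in L2; latencies from the list above
hierarchy = [(0.90, 10), (0.95, 25), (1.00, 60)]
print(round(amat(hierarchy), 2))  # 12.8 ns
```

Even with a fast main memory, most of the benefit comes from the caches absorbing the vast majority of accesses.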
• The 486 is a million-transistor microprocessor (actually, 1.2 million) packaged in 168-pin PGA packaging.
• Its 32-bit registers can take care of more than 95% of the operands in high-level languages.
• The following are the ways the 486 is enhanced in comparison to the 386.
• By heavily pipelining the fetching and
execution of instructions, many 486 instructions are executed in only 1 clock cycle instead of in 3 clocks as in the 386.
• Fetching and execution of each instruction is split
into many stages,
• all working in parallel.
• This allows the processing of up to five instructions to be overlapped.
• The 8085 had no pipelining.
• The 8086 was the first of these processors to use pipelining.
• In the 486 the pipeline stage is broken down even further, to 5 stages as follows:
– 1. fetch (prefetch)
– 2. decode 1 (two stage decode)
– 3. decode 2
– 4. execute
– 5. register write-back (the result is written to the destination register, e.g. EAX)
pipelined vs. nonpipelined execution
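The pipelined-vs-nonpipelined difference is easy to quantify: with S stages, a nonpipelined machine needs S clocks per instruction, while a pipeline needs S clocks to fill and then retires one instruction per clock. A small sketch:

```python
def nonpipelined_cycles(n_instr, stages=5):
    # each instruction occupies the whole datapath for all of its stages
    return n_instr * stages

def pipelined_cycles(n_instr, stages=5):
    # fill the pipeline once, then one instruction completes per clock
    return stages + (n_instr - 1)

print(nonpipelined_cycles(100))  # 500 clocks
print(pipelined_cycles(100))     # 104 clocks
```

For long instruction streams the pipelined machine approaches one instruction per clock, which is exactly the effect described above.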
• By putting 8K bytes of cache with the core of the CPU all on a single
chip, the 486 eliminates the interchip delay of external cache.
• In other words, while in the 386 the cache is external, the 486 has 8K bytes of on-chip cache to store both code and data.
• Although the 486 has 8K bytes of on-chip cache, 128K to 256K bytes of off-chip cache are also present in many systems.
• Off-chip (level-two) cache is commonly referred to as secondary cache, while on-chip cache is called first-level cache.
• The 8K on-chip cache of the 486 has 2-way set associative organization and is used for storing both data and code. It uses the write-through policy for updating main memory.
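Given the 8-KB size, 2-way set associativity, and the 486's 16-byte line size, an address splits into a tag, a set index, and a byte offset as sketched below (the function name is ours):

```python
CACHE_BYTES = 8 * 1024
WAYS = 2
LINE_BYTES = 16  # the 486 uses 16-byte cache lines
SETS = CACHE_BYTES // (WAYS * LINE_BYTES)  # 8192 / 32 = 256 sets

def decompose(addr):
    # address -> (tag, set index, byte offset) for this 2-way cache
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % SETS
    tag = addr // (LINE_BYTES * SETS)
    return tag, index, offset

print(SETS)                # 256
print(decompose(0x12345))  # (18, 52, 5)
```

Each set holds two lines (2-way), so a lookup compares the tag against at most two stored tags; with write-through, every store also goes out to main memory.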
a math coprocessor on the same chip
• This reduces the interchip delay associated with a multichip system such as the 386 and 387, but
• at the same time it made the cost of an 80486 high compared to a 386, since the 80486 is in reality two chips in one:
– the main CPU and
– the math coprocessor.
The Burst Cycle
• The memory cycle time of the 486 with the normal zero
wait states is 2 clocks.
• In other words, it takes a minimum of 2 clocks to read from or write to external memory or I/O.
• In this regard, the 486 is like the 386.
• To increase the bus performance of the 486, Intel provides an additional option of implementing what is called a burst cycle.
• The 486 has two types of memory cycles, nonburst (which is the same as the 386) and burst mode.
• In the burst cycle, the 486 can perform 4 memory cycles in just 5 clocks.
• The initial read is performed in a normal 2-
clock memory cycle time, but the next three reads are performed each with only one clock.
• Therefore, four reads are performed in only 5 clocks.
• This is commonly referred to as 2-1-1-1 read, which means 2 clocks for the first read and 1 clock for each of the following three reads.
• This is in contrast to the 386, whose timing is 2-2-2-2 for reading 4 double words of aligned data.
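The 2-1-1-1 versus 2-2-2-2 timings reduce to a one-line formula; a small sketch:

```python
def read_cycles(n_reads, first=2, subsequent=1):
    # 2-1-1-1 burst: the first read costs 2 clocks, the rest 1 clock each
    return first + (n_reads - 1) * subsequent

print(read_cycles(4))                # 5 clocks (486 burst, 2-1-1-1)
print(read_cycles(4, subsequent=2))  # 8 clocks (386-style, 2-2-2-2)
```

Four double words in 5 clocks instead of 8 is a 60% improvement in bus throughput for cache-line fills.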
• Intel put 3.1 million transistors on a single
piece of silicon using a 273-pin PGA package to design the next generation of 80x86.
• It is called Pentium instead of 80586.
3 ways to increase the processing power
• Increase the clock frequency of the chip.
• Increase the number of data buses to bring in more information.
• Overlap the execution of more instructions:
– superpipelining and
– superscaling.
• In superpipelining, the process of fetching and executing
instructions is split into many small steps and all are done in parallel.
• In this way the execution of many instructions is overlapped.
• The number of instructions being processed at a given time depends on the number of pipeline stages, commonly termed the pipeline depth.
• Some designers use as many as 8 stages of pipelining.
• One limitation of superpipelining is that the speed of execution is limited by the slowest stage of the pipeline.
• Compare this to making pizza.
• In superscaling, the entire execution unit has
been doubled and each unit has 5 pipeline stages.
• Therefore, in superscalar, there is more than one execution unit and each has many stages, rather than one execution unit with 8 stages as in the case of a superpipelined processor.
• In superscaling we have two ( or even three) execution units.
• Both the on-chip data and code caches are accessed internally by the
CPU core simultaneously.
• However, since there is only one set of address buses, the external cache containing both data and code must be accessed one at a time and not simultaneously.
• Some CPUs, notably RISC processors, use a separate set of address and data pins (buses) for the data and another set of address and data buses for the code section of the program.
• This is called Harvard architecture.
• The Pentium accesses the on-chip code and data caches simultaneously using Harvard architecture, but the secondary (external) off-chip cache is not accessed this way.
• Each of the Pentium's on-chip caches is 2-way set associative.
Pentium uses 8-stage pipeline
• The FPU section of the Pentium uses an 8-
stage pipeline to process instructions, in contrast to the 5-stage pipeline in the integer unit.
• Branch prediction is another new feature of the Pentium.
• The penalty for jumping is very high for a high-performance pipelined microprocessor such as the Pentium.
• For example, in the case of the JNZ instruction, if it jumps, the pipeline must be flushed and refilled with instructions from the target location.
• This takes time. In contrast, the instruction immediately below the JNZ is already in the pipeline and is advancing without delay.
• The Pentium processor has the capability to predict and prefetch code from both possible locations and have them advance through the pipeline without waiting (stalling) for the outcome of the zero flag.
• The ability to predict branches and avoid the branch penalty combined with the instruction pairing can result in a substantial reduction in the clock count for a given program.
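The value of branch prediction can be sketched as an average-cost calculation. The 3-clock flush penalty below is an illustrative assumption for a short pipeline, not a documented Pentium figure:

```python
def avg_branch_cost(accuracy, penalty_clocks, base_clocks=1):
    # a correctly predicted branch costs base_clocks; a misprediction
    # additionally flushes and refills the pipeline (penalty_clocks)
    return base_clocks + (1 - accuracy) * penalty_clocks

# assumed: 3-clock flush penalty; two prediction accuracies compared
print(round(avg_branch_cost(0.80, 3), 2))  # 1.6 clocks per branch
print(round(avg_branch_cost(0.95, 3), 2))  # 1.15 clocks per branch
```

The deeper the pipeline, the larger the flush penalty, which is why prediction accuracy matters even more on deeply pipelined designs like the Pentium 4.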
burst read and burst write cycles
• The Pentium has both burst read and burst write cycles.
• This is in contrast to the 486, which has only the burst read.
• To increase both the internal and external clock frequency of the CPU requires
faster DRAM, high-speed motherboard design, high-speed peripherals, and efficient power management due to a high level of power dissipation.
• As a result, the system is much more expensive.
• To solve this problem, Intel came up with what is called overdrive technology, also referred to as clock doubler and tripler.
• The idea of a clock doubler or tripler is to increase the internal frequency of the CPU while the external frequency remains the same. In this way, the CPU processes code and data internally faster while the motherboard costs remain the same.
• For example, the 486DX2-50 uses an internal frequency of 50 MHz, but the external frequency by which the CPU communicates with memory and peripherals is only 25 MHz.
• This allows the instructions stored in the queue of the 486 to be executed at twice the speed at which they are fetched over the system buses.
Pentium Pro Processor
• Intel used 5.5 million transistors to make the Pentium Pro.
• The first Pentium Pro, introduced in 1995, had a speed of 150 MHz and consumed 23 watts of power at that speed.
• For the first time, Intel also attached level 2 (L2) cache to the Pentium Pro all on a single package but with two separate dies.
• The L2 cache uses over 10 million transistors, depending on its size.
Pentium Pro: internal architecture
• Intel finally yielded to the rise of RISC concepts in
the design of the Pentium Pro.
• In the Pentium Pro, all x86 instructions brought into the CPU are broken down into one or more small and easy to execute instructions. These easily executable instructions are called micro-operations (µops) by Intel.
• This is similar to the concept in RISC except that in RISC
architecture the instruction set is very simple and easy to execute, and the instructions stored in memory are exactly the same as the ones inside the CPU.
• In contrast, Intel had to maintain code compatibility for the Pentium Pro with all previous x86 processors, all the way back to 8086.
• Therefore, Intel had no choice but to convert the x86 instructions produced by the compiler/assembler into micro-operations internally inside the CPU.
• An interesting aspect of converting x86 instructions into micro-ops internally is that it uses what is called triadic instruction formats.
• In triadic instruction format, there are two source registers and one destination register.
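A toy sketch of rewriting an x86 two-operand instruction into a triadic micro-op; the tuple format for the micro-op is invented for illustration:

```python
# x86 two-operand instructions overwrite a source: "add eax, ebx" means
# eax = eax + ebx. A triadic micro-op names two sources and a separate
# destination field, so the destination need not be one of the sources.
def to_triadic(mnemonic, dst_src, src):
    # (op, dest, src1, src2) -- an invented micro-op tuple format
    return (mnemonic, dst_src, dst_src, src)

uop = to_triadic("add", "eax", "ebx")
print(uop)  # ('add', 'eax', 'eax', 'ebx'), i.e. eax <- eax + ebx
```

Separating the destination from the sources is what lets later stages rename registers and reorder micro-ops freely.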
Pentium Pro is both superpipelined and superscalar
• As mentioned above, in the Pentium Pro all x86 instructions are converted into micro-ops with triadic formats before they are processed.
• This conversion allows an increase in the pipeline stages with little difficulty. Intel uses a 12-stage pipeline for the Pentium Pro.
• In contrast to the 5-stage pipeline of the Pentium, each pipe stage of the 12-stage Pentium Pro performs less work, but there are more stages.
• This means that in the Pentium Pro, more instructions can be worked on and finished at a time.
• The Pentium Pro with its 12-stage pipeline is referred to as superpipelined.
• Since it also has multiple execution units capable of working in parallel, it is also superscalar.
• In Pentium architecture, when one of the pipeline
stages is stalled, the prior stages of fetch and decode are also stalled.
• In other words, the fetch stage stops fetching instructions if the execution stage is stalled, due for example to a delay in memory access.
• This dependency of fetch and execution has to be resolved in order to increase CPU performance.
• That is exactly what Intel did in the Pentium Pro; it is called decoupling the fetch and execution phases of the instructions.
• In the Pentium Pro, as x86 instructions are fetched from
memory they are decoded (converted) into a series of micro-ops, or RISC-type instructions, and placed into a pool called the instruction pool.
• This fetch/decode of the instructions is done in the same order as the program was coded by the programmer ( or compiler).
• However, when the micro-ops are placed in the instruction pool they can be executed in any order as long as the data needed is available.
• In other words, if there is no dependency, the instructions are executed out of order, not in the same order as the programmer coded them.
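A toy model of the instruction pool makes this concrete: micro-ops (written here as (dest, src1, src2) triads) issue as soon as their source values are ready, so an independent later instruction can finish ahead of a stalled earlier one. The register and temporary names are illustrative, not real Pentium Pro state:

```python
def schedule(pool, ready):
    # pool: micro-ops in program order as (dest, src1, src2) triads
    # ready: set of values already available; each pass over the pool
    # models one cycle in which every ready micro-op may issue
    order = []
    pool = list(pool)
    while pool:
        issued = [u for u in pool if u[1] in ready and u[2] in ready]
        if not issued:
            raise RuntimeError("deadlock: unmet data dependencies")
        for dest, _, _ in issued:
            order.append(dest)
            ready.add(dest)  # the result becomes available to later micro-ops
        pool = [u for u in pool if u not in issued]
    return order

# program order: t1 <- eax+ebx; t2 <- t1+ecx (depends on t1); t3 <- edx+esi
pool = [("t1", "eax", "ebx"), ("t2", "t1", "ecx"), ("t3", "edx", "esi")]
print(schedule(pool, {"eax", "ebx", "ecx", "edx", "esi"}))  # ['t1', 't3', 't2']
```

Note that t3 completes before t2: it had no dependency, so it executed out of program order, exactly as described above.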
Multimedia Extension (MMX)
• In early 1997 Intel introduced a series of Pentium chips with limited DSP capability.
• DSP (Digital Signal Processing) applications include:
– 2-D and 3-D graphics,
– video and audio compression,
– PC-based telephony, and
– image processing.
MMX uses floating-point registers
• MMX uses the floating-point registers by aliasing, meaning that the same physical register has different names.