Intel Microarchitecture: Nehalem

Corti Carlo 729868

Outline

Trends

Core Architecture

Innovative Technologies

Microarchitecture Enhancement

Westmere

Trends

Past
Every improvement was considered
No consideration of power cost

Nowadays
Limited resources
Increasing energy cost
Strict power/performance efficiency threshold
“Do more with the same” philosophy

Core Architecture

Released in November 2008
Based on 45 nm technology, exploiting several innovations

Quad-core architecture
Integrated memory controller
Multi-chip package with integrated graphics processor
New point-to-point processor interconnect
Incorporates the SSE4.2 SIMD instructions
Modular block of components
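As one concrete illustration of what SSE4.2 brings, the extension includes a CRC32 instruction that accumulates a checksum over the Castagnoli polynomial (CRC-32C). The sketch below is a plain software reference for that checksum, not the hardware path. The function name `crc32c` is hypothetical, and the initial and final bit inversion is the usual CRC-32C convention that software applies around the instruction, assumed here for illustration.

```python
# Reflected form of the Castagnoli polynomial used by CRC-32C.
CASTAGNOLI_REFLECTED = 0x82F63B78


def crc32c(data: bytes) -> int:
    """Bit-by-bit CRC-32C reference (slow, for illustration only)."""
    crc = 0xFFFFFFFF  # conventional initial value (assumption, see lead-in)
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial when that bit was 1.
            crc = (crc >> 1) ^ (CASTAGNOLI_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF  # conventional final inversion


print(hex(crc32c(b"123456789")))  # 0xe3069283, the standard CRC-32C check value
```

A hardware-accelerated version would process 1, 2, 4, or 8 bytes per instruction instead of one bit per loop iteration.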

Innovative Technologies

Intel Turbo Boost
Delivers extra performance
Activated by the operating system
Constraints (number of active cores, temperature, current)

Intel Hyper-Threading
Simultaneous multithreading (2 threads per core)
Optimal use of every clock cycle

Intel Intelligent Power
Integrated power gates
Automated low-power states
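The Turbo Boost constraints listed above (active cores, temperature, current) can be pictured with a toy model. This is only a sketch under stated assumptions: the function name, the per-core bin table, and every limit value are hypothetical, and the 133 MHz step is assumed to match the base clock increment. Real parts apply model-specific tables in hardware.

```python
# Size of one frequency step ("bin") in MHz. Assumption: ~133 MHz base clock.
BIN_MHZ = 133


def turbo_frequency(base_mhz, active_cores, temp_c, current_a,
                    bins_by_active_cores, max_temp_c, max_current_a):
    """Return the allowed core frequency in MHz under a toy Turbo Boost model."""
    # Any exceeded limit cancels the boost and falls back to the base clock.
    if temp_c >= max_temp_c or current_a >= max_current_a:
        return base_mhz
    # Fewer active cores leave more headroom, so they may receive more bins.
    bins = bins_by_active_cores.get(active_cores, 0)
    return base_mhz + bins * BIN_MHZ


# Hypothetical part: 2660 MHz base, 2 bins with one active core, 1 bin otherwise.
example_bins = {1: 2, 2: 1, 3: 1, 4: 1}
print(turbo_frequency(2660, 1, 60, 50, example_bins, 90, 100))  # 2926
print(turbo_frequency(2660, 1, 95, 50, example_bins, 90, 100))  # 2660
```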

Microarchitecture Enhancement (1)

New three-level cache hierarchy

L1 cache (32 KB instruction, 32 KB data)

L2 cache per core (256 KB)

Fully inclusive and fully shared 8MB L3 cache
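As a rough illustration of the sizes listed above, the following sketch adds up the on-die cache of a quad-core part. The 32 KB, 256 KB, and 8 MB figures come from the slide; the function name and the default of four cores are assumptions for illustration. Because the L3 is inclusive, every line held in a core's L1 or L2 is also present in L3, so the amount of distinct cached data is bounded by the L3 size.

```python
KB = 1024
MB = 1024 * KB


def cache_totals(cores=4, l1i=32 * KB, l1d=32 * KB, l2=256 * KB, l3=8 * MB):
    """Summarize private and shared cache capacity in bytes."""
    private = cores * (l1i + l1d + l2)  # per-core L1I + L1D + L2, all cores
    return {
        "private_bytes": private,
        "shared_l3_bytes": l3,
        # Inclusive L3: private caches only duplicate lines already in L3.
        "max_distinct_bytes": l3,
    }


totals = cache_totals()
print(totals["private_bytes"] // KB)  # 1280 (KB of private L1 + L2)
```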

The Nehalem microarchitecture implements the MESIF cache coherency protocol, an extended version of the well-known MESI protocol [5, p. 213]. Due to the novelty of this microarchitecture, we can only refer to a very limited number of publications that are relevant for our test system. Some information can be gathered from Intel documents [6], [7]. However, none of them describe the architecture in much detail.

We use BenchIT [8] to develop and run our memory benchmarks as well as for the results evaluation. This performance measurement suite is designed to run microbenchmarks on every POSIX 1.003 compliant system in a user-friendly way. It helps to compare different algorithms, implementations of algorithms, properties of the software stack, and hardware details of whole systems. The software is available as Open Source.

III. SYSTEM ARCHITECTURE

Previous generation quad-core Xeon processors (Harpertown) are composed of two dual-core dies each with a shared L2 cache. In contrast, the Xeon 5500 series processors (Nehalem-EP) are a native quad-core design. Similar to quad-core AMD Opteron processors (Shanghai), the L1 and L2 caches are implemented per core, while the L3 cache is shared among all cores of one processor. The Front Side Bus used in previous Intel CPUs is replaced by point-to-point links called Quick Path Interconnect (QPI). Moreover, each processor contains its own integrated memory controller (IMC). The basic design of a two-socket Nehalem system is depicted in Figure 1.

The Intel Nehalem microarchitecture supports simultaneous multithreading (SMT) that allows each core to execute two threads in parallel. This technique is well-known from the Pentium 4 processors based on Intel’s Netburst microarchitecture. Furthermore, processors based on the Nehalem microarchitecture feature a dynamic overclocking mechanism (Intel Turbo Boost Technology) that allows the processor to raise core frequencies as long as the thermal limit is not exceeded. Table I shows the key differences between the Nehalem microarchitecture and other common x86_64 server CPUs.

Figure 1. System overview

Although the basic structure of the memory hierarchy is similar for Nehalem and Shanghai based processors, the implementation details differ. While AMD processors use a “non-inclusive” L3 cache, Intel implements an inclusive last level cache. “Core valid bits” within the L3 cache indicate that a cache line may be present in a certain core. If a bit is not set, the associated core certainly does not hold a copy of the cache line, thus reducing snoop traffic to that core. However, unmodified cache lines may be evicted from a core’s cache without notification of the L3 cache. Therefore, a set core valid bit does not guarantee the presence of the cache line in a higher level cache. Generally speaking, the shared last level cache with its core valid bits has the potential to strongly improve the performance of on-chip data transfers between cores while filtering most unnecessary snoop traffic.
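The core valid bits described above act as a conservative snoop filter. A minimal sketch, assuming one bit per core stored as an integer bitmask; the class and method names are hypothetical and do not reflect Intel's actual implementation:

```python
class L3Line:
    """Toy model of one inclusive L3 cache line with per-core valid bits."""

    def __init__(self, num_cores=4):
        self.num_cores = num_cores
        self.core_valid = 0  # bit c set => core c MAY hold a copy

    def core_fetched(self, core):
        # The L3 sees every fill into a core, so it can set the bit reliably.
        self.core_valid |= 1 << core

    def core_silently_evicted(self, core):
        # Clean evictions are not reported to the L3, so the bit stays set.
        pass

    def cores_to_snoop(self, requester):
        # A cleared bit proves absence; only cores with a set bit are snooped.
        return [c for c in range(self.num_cores)
                if c != requester and (self.core_valid >> c) & 1]


line = L3Line()
line.core_fetched(2)
print(line.cores_to_snoop(0))  # [2]
line.core_silently_evicted(2)
print(line.cores_to_snoop(0))  # [2]  (stale bit: the snoop is sent but misses)
```

The asymmetry is the point of the design: a stale set bit costs one unnecessary snoop, while a cleared bit safely removes a snoop altogether.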

Nehalem is the first microarchitecture that uses the MESIF cache coherency protocol. It extends the MESI protocol used in previous Xeon generations by a fifth state called forwarding. This state allows unmodified data that is shared by two processors to be forwarded to a third one. We therefore expect the MESIF improvements to be limited to systems with more than two processors. The benchmark results of our dual-processor test system configuration should not be influenced.
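The role of the forwarding state can be sketched with a toy model of read misses on a single cache line. This is a simplification under stated assumptions, not the full protocol: only clean lines are modeled (no Modified handling, no writes), the function name `read_miss` is hypothetical, and the rule that the newest reader takes over the F state follows the commonly described MESIF behavior.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"
    F = "Forward"


def read_miss(states, requester):
    """Serve a read miss for one line; return who supplies the data."""
    supplier = "memory"
    for cpu, state in enumerate(states):
        # A single clean owner (E or F) answers with a cache-to-cache transfer.
        if cpu != requester and state in (State.E, State.F):
            supplier = f"cpu{cpu}"
            states[cpu] = State.S  # the old owner keeps a plain shared copy
            break
    others_hold = any(s is not State.I
                      for cpu, s in enumerate(states) if cpu != requester)
    # The newest reader becomes the designated forwarder if the line is shared.
    states[requester] = State.F if others_hold else State.E
    return supplier


states = [State.I, State.I, State.I]
print(read_miss(states, 0))  # memory  -> [E, I, I]
print(read_miss(states, 1))  # cpu0    -> [S, F, I]
print(read_miss(states, 2))  # cpu1    -> [S, S, F]
```

The third read is the case the text refers to: with plain MESI, two Shared copies exist but neither is designated to respond, whereas here exactly one cache (the F holder) forwards the line to the third processor.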

Table I
COMPARISON OF DIFFERENT X86_64 MICROARCHITECTURES

Processor                               AMD Opteron 238*   Intel Xeon 54**   Intel Xeon 55**
Microarchitecture                       Shanghai           Harpertown        Nehalem-EP
Cache organization                      non-inclusive      inclusive         inclusive
Cache coherency protocol                MOESI              MESI              MESIF
Shared last level cache                 yes                no                yes
Integrated memory controller            yes                no                yes
Point-to-point processor interconnect   yes                no                yes
Native quad-core design                 yes                no                yes



Microarchitecture Enhancement (2)

Instructions per cycle (IPC) improvements
Increased size of the out-of-order window and scheduler
Increased size of the other buffers in the core
Faster synchronization primitives
Improved hardware prefetch and better load-store scheduling

Enhanced branch prediction
Faster handling of branch mispredictions
New second-level branch target buffer
New renamed return stack buffer

Westmere

Intel Nehalem microarchitecture processors migrated to 32 nm.

Key features
Processors with 6 cores supporting 12 threads
Smaller processor core size
New multi-chip package with graphics integrated in processors
Integrated, discrete/switchable graphics support
Advanced Encryption Standard (AES) acceleration
Integrated memory controller

Performance Evaluation

2x multithreaded performance (at the same power)

30% lower power usage for the same performance

15–20% increase in performance per core

50% lower atomic operation latency
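The first two figures above can be read as performance-per-watt gains relative to whatever baseline the slide compares against. A small sketch of that arithmetic; the function name is hypothetical and the ratios are taken directly from the bullet points:

```python
def perf_per_watt_gain(perf_ratio, power_ratio):
    """Relative gain in performance per watt versus the baseline."""
    return perf_ratio / power_ratio


# 2x performance at the same power
print(perf_per_watt_gain(2.0, 1.0))            # 2.0
# Same performance at 30% lower power
print(round(perf_per_watt_gain(1.0, 0.7), 2))  # 1.43
```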

References

“Intel Core i7-800 Processor Series and the Intel Core i5-700 Processor Series Based on Intel Microarchitecture (Nehalem)”, Intel Corporation

“Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System”, Daniel Molka, Daniel Hackenberg, Robert Schöne and Matthias S. Müller

“32nm Westmere Family of Processor”, Stephen L. Smith

Wikipedia

Thanks for your attention!
