Next Generation Intel® Microarchitecture (Nehalem) Family: Architectural Insights and Power Management
Steve Gunther, Ronak Singhal
Intel® Core™ microarchitecture (Nehalem) Architects
Intel Corporation
NGMS001
Agenda
• Intel® Core™ Microarchitecture (Nehalem) – Ronak Singhal
  – Intel Core microarchitecture (Nehalem) Design Philosophy
  – Enhanced Processor Core Description
  – New Cache Hierarchy and Platform
  – Virtualization
  – New Instructions
• Intel Core Microarchitecture (Nehalem) power management – Steve Gunther
  – Intel Core microarchitecture (Nehalem) power management overview
  – Minimizing idle power consumption
  – Performance when you need it
Tick-Tock Development Model
• Merom¹: NEW Microarchitecture, 65nm (TOCK)
• Penryn: NEW Process, 45nm (TICK)
• Nehalem: NEW Microarchitecture, 45nm (TOCK)
• Westmere: NEW Process, 32nm (TICK)
• Sandy Bridge: NEW Microarchitecture, 32nm (TOCK)
Forecast
All dates, product descriptions, availability and plans are forecasts and subject to change without notice.
¹ Intel® Core™ microarchitecture (formerly Merom); 45nm next generation Intel® Core™ microarchitecture (Penryn); Intel® Core™ Microarchitecture (Nehalem); Intel® Microarchitecture (Westmere); Intel® Microarchitecture (Sandy Bridge)
Intel® Core™ Microarchitecture (Nehalem) Design Goals
World class performance combined with superior energy efficiency. Optimized for:
• Single Thread and Multi-threads
• Existing Apps and Emerging Apps
• All Usages
• Workstation / Server, Desktop / Mobile
Dynamically scaled performance when needed to maximize energy efficiency
A single, scalable foundation optimized across each segment and power envelope
A Dynamic and Design Scalable Microarchitecture
Scalable Cores
• Same core for all segments
• Common software optimization
• Common feature set
Intel® Core™ Microarchitecture (Nehalem), 45nm
• Servers/Workstations: Energy Efficiency, Performance, Virtualization, Reliability, Capacity, Scalability
• Desktop: Performance, Graphics, Energy Efficiency, Idle Power, Security
• Mobile: Battery Life, Performance, Energy Efficiency, Graphics, Security
Optimized cores to meet all market segments
The First Intel® Core™ Microarchitecture (Nehalem) Processor
[Die diagram blocks: Memory Controller; Misc IO (two blocks); four Cores; Queue; QPI 0; QPI 1; Shared L3 Cache]
QPI: Intel® QuickPath Interconnect (Intel® QPI)
A Modular Design for Flexibility
Intel® Core™ Microarchitecture Recap
• Wide Dynamic Execution
  – 4-wide decode/rename/retire
• Advanced Digital Media Boost
  – 128-bit wide SSE execution units
• Intel HD Boost
  – New SSE4.1 Instructions
• Smart Memory Access
  – Memory Disambiguation
  – Hardware Prefetching
• Advanced Smart Cache
  – Low latency, high BW shared L2 cache
Nehalem builds on the great Core microarchitecture
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design Philosophy
• Enhanced Processor Core
  – Performance Features
  – Intel® Hyper-Threading Technology
• New Platform
  – New Cache Hierarchy
  – New Platform Architecture
• Performance Acceleration
  – Virtualization
  – New Instructions
Designed for Performance
[Core diagram blocks: Execution Units; L2 Cache & Interrupt Servicing; L1 Data Cache; Memory Ordering & Execution; Paging; Branch Prediction; Out-of-Order Scheduling & Retirement; Instruction Fetch & L1 Cache; Instruction Decode & Microcode]
Enhancements called out on the diagram:
• Additional Caching Hierarchy
• New SSE4.2 Instructions
• Improved Lock Support
• Deeper Buffers
• Improved Loop Streaming
• Faster Virtualization
• Simultaneous Multi-Threading
• Better Branch Prediction
Enhanced Processor Core
[Block diagram; width labels 4, 4 and 6 appear on the connections]
• Front End: Instruction Fetch and Pre Decode; ITLB; 32kB Instruction Cache; Instruction Queue; Decode
• Execution Engine: Rename/Allocate; Retirement Unit (ReOrder Buffer); Reservation Station; Execution Units
• Memory: DTLB; 32kB Data Cache; 2nd Level TLB; 256kB 2nd Level Cache; L3 and beyond
Front-end
• Responsible for feeding the compute engine
  – Decode instructions
  – Branch Prediction
• Key Intel® Core™2 microarchitecture Features
  – 4-wide decode
  – Macrofusion
  – Loop Stream Detector
[Diagram blocks: Instruction Fetch and Pre Decode; ITLB; 32kB Instruction Cache; Instruction Queue; Decode]
Macrofusion
• Introduced in Intel® Core™2 microarchitecture
• TEST/CMP instruction followed by a conditional branch treated as a single instruction
  – Decode/execute/retire as one instruction
• Higher performance & improved power efficiency
  – Improves throughput/Reduces execution latency
  – Less processing required to accomplish the same work
• Support all the cases in Intel Core 2 microarchitecture PLUS
  – CMP+Jcc macrofusion added for the following branch conditions
    – JL/JNGE
    – JGE/JNL
    – JLE/JNG
    – JG/JNLE
  – Intel® Core™ microarchitecture (Nehalem) supports macrofusion in both 32-bit and 64-bit modes
    – Intel Core2 microarchitecture only supports macrofusion in 32-bit mode
Increased macrofusion benefit on Intel® Core™ microarchitecture (Nehalem)
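The CMP+Jcc rules above can be sketched as a small lookup. This is an illustrative model only: the function name and the `uarch` labels are hypothetical, TEST is left out, and the Core 2 baseline set (jumps that test only the carry or zero flag) is an assumption recalled from Intel's optimization guidance, not something stated on the slide.

```python
# Jcc mnemonics (with aliases) the slide lists as newly fusible with CMP on Nehalem.
NEHALEM_NEW_CMP_JCC = {"JL", "JNGE", "JGE", "JNL", "JLE", "JNG", "JG", "JNLE"}

# ASSUMPTION (not on the slide): CMP+Jcc pairs already fusible on Core 2,
# i.e. jumps that depend only on the carry and/or zero flag.
CORE2_CMP_JCC = {
    "JA", "JNBE", "JAE", "JNB", "JNC",
    "JB", "JNAE", "JC", "JBE", "JNA",
    "JE", "JZ", "JNE", "JNZ",
}


def can_macrofuse_cmp(jcc, uarch, mode_bits):
    """Return True if CMP followed by `jcc` is modeled as fusing.

    uarch is "core2" or "nehalem" (hypothetical labels); mode_bits is 32 or 64.
    """
    jcc = jcc.upper()
    if uarch == "core2":
        # Per the slide: Core 2 only fuses in 32-bit mode.
        return mode_bits == 32 and jcc in CORE2_CMP_JCC
    if uarch == "nehalem":
        # Per the slide: all Core 2 cases plus the signed comparisons, in both modes.
        return mode_bits in (32, 64) and jcc in (CORE2_CMP_JCC | NEHALEM_NEW_CMP_JCC)
    raise ValueError(f"unknown microarchitecture label: {uarch}")
```

Under this model a signed loop test such as CMP followed by JL fuses on Nehalem in 64-bit code but not on Core 2.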
Loop Stream Detector Reminder
• Loops are very common in most software
• Take advantage of knowledge of loops in HW
  – Decoding the same instructions over and over
  – Making the same branch predictions over and over
• Loop Stream Detector identifies software loops
  – Stream from Loop Stream Detector instead of normal path
  – Disable unneeded blocks of logic for power savings
  – Higher performance by removing instruction fetch limitations
Intel® Core™2 Loop Stream Detector
[Diagram blocks: Branch Prediction; Fetch; Decode; Loop Stream Detector (18 Instructions)]
Intel® Core™ Microarchitecture (Nehalem) Loop Stream Detector
• Same concept as in prior implementations
• Higher performance: Expand the size of the loops detected
• Improved power efficiency: Disable even more logic
[Diagram blocks: Branch Prediction; Fetch; Decode; Loop Stream Detector (28 Micro-Ops)]
Branch Prediction Improvements
• Focus on improving branch prediction accuracy each CPU generation
  – Higher performance & lower power through more accurate prediction
• Example Intel® Core™ microarchitecture (Nehalem) improvements
  – L2 Branch Predictor
    – Improve accuracy for applications with large code size (ex. database applications)
  – Advanced Renamed Return Stack Buffer (RSB)
    – Remove branch mispredicts on x86 RET instruction (function returns) in the common case
Greater Performance through Branch Prediction
Execution Engine
• Responsible for:
  – Scheduling operations
  – Executing operations
• Powerful Intel® Core™2 microarchitecture execution engine
  – Dynamic 4-wide Execution
  – Intel® Advanced Digital Media Boost
    – 128-bit wide SSE
  – Super Shuffler (45nm next generation Intel® Core™ microarchitecture (Penryn))
Execution Unit Overview
Execute 6 operations/cycle
• 3 Memory Operations
  – 1 Load
  – 1 Store Address
  – 1 Store Data
• 3 “Computational” Operations
Unified Reservation Station
• Schedules operations to Execution units
• Single Scheduler for all Execution Units
• Can be used by all integer, all FP, etc.
[Diagram: Unified Reservation Station feeding Ports 0 to 5]
• Port 0: Integer ALU & Shift; FP Multiply; Divide; SSE Integer ALU, Integer Shuffles
• Port 1: Integer ALU & LEA; FP Add; Complex Integer; SSE Integer Multiply
• Port 2: Load
• Port 3: Store Address
• Port 4: Store Data
• Port 5: Integer ALU & Shift; Branch; FP Shuffle; SSE Integer ALU, Integer Shuffles
Increased Parallelism
• Goal: Keep powerful execution engine fed
• Nehalem increases size of out of order window by 33%
• Must also increase other corresponding structures
[Chart: Concurrent uOps Possible¹; y-axis 0 to 128; bars for Dothan, Merom, Nehalem]
Structure | Intel® Core™ microarchitecture (formerly Merom) | Intel® Core™ microarchitecture (Nehalem) | Comment
Reservation Station | 32 | 36 | Dispatches operations to execution units
Load Buffers | 32 | 48 | Tracks all load operations allocated
Store Buffers | 20 | 32 | Tracks all store operations allocated
Increased Resources for Higher Performance
¹ Intel® Pentium® M processor (formerly Dothan); Intel® Core™ microarchitecture (formerly Merom); Intel® Core™ microarchitecture (Nehalem)
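As a quick check on the scaling claims, the growth of each structure can be computed from the table. The reorder-window sizes of 96 and 128 entries are an assumption here: both appear as tick values on the chart axis and are consistent with the stated 33%, but the slide text does not give them explicitly.

```python
# (Merom entries, Nehalem entries), taken from the slide table.
STRUCTURES = {
    "Reservation Station": (32, 36),
    "Load Buffers": (32, 48),
    "Store Buffers": (20, 32),
}

# ASSUMPTION: out-of-order window of 96 -> 128 entries (chart tick values,
# consistent with the "33%" bullet but not stated in the text).
OOO_WINDOW = (96, 128)


def growth_percent(old, new):
    """Percentage increase from old to new."""
    return (new - old) * 100.0 / old


if __name__ == "__main__":
    print(f"Out-of-order window: {growth_percent(*OOO_WINDOW):.1f}%")
    for name, (merom, nehalem) in STRUCTURES.items():
        print(f"{name}: {growth_percent(merom, nehalem):.1f}%")
```

Under these figures the load and store buffers grow proportionally more than the window itself.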
Enhanced Memory Subsystem
• Responsible for:
  – Handling of memory operations (loads/stores)
• Key Intel® Core™2 Features
  – Memory Disambiguation
  – Hardware Prefetchers
  – Advanced Smart Cache
• New Intel® Core™ Microarchitecture (Nehalem) Features
  – New TLB Hierarchy
  – Fast 16-Byte unaligned accesses
  – Faster Synchronization Primitives
New TLB Hierarchy
• Problem: Applications continue to grow in data size
• Need to increase TLB size to keep the pace for performance
• Nehalem adds new low-latency unified 2nd level TLB
TLB | # of Entries
1st Level Instruction TLBs
  Small Page (4k) | 128
  Large Page (2M/4M) | 7 per thread
1st Level Data TLBs
  Small Page (4k) | 64
  Large Page (2M/4M) | 32
New 2nd Level Unified TLB
  Small Page Only | 512
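One way to read the entry counts is as "TLB reach", the amount of memory that can be mapped without a TLB miss (entries times page size). The sketch below only multiplies the numbers from the table; the helper name is illustrative.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_size_bytes):
    """Bytes of address space covered when every entry maps a distinct page."""
    return entries * page_size_bytes


if __name__ == "__main__":
    print("L1 DTLB, 4k pages:", tlb_reach(64, 4 * KIB) // KIB, "KiB")
    print("L1 DTLB, 2M pages:", tlb_reach(32, 2 * MIB) // MIB, "MiB")
    print("L2 unified TLB, 4k pages:", tlb_reach(512, 4 * KIB) // MIB, "MiB")
```

With 4k pages the new second-level TLB extends the reach from 256 KiB (first-level data TLB) to 2 MiB.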
Fast Unaligned Cache Accesses
• Two flavors of 16-byte SSE loads/stores exist
  – Aligned (MOVAPS/D, MOVDQA) -- Must be aligned on a 16-byte boundary
  – Unaligned (MOVUPS/D, MOVDQU) -- No alignment requirement
• Prior to Intel® Core™ microarchitecture (Nehalem)
  – Optimized for Aligned instructions
  – Unaligned instructions slower, lower throughput -- Even for aligned accesses!
    – Required multiple uops (not energy efficient)
  – Compilers would largely avoid unaligned load
    – 2-instruction sequence (MOVSD+MOVHPD) was faster
• Intel Core microarchitecture (Nehalem) optimizes Unaligned instructions
  – Same speed/throughput as Aligned instructions on aligned accesses
  – Optimizations for making accesses that cross 64-byte boundaries fast
    – Lower latency/higher throughput than Core 2
  – Aligned instructions remain fast
• No reason to use aligned instructions on Intel Core microarchitecture (Nehalem)!
• Benefits:
  – Compiler can now use unaligned instructions without fear
  – Higher performance on key media algorithms
  – More energy efficient than prior implementations
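The two address properties discussed above (16-byte alignment, and whether an access crosses a 64-byte boundary) are simple arithmetic on the address. A minimal sketch, with illustrative helper names:

```python
def is_aligned(addr, size=16):
    """True if addr sits on a size-byte boundary (what MOVAPS/MOVDQA require)."""
    return addr % size == 0


def crosses_boundary(addr, size=16, boundary=64):
    """True if a size-byte access starting at addr spans two boundary-byte blocks."""
    return addr // boundary != (addr + size - 1) // boundary


if __name__ == "__main__":
    for addr in (0x1000, 0x1008, 0x1030, 0x1038):
        print(hex(addr), "aligned:", is_aligned(addr),
              "crosses 64-byte boundary:", crosses_boundary(addr))
```

Only unaligned accesses can cross a 64-byte boundary, since an aligned 16-byte access always fits within one 64-byte block.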
Faster Synchronization Primitives
• Multi-threaded software becoming more prevalent
• Scalability of multi-thread applications can be limited by synchronization
• Synchronization primitives: LOCK prefix, XCHG
• Reduce synchronization latency for legacy software
[Chart: LOCK CMPXCHG Performance¹; y-axis Relative Latency, 0 to 1; bars for Pentium 4, Core 2, Nehalem]
Greater thread scalability with Nehalem
¹ Intel® Pentium® 4 processor; Intel® Core™2 Duo processor; Intel® Core™ microarchitecture (Nehalem)-based processor
Intel® Hyper-Threading Technology
• Also known as Simultaneous Multi-Threading (SMT)
  – Run 2 threads at the same time per core
• Take advantage of 4-wide execution engine
  – Keep it fed with multiple threads
  – Hide latency of a single thread
• Most power efficient performance feature
  – Very low die area cost
  – Can provide significant performance benefit depending on application
  – Much more efficient than adding an entire core
• Intel® Core™ microarchitecture (Nehalem) advantages
  – Larger caches
  – Massive memory BW
[Diagram: execution-unit occupancy over time (proc. cycles), w/o SMT vs. SMT. Note: Each box represents a processor execution unit]
Simultaneous multi-threading enhances performance and energy efficiency
Intel® Core™ Microarchitecture (Nehalem) SMT Implementation Details
Policy | Description | Intel® Core™ Microarchitecture (Nehalem) Examples
Replicated | Duplicate logic per thread | Register State; Renamed RSB; Large Page ITLB
Partitioned | Statically allocated between threads | Load Buffer; Store Buffer; Reorder Buffer; Small Page ITLB
Competitively Shared | Depends on thread’s dynamic behavior | Reservation Station; Caches; Data TLB; 2nd level TLB
Unaware | No SMT impact | Execution units
SMT efficient due to minimal replication of logic
SMT Performance Chart
[Chart: Performance Gain, SMT enabled vs. disabled, on Intel® Core™ i7; y-axis 0% to 40%. Benchmarks shown: Floating Point, 3dsMax*, Integer, Cinebench* 10, POV-Ray* 3.7 beta 25, 3DMark* Vantage* CPU. Data labels: 7%, 10%, 13%, 16%, 29%, 34%]
Floating Point is based on SPECfp_rate_base2006* estimate. Integer is based on SPECint_rate_base2006* estimate.
SPEC, SPECint, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation. For more information on SPEC benchmarks, see: http://www.spec.org
Source: Intel. Configuration: pre-production Intel® Core™ i7 processor with 3 channel DDR3 memory. Performance tests and ratings are measured using specific computer systems and / or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design Philosophy
• Enhanced Processor Core
  – Performance Features
  – Intel® Hyper-Threading Technology
• Feeding the Engine
  – New Memory Hierarchy
  – New Platform Architecture
• Performance Acceleration
  – Virtualization
  – New Instructions
Feeding the Execution Engine
• Powerful 4-wide dynamic execution engine
• Need to keep providing fuel to the execution engine
• Intel® Core™ Microarchitecture (Nehalem) Goals
  – Low latency to retrieve data
    – Keep execution engine fed w/o stalling
  – High data bandwidth
    – Handle requests from multiple cores/threads seamlessly
  – Scalability
    – Design for increasing core counts
• Combination of great cache hierarchy and new platform
Intel® Core™ microarchitecture (Nehalem) designed to feed the execution engine
Designed For Modularity
[Diagram: Cores (Core … Core) on top of the Uncore, which contains the L3 Cache, the IMC (to DRAM), Intel® QPI links (Intel QPI … Intel QPI), and Power & Clock]
Differentiation in the “Uncore”:
• # cores
• Size of cache
• # mem channels
• # QPI Links
• Type of Memory
• Power Management
• Integrated graphics
Intel® QPI: Intel® QuickPath Interconnect (Intel® QPI)
2008 – 2009 Servers & Desktops
Optimal price / performance / energy efficiency for server, desktop and mobile products
Intel® Smart Cache – Core Caches
• New 3-level Cache Hierarchy
• 1st level caches
  – 32kB Instruction cache
  – 32kB, 8-way Data Cache
  – Support more L1 misses in parallel than Intel® Core™2 microarchitecture
• 2nd level Cache
  – New cache introduced in Intel® Core™ microarchitecture (Nehalem)
  – Unified (holds code and data)
  – 256 kB per core (8-way)
  – Performance: Very low latency
    – 10 cycle load-to-use
  – Scalability: As core count increases, reduce pressure on shared cache
[Diagram: Core with 32kB L1 Data Cache, 32kB L1 Inst. Cache, and 256kB L2 Cache]
Intel® Smart Cache – 3rd Level Cache
• Shared across all cores
• Size depends on # of cores
  – Quad-core: Up to 8MB (16-ways)
  – Scalability:
    – Built to vary size with varied core counts
    – Built to easily increase L3 size in future parts
• Perceived latency depends on frequency ratio between core & uncore
• Inclusive cache policy for best performance
  – Address residing in L1/L2 must be present in 3rd level cache
[Diagram: Cores, each with L1 Caches and an L2 Cache, sharing the L3 Cache]
Why Inclusive?
• Inclusive cache provides benefit of an on-die snoop filter
• Core Valid Bits
  – 1 bit per core per cache line
    – If line may be in a core, set core valid bit
    – Snoop only needed if line is in L3 and core valid bit is set
    – Guaranteed that line is not modified if multiple bits set
• Scalability
  – Addition of cores/sockets does not increase snoop traffic seen by cores
• Latency
  – Minimize effective cache latency by eliminating cross-core snoops in the common case
  – Minimize snoop response time for cross-socket cases
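The core valid bit rules can be illustrated with a toy model. This is a sketch of the decision logic as described on the slide, not the hardware design: the class and method names are hypothetical, evictions are not modeled, and the L3 is represented as a dictionary from line address to a bitmask of cores.

```python
class InclusiveL3:
    """Toy model of an inclusive L3 acting as a snoop filter."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.core_valid = {}  # line address -> bitmask of cores that may hold the line

    def fill(self, line, core):
        """Record that `core` fetched `line`; with inclusion the L3 holds it too."""
        self.core_valid[line] = self.core_valid.get(line, 0) | (1 << core)

    def cores_to_snoop(self, line, requester):
        """Cores that must be snooped when `requester` accesses `line`."""
        mask = self.core_valid.get(line)
        if mask is None:
            # L3 miss: by inclusion no core can hold the line, so no snoop is needed.
            return []
        return [c for c in range(self.num_cores)
                if c != requester and (mask >> c) & 1]

    def may_be_modified(self, line):
        """Only a line with exactly one core valid bit set can be modified in a core."""
        return bin(self.core_valid.get(line, 0)).count("1") == 1


if __name__ == "__main__":
    l3 = InclusiveL3(num_cores=4)
    l3.fill(0x40, core=1)
    print(l3.cores_to_snoop(0x40, requester=0))  # only core 1 may hold the line
    print(l3.cores_to_snoop(0x80, requester=0))  # not in L3, so nothing to snoop
```

In the common case where no other core's bit is set, the request is served from the L3 without disturbing any core.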
Hardware Prefetching (HWP)
• HW Prefetching critical to hiding memory latency
• Structure of HWPs similar to that in Intel® Core™2 microarchitecture
  – Algorithmic improvements in Intel® Core™ microarchitecture (Nehalem) for higher performance
• L1 Prefetchers
  – Based on instruction history and/or load address pattern
• L2 Prefetchers
  – Prefetches loads/RFOs/code fetches based on address pattern
  – Intel Core microarchitecture (Nehalem) changes:
    – Efficient Prefetch mechanism
      – Remove the need for Intel® Xeon® processors to disable HWP
    – Increase prefetcher aggressiveness
      – Locks on address streams quicker, adapts to change faster, issues more prefetches more aggressively (when appropriate)
Today’s Platform Architecture
Front-Side Bus Evolution
[Diagram: three generations of front-side bus platforms, each with CPUs connected to an MCH with attached memory and an ICH]
Intel® Core™ Microarchitecture (Nehalem-EP) Platform Architecture
• Integrated Memory Controller
  – 3 DDR3 channels per socket
  – Massive memory bandwidth
  – Memory Bandwidth scales with # of processors
  – Very low memory latency
• Intel® QuickPath Interconnect (Intel® QPI)
  – New point-to-point interconnect
  – Socket to socket connections
  – Socket to chipset connections
  – Build scalable solutions
[Diagram: two Nehalem EP sockets and Tylersburg EP]
Significant performance leap from new platform
Intel® Core™ microarchitecture (Nehalem-EP); Intel® Next Generation Server Processor Technology (Tylersburg-EP)
Intel® QuickPath Interconnect
• Intel® Core™ microarchitecture (Nehalem) introduces new Intel® QuickPath Interconnect (Intel® QPI)
• High bandwidth, low latency point to point interconnect
• Up to 6.4 GT/sec initially
  – 6.4 GT/sec -> 12.8 GB/sec
  – Bi-directional link -> 25.6 GB/sec per link
  – Future implementations at even higher speeds
• Highly scalable for systems with varying # of sockets
[Diagrams: two Nehalem EP sockets with an IOH; four CPUs with attached memory and IOHs]
Intel® Core™ microarchitecture (Nehalem-EP)
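The bandwidth figures follow from the transfer rate. The sketch below assumes 2 bytes of data per transfer in each direction, which is what the slide's own conversion (6.4 GT/sec to 12.8 GB/sec) implies; the function name is illustrative.

```python
# ASSUMPTION: 2 data bytes per transfer per direction, implied by 12.8 / 6.4.
BYTES_PER_TRANSFER = 2


def qpi_bandwidth_gb_per_s(gt_per_s, bytes_per_transfer=BYTES_PER_TRANSFER):
    """Return (one-direction, both-directions) bandwidth in GB/sec."""
    one_way = gt_per_s * bytes_per_transfer
    return one_way, 2 * one_way


if __name__ == "__main__":
    one_way, total = qpi_bandwidth_gb_per_s(6.4)
    print(f"{one_way:.1f} GB/sec per direction, {total:.1f} GB/sec per link")
```

The same function gives the scaling for any future transfer rate, since bandwidth is linear in GT/sec.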
Integrated Memory Controller (IMC)
• Memory controller optimized per market segment
• Initial Intel® Core™ microarchitecture (Nehalem) products
  – Native DDR3 IMC
  – Up to 3 channels per socket
  – Massive memory bandwidth
  – Designed for low latency
  – Support RDIMM and UDIMM
  – RAS Features
• Future products
  – Scalability
    – Vary # of memory channels
    – Increase memory speeds
    – Buffered and Non-Buffered solutions
  – Market specific needs
    – Higher memory capacity
    – Integrated graphics
[Diagram: two Nehalem EP sockets, each with DDR3 memory, and Tylersburg EP]
Significant performance through new IMC
Intel® Core™ microarchitecture (Nehalem-EP); Intel® Next Generation Server Processor Technology (Tylersburg-EP)
Memory Bandwidth – Initial Intel® Core™ Microarchitecture (Nehalem) Products
• 3 memory channels per socket
• ≥ DDR3-1066 at launch
• Massive memory BW
• Scalability
  – Design IMC and core to take advantage of BW
  – Allow performance to scale with cores
    – Core enhancements
      – Support more cache misses per core
      – Aggressive hardware prefetching w/ throttling enhancements
    – Example IMC Features
      – Independent memory channels
      – Aggressive Request Reordering
Stream Bandwidth – Mbytes/Sec (Triad)¹
Configuration | Mbytes/Sec
HTN 3.16 / BF1333 / 667 MHz mem | 6102
HTN 3.00 / SB1600 / 800 MHz mem | 9776
NHM 2.66 / 6.4 QPI / 1066 MHz mem | 33376 (3.4X)
Source: Intel Internal measurements – August 2008
Massive memory BW provides performance and scalability
¹ HTN: Intel® Xeon® processor 5400 Series (Harpertown); NHM: Intel® Core™ microarchitecture (Nehalem)
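For context, the measured Stream numbers can be set against a theoretical peak. The sketch assumes a 64-bit (8-byte) data path per DDR3 channel and 1066 mega-transfers per second for DDR3-1066; both are standard DDR3 properties rather than figures from the slide, and the helper name is illustrative.

```python
def peak_bandwidth_mb_per_s(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Theoretical peak in Mbytes/sec: transfers x channels x bytes per transfer.

    ASSUMPTION: 8 bytes per transfer (64-bit DDR3 channel).
    """
    return mega_transfers_per_s * channels * bytes_per_transfer


if __name__ == "__main__":
    per_socket = peak_bandwidth_mb_per_s(1066, channels=3)
    print("Peak per socket (3 x DDR3-1066):", per_socket, "Mbytes/sec")
    print("Measured NHM vs HTN 3.00 / SB1600:", round(33376 / 9776, 1), "X")
```

The measured 33376 Mbytes/sec is higher than the single-socket peak computed here, which suggests the measurement was taken on a two-socket system; that is an inference, not something the slide states.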
Non-Uniform Memory Access (NUMA)
• FSB architecture
  – All memory in one location
• Starting with Intel® Core™ microarchitecture (Nehalem)
  – Memory located in multiple places
• Latency to memory dependent on location
• Local memory
  – Highest BW
  – Lowest latency
• Remote Memory
  – Higher latency
[Diagram: two Nehalem EP sockets and Tylersburg EP]
Ensure software is NUMA-optimized for best performance
Intel® Core™ microarchitecture (Nehalem-EP); Intel® Next Generation Server Processor Technology (Tylersburg-EP)
Memory Latency Comparison
• Low memory latency critical to high performance
• Design integrated memory controller for low latency
• Need to optimize both local and remote memory latency
• Intel® Core™ microarchitecture (Nehalem) delivers
  – Huge reduction in local memory latency
  – Even remote memory latency is fast
• Effective memory latency depends on the application/OS
  – Percentage of local vs. remote accesses
  – Intel Core microarchitecture (Nehalem) has lower latency regardless of mix
[Chart: Relative Memory Latency Comparison¹; y-axis Relative Memory Latency, 0.00 to 1.00; bars for Harpertown (FSB 1600), Nehalem (DDR3-1067) Local, Nehalem (DDR3-1067) Remote]
¹ Next generation Quad-Core Intel® Xeon® processor (Harpertown); Intel® Core™ microarchitecture (Nehalem)
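The dependence on the local/remote mix can be written as a weighted average. The latency values used below are placeholders chosen for illustration; the chart's bar heights are not recoverable from the text, so they are not reproduced here.

```python
def effective_latency(local_fraction, local_latency, remote_latency):
    """Average memory latency for a given share of local accesses."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_latency + (1.0 - local_fraction) * remote_latency


if __name__ == "__main__":
    # HYPOTHETICAL relative latencies (not taken from the chart).
    local, remote = 0.6, 0.9
    for share in (1.0, 0.75, 0.5, 0.0):
        print(f"{share:.0%} local -> {effective_latency(share, local, remote):.3f}")
```

As long as both the local and the remote latency are below the FSB baseline, every mix is below it as well, which is the "regardless of mix" point above.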
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design Philosophy
• Enhanced Processor Core
  – Performance Features
  – Intel® Hyper-Threading Technology
• Feeding the Engine
  – New Memory Hierarchy
  – New Platform Architecture
• Performance Acceleration
  – Virtualization
  – New Instructions
Virtualization
• To get best virtualized performance
  – Have best native performance
  – Reduce:
    – # of transitions into/out of virtual machine
    – Latency of transitions
• Intel® Core™ microprocessor (Nehalem) virtualization features
  – Reduced latency for transitions
  – Virtual Processor ID (VPID) to reduce effective cost of transitions
  – Extended Page Table (EPT) to reduce # of transitions
Great virtualization performance with Intel® Core™ microarchitecture (Nehalem)
Latency of Virtualization Transitions
• Microarchitectural
  – Huge latency reduction generation over generation
  – Nehalem continues the trend
• Architectural
  – Virtual Processor ID (VPID) added in Intel® Core™ microarchitecture (Nehalem)
  – Removes need to flush TLBs on transitions
[Chart: Round Trip Virtualization Latency¹; y-axis Relative Latency, 0% to 100%; bars for Merom, Penryn, Nehalem]
Higher Virtualization Performance Through Lower Transition Latencies
¹ Intel® Core™ microarchitecture (formerly Merom); 45nm next generation Intel® Core™ microarchitecture (Penryn); Intel® Core™ microarchitecture (Nehalem)
Extended Page Tables (EPT) Motivation
• A VMM needs to protect physical memory
  – Multiple Guest OSs share the same physical memory
  – Protections are implemented through page-table virtualization
• Page table virtualization accounts for a significant portion of virtualization overheads
  – VM Exits / Entries
• The goal of EPT is to reduce these overheads
[Diagram: in VM1, the Guest OS CR3 points to the Guest Page Table; in the VMM, CR3 points to the Active Page Table. Guest page table changes cause exits into the VMM. The VMM maintains the active page table, which is used by the CPU]
EPT Solution
[Diagram: Guest CR3 selects the Intel® 64 Page Tables, which turn a Guest Linear Address into a Guest Physical Address; the EPT Base Pointer selects the EPT Page Tables, which turn the Guest Physical Address into a Host Physical Address]
• Intel® 64 Page Tables
  – Map Guest Linear Address to Guest Physical Address
  – Can be read and written by the guest OS
• New EPT Page Tables under VMM Control
  – Map Guest Physical Address to Host Physical Address
  – Referenced by new EPT base pointer
• No VM Exits due to Page Faults, INVLPG or CR3 accesses
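The two-stage mapping can be sketched with two lookup tables. This is a deliberately simplified model: real page tables are multi-level structures walked by hardware, whereas each table here is a flat dictionary from page number to page number, and all names and addresses are made up for illustration.

```python
PAGE_SIZE = 4096  # 4k pages


def translate(address, table):
    """Translate an address through a flat {page number: page number} table."""
    page, offset = divmod(address, PAGE_SIZE)
    if page not in table:
        raise KeyError(f"no mapping for address {address:#x}")
    return table[page] * PAGE_SIZE + offset


def guest_linear_to_host_physical(linear, guest_page_table, ept):
    """Stage 1 (guest-controlled): guest linear -> guest physical.
    Stage 2 (VMM-controlled EPT): guest physical -> host physical."""
    guest_physical = translate(linear, guest_page_table)
    return translate(guest_physical, ept)


if __name__ == "__main__":
    guest_page_table = {0x10: 0x2}  # guest may edit this without involving the VMM
    ept = {0x2: 0x7F}               # only the VMM edits this
    print(hex(guest_linear_to_host_physical(0x10123, guest_page_table, ept)))
```

Because the guest only ever touches the first table, changes to it need no VMM involvement, which is the source of the reduced VM exits.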
Extending Performance and Energy Efficiency - Intel® SSE4.2 Instruction Set Architecture (ISA) Leadership in 2008
SSE4 (45nm CPUs)
• SSE4.1 (Penryn Core)
• SSE4.2 (Nehalem Core)
  – STTNI, e.g. XML acceleration
  – ATA (Application Targeted Accelerators)
    – POPCNT, e.g. Genome Mining
    – CRC32, e.g. iSCSI Application
Accelerated String and Text Processing (STTNI)
• Faster XML parsing
• Faster search and pattern matching
• Novel parallel data matching and comparison operations
Accelerated Searching & Pattern Recognition of Large Data Sets (ATA)
• Improved performance for Genome Mining, Handwriting recognition
• Fast Hamming distance / Population count operations
New Communications Capabilities (ATA)
• Hardware based CRC instruction
• Accelerated Network attached storage
• Improved power efficiency for Software I-SCSI, RDMA, and SCTP
What should the applications, OS and VMM vendors do?
• Understand the benefits & take advantage of new instructions in 2008.
• Provide us feedback on instructions ISV would like to see for next generation of applications
STTNI - STring & Text New Instructions
Operates on strings of bytes or words (16b)
• Equal Each Instruction
  – True for each character in Src2 if same position in Src1 is equal
  – Src1: Test\tday
  – Src2: tad tseT
  – Mask: 01101111
• Equal Any Instruction
  – True for each character in Src2 if any character in Src1 matches
  – Src1: Example\n
  – Src2: atad tsT
  – Mask: 10100000
• Ranges Instruction
  – True if a character in Src2 is in at least one of up to 8 ranges in Src1
  – Src1: AZ’0’9zzz
  – Src2: taD tseT
  – Mask: 00100001
• Equal Ordered Instruction
  – Finds the start of a substring (Src1) within another string (Src2)
  – Src1: ABCA0XYZ
  – Src2: S0BACBAB
  – Mask: 00000010
[Diagram: STTNI MODEL. Source2 (XMM / M128) across the top with Bit 0 at the right, Source1 (XMM) down the side; for Equal Each, check each bit in the diagonal to form IntRes1]
Projected 3.8x kernel speedup on XML parsing & 2.7x savings on instruction cycles
STTNI Model
[Four comparison-matrix diagrams, each with Source2 (XMM / M128) across the top, Source1 (XMM) down the side, Bit 0 marked, and the result in IntRes1]
• EQUAL ANY: OR results down each column
• EQUAL EACH: Check each bit in the diagonal
• RANGES: First compare does GE, next does LE; AND GE/LE pairs of results; OR those results
• EQUAL ORDERED: AND the results along each diagonal
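The four aggregation modes can be modeled in a few lines and checked against the masks in the STTNI examples. This is a simplified sketch, not the exact instruction semantics: it ignores polarity, explicit lengths and word-sized elements, and it assumes that Src2 and the mask are written with bit 0 on the right (so Src2 is reversed before use) and that the "0" in the Equal Ordered example marks the string terminator.

```python
def mask_string(bits):
    """Render per-character results with bit 0 on the right, as on the slides."""
    return "".join("1" if b else "0" for b in reversed(bits))


def equal_each(src1, src2):
    """Compare the characters at matching positions (the diagonal)."""
    return [a == b for a, b in zip(src1, src2)]


def equal_any(src1, src2):
    """For each Src2 character: does it match any Src1 character (OR down the column)?"""
    return [c in src1 for c in src2]


def ranges(src1, src2):
    """Src1 holds (low, high) pairs; a Src2 character matches if it falls in any pair."""
    pairs = [(src1[i], src1[i + 1]) for i in range(0, len(src1) - 1, 2)]
    return [any(lo <= c <= hi for lo, hi in pairs) for c in src2]


def equal_ordered(needle, haystack, width=8):
    """Set bit i where the full needle starts at position i of the haystack."""
    bits = [False] * width
    for i in range(len(haystack) - len(needle) + 1):
        bits[i] = haystack[i:i + len(needle)] == needle
    return bits


if __name__ == "__main__":
    print(mask_string(equal_each("Test\tday", "tad tseT"[::-1])))
    print(mask_string(equal_any("Example\n", "atad tsT"[::-1])))
    print(mask_string(ranges("AZ09zzzz", "taD tseT"[::-1])))
    # Strings cut at the terminator: needle "ABCA", haystack "BABCAB".
    print(mask_string(equal_ordered("ABCA", "BABCAB")))
```

Each helper returns one boolean per Src2 position, which corresponds to the IntRes1 row at the bottom of the comparison matrices.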
ATA - Application Targeted Accelerators
CRC32
• Accumulates a CRC32 value using the iSCSI polynomial
• [Diagram: DST (bits 63:32 = X, bits 31:0 = Old CRC) combined with SRC Data (8/16/32/64 bit) gives DST (bits 63:32 = 0, bits 31:0 = New CRC)]
• One register maintains the running CRC value as a software loop iterates over data. Fixed CRC polynomial = 11EDC6F41h
• Replaces complex instruction sequences for CRC in Upper layer data protocols:
  – iSCSI, RDMA, SCTP
• Enables enterprise class data assurance with high data rates in networked storage in any user environment.
POPCNT
• POPCNT determines the number of nonzero bits in the source.
• [Diagram: the bits of RAX (0 1 0 . . . 0 0 1 1, bit 63 down to bit 0) are summed (Σ) to give RBX = 0x3; the result is compared with 0 to set ZF]
• POPCNT is useful for speeding up fast matching in data mining workloads including:
  – DNA/Genome Matching
  – Voice Recognition
• ZFlag set if result is zero. All other flags (C,S,O,A,P) reset
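For reference, both operations have short software equivalents. The sketch below is a bit-at-a-time model, not how the hardware works: the accumulate step is intended to mirror what the slide describes (a running CRC folded over data with polynomial 11EDC6F41h, used here in its bit-reflected form), while the initial value and final XOR of 0xFFFFFFFF are the usual software convention for CRC-32C and are an assumption, not part of the slide.

```python
# Bit-reflected form of the low 32 bits of the slide's polynomial 11EDC6F41h.
CRC32C_POLY_REFLECTED = 0x82F63B78


def crc32c_accumulate(crc, data):
    """Fold bytes into a running 32-bit CRC, one bit at a time (LSB first)."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc


def crc32c(data):
    """CRC-32C with the conventional 0xFFFFFFFF initial value and final XOR."""
    return crc32c_accumulate(0xFFFFFFFF, data) ^ 0xFFFFFFFF


def popcount(value):
    """Number of set bits in the low 64 bits of value."""
    return bin(value & 0xFFFFFFFFFFFFFFFF).count("1")


if __name__ == "__main__":
    print(hex(crc32c(b"123456789")))   # standard CRC-32C check input
    print(popcount((1 << 62) | 0b11))  # the RAX pattern from the diagram
```

As with the instruction, the running value can be carried across calls, so a buffer can be processed in pieces of any size.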
Tools Support of New Instructions
• Intel® Compiler 10.x supports the new instructions
  – Nehalem specific compiler optimizations
  – SSE4.2 supported via vectorization and intrinsics
  – Inline assembly supported on both IA-32 and Intel® 64 architecture targets
  – Necessary to include required header files in order to access intrinsics
• Intel® XML Software Suite
  – High performance C++ and Java runtime libraries
  – Version 1.0 (C++), version 1.01 (Java) available now
  – Version 1.1 w/SSE4.2 optimizations planned for September 2008
• Microsoft Visual Studio* 2008 VC++
  – SSE4.2 supported via intrinsics
  – Inline assembly supported on IA-32 only
  – Necessary to include required header files in order to access intrinsics
  – VC++ 2008 tools masm, msdis, and debuggers recognize the new instructions
• Sun Studio Express* 7/08
  – Supports Intel® Core™ microarchitecture (Merom), 45nm next generation Intel® Core™ microarchitecture (Penryn), Intel® Core™ microarchitecture (Nehalem)
  – SSE4.1, SSE4.2 through intrinsics
  – Nehalem specific compiler optimizations
• GCC* 4.3.1
  – Support Intel Core microarchitecture (Merom), 45nm next generation Intel Core microarchitecture (Penryn), Intel Core microarchitecture (Nehalem) via -mtune=generic
  – Support SSE4.1 and SSE4.2 through vectorizer and intrinsics
Broad Software Support for Intel® Core™ Microarchitecture (Nehalem)
Software Optimization Guidelines
• Most optimizations for Intel® Core™ microarchitecture still hold
• Examples of new optimization guidelines:
  – 16-byte unaligned loads/stores
  – Enhanced macrofusion rules
  – NUMA optimizations
• Intel® Core™ microarchitecture (Nehalem) SW Optimization Guide will be published
• Intel® Compiler will support settings for Intel Core microarchitecture (Nehalem) optimizations
50
Agenda
• Intel® Core™ microarchitecture (Nehalem) power management overview
• Minimizing idle power consumption
• Performance when you need it
51
Intel® Core™ Microarchitecture (Nehalem) Design Goals
World class performance combined with superior energy efficiency – Optimized for:
• Single Thread, Multi-threads
• Existing Apps, Emerging Apps
• All Usages
Dynamically scaled performance when needed to maximize energy efficiency
A single, scalable foundation optimized across each segment and power envelope
Workstation / Server, Desktop / Mobile
52
A Dynamic and Design Scalable Microarchitecture
Power Control Unit (PCU)
• Integrated proprietary microcontroller
• Shifts control from hardware to embedded firmware
• Real time sensors for temperature, current, power
• Flexibility enables sophisticated algorithms, tuned for current operating conditions
[Figure: the PCU, fed by BCLK and Vcc, connects to each core (Vcc, frequency, sensors, PLL) and to the uncore and LLC.]
53
Agenda
• Intel® Core™ microarchitecture (Nehalem) power management overview
• Minimizing idle power consumption
• Performance when you need it
54
Idle Power Matters
• Data center operating costs¹
  – 41M physical servers by 2010, average utilization < 10%
  – $0.50 spent on power and cooling for every $1 spent on server hardware
• Regulatory requirements affect all segments
  – ENERGY STAR* and related requirements
• Environmental responsibility
Idle power consumption not just mobile concern
55
1. IDC’s Datacenter Trends Survey, January 2007
CPU Core Power Consumption
• Remaining power in logic, local clocks
  – Power efficient microarchitecture, good clock gating minimize waste
• High frequency designs require high performance global clock distribution
• High frequency processes are leaky
  – Reduced via high-K metal gate process, design technologies, manufacturing optimizations
[Figure: total core power consumption, stacked as local clocks and logic, clock distribution, and leakage.]
Challenge – Minimize power when idle
58
Minimizing Idle Power Consumption
• Operating system notifies CPU when no tasks are ready for execution
  – Execution of MWAIT instruction
• MWAIT arguments hint at expected idle duration
  – Higher numbered C-states: lower power, but also longer exit latency
• CPU idle states referred to as “C-States”
[Figure: idle power (W) versus exit latency (us) for C0, C1, ... Cn.]
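The trade-off on this slide (deeper C-states save more power but take longer to exit) can be sketched as a small selection routine. This is a minimal illustration, not how any operating system or the hardware actually chooses: the `CState` table, its power and latency numbers, and `pick_c_state` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CState:
    name: str
    idle_power_w: float     # hypothetical idle power in this state
    exit_latency_us: float  # hypothetical time to return to C0


# Illustrative numbers only; deeper states trade lower power for longer exits.
C_STATES = [
    CState("C1", 4.0, 1.0),
    CState("C3", 2.0, 20.0),
    CState("C6", 0.1, 100.0),
]


def pick_c_state(expected_idle_us: float, latency_limit_us: float) -> CState:
    """Pick the lowest-power state whose exit latency fits both the expected
    idle period and the latency the system can tolerate."""
    budget_us = min(expected_idle_us, latency_limit_us)
    allowed = [s for s in C_STATES if s.exit_latency_us <= budget_us]
    # Fall back to the shallowest state if nothing fits the budget.
    return min(allowed, key=lambda s: s.idle_power_w) if allowed else C_STATES[0]


print(pick_c_state(expected_idle_us=500.0, latency_limit_us=200.0).name)
print(pick_c_state(expected_idle_us=10.0, latency_limit_us=200.0).name)
```

A long expected idle period selects the deepest state; a short one keeps the core in a shallow state so the exit latency does not dominate.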
59
C-State Support Before Intel® Core™ Microarchitecture (Nehalem)
• C0: CPU active state
• C1, C2 states (early 1990s):
  • Stop core pipeline
  • Stop most core clocks
• C3 state (mid 1990s):
  • Stop remaining core clocks
• C4, C5, C6 states (mid 2000s):
  • Drop core voltage, reducing leakage
  • Voltage reduction via shared VR
[Figure: active core power bar made up of local clocks and logic, clock distribution, and leakage; at this step only leakage is still shown.]
Existing C-states significantly reduce idle power
63
C-State Support Before Intel® Core™ Microarchitecture (Nehalem)
• Cores share a single voltage plane
  – All cores must be idle before voltage reduced
  – Independent VRs per core prohibitive from cost and form factor perspective
• Deepest C-states have relatively long exit latencies
  – System / VR handshake, ramp voltage, restore state, restart pipeline, etc.
Deepest C-states available in mobile products
64
C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
[Figure sequence: core power versus time for Core 0 and Core 1; the captions below follow the animation in order.]
• Cores running applications.
• Task completes. No work waiting. OS executes MWAIT(C6) instruction.
• Execution stops. Core architectural state saved. Core clocks stopped. Core 0 continues execution undisturbed.
• Task completes. No work waiting. OS executes MWAIT(C6) instruction. Core enters C6.
• VR voltage reduced. Power drops.
• Interrupt for Core 1 arrives. VR voltage increased. Core 1 clocks turn on, core state restored, and core resumes execution at instruction following MWAIT(C6). Core 0 remains idle.
• Interrupt for Core 0 arrives. Core 0 returns to C0 and resumes execution at instruction following MWAIT(C6). Core 1 continues execution undisturbed.
C6 significantly reduces idle power consumption
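The shared-voltage behaviour in this sequence can be modelled in a few lines. This is a sketch under stated assumptions: `shared_rail_voltage` is a hypothetical helper, and the two voltage values are placeholders, not Penryn specifications. The only rule taken from the slides is that the shared VR can lower the voltage once every core is in C6.

```python
def shared_rail_voltage(core_states, v_active=1.1, v_c6=0.35):
    """Return the voltage on a rail shared by all cores.

    The rail can drop only when every core is in C6; the voltage values
    are placeholders chosen for illustration.
    """
    return v_c6 if all(state == "C6" for state in core_states) else v_active


# The slide sequence, as [Core 0, Core 1] states after each event.
timeline = [
    ("both cores running",        ["C0", "C0"]),
    ("Core 1 executes MWAIT(C6)", ["C0", "C6"]),
    ("Core 0 executes MWAIT(C6)", ["C6", "C6"]),
    ("interrupt for Core 1",      ["C6", "C0"]),
    ("interrupt for Core 0",      ["C0", "C0"]),
]
for event, states in timeline:
    print(f"{event:28s} rail = {shared_rail_voltage(states):.2f} V")
```

Only the middle step, where both cores are idle, reaches the reduced voltage; a single busy core keeps the whole rail at the active level.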
72
Intel® Core™ Microarchitecture (Nehalem): Integrated Power Gate
• Integrated power switch between VR output and core voltage supply
  – Very low on-resistance
  – Very high off-resistance
  – Much faster voltage ramp than external VR
• Enables per core C6 state
  – Individual cores transition to ~0 power state
  – Transparent to other cores, platform, software, and VR
[Figure: VCC feeds Core0, Core1, Core2, and Core3; VTT feeds the memory system, cache, and I/O.]
Close collaboration with process technology to optimize device characteristics
73
Intel® Core™ Microarchitecture (Nehalem) Core C-State Support
• C0: CPU active state
• C1 state:
  • Stop core pipeline
  • Stop most core clocks
• C3 state:
  • Stop remaining core clocks
• C6 state:
  • Processor saves architectural state
  • Turn off power gate, eliminating leakage
[Figure: active core power bar made up of local clocks and logic, clock distribution, and leakage; at this step only leakage is still shown.]
Core idle power goes to ~0
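The way each core C-state strips away one part of core power can be sketched as a lookup. The wattages are hypothetical, and the model simplifies the slide in one respect: it treats C1 as removing all local clock and logic power, whereas the slide says only that most core clocks stop.

```python
# Hypothetical split of active core power, in watts.
ACTIVE_POWER_W = {
    "local clocks and logic": 10.0,
    "clock distribution": 4.0,
    "leakage": 6.0,
}

# Components each state removes, following the slide's bullets (simplified).
REMOVED_BY_STATE = {
    "C0": set(),
    "C1": {"local clocks and logic"},
    "C3": {"local clocks and logic", "clock distribution"},
    "C6": {"local clocks and logic", "clock distribution", "leakage"},
}


def core_power(state: str) -> float:
    """Sum the power components that are still present in `state`."""
    removed = REMOVED_BY_STATE[state]
    return float(sum(w for part, w in ACTIVE_POWER_W.items() if part not in removed))


for state in ("C0", "C1", "C3", "C6"):
    print(state, core_power(state))
```

In this model C6 reaches zero because the power gate removes leakage as well, which is the difference from the earlier voltage-reduction approach.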
77
C6 on Intel® Core™ Microarchitecture (Nehalem)
[Figure sequence: core power versus time for Core 0, Core 1, Core 2, and Core 3; the captions below follow the animation in order.]
• Cores 0, 1, 2, and 3 running applications.
• Task completes. No work waiting. OS executes MWAIT(C6) instruction.
• Execution stops. Core architectural state saved. Core clocks stopped. Cores 0, 1, and 3 continue execution undisturbed.
• Core power gate turned off. Core voltage goes to 0. Cores 0, 1, and 3 continue execution undisturbed.
• Task completes. No work waiting. OS executes MWAIT(C6) instruction. Core 0 enters C6. Cores 1 and 3 continue execution undisturbed.
• Interrupt for Core 2 arrives. Core 2 returns to C0, execution resumes at instruction following MWAIT(C6). Cores 1 and 3 continue execution undisturbed.
• Interrupt for Core 0 arrives. Power gate turns on, core clock turns on, core state restored, core resumes execution at instruction following MWAIT(C6). Cores 1, 2, and 3 continue execution undisturbed.
Core independent C6 on Intel Core microarchitecture (Nehalem) extends benefits
85
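To see why core independent C6 matters, the per-core power gate can be compared with a shared voltage rail in a toy leakage model. Everything numeric here is an assumption: `leak_w` is a placeholder per-core leakage figure, and `shared_idle_fraction` is a made-up factor for how much leakage remains after a shared rail lowers its voltage.

```python
def idle_leakage(core_states, per_core_gate, leak_w=6.0, shared_idle_fraction=0.25):
    """Total leakage in watts for a list of per-core states ("C0" or "C6").

    per_core_gate=True:  each core in C6 is gated off and stops leaking.
    per_core_gate=False: leakage falls only when every core is in C6, and
                         then only to a fraction, since voltage is reduced
                         rather than removed.
    """
    if per_core_gate:
        return sum(leak_w for s in core_states if s != "C6")
    total = leak_w * len(core_states)
    if all(s == "C6" for s in core_states):
        return total * shared_idle_fraction
    return total


half_idle = ["C0", "C6", "C0", "C6"]
print("per-core gate:", idle_leakage(half_idle, per_core_gate=True))
print("shared rail:  ", idle_leakage(half_idle, per_core_gate=False))
```

With two of four cores idle, the gated design already saves their leakage, while the shared rail saves nothing until the last core goes idle.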
Intel® Core™ Microarchitecture (Nehalem)-based Processor
• Significant logic outside core
  – Integrated memory controller
  – Large shared cache
  – High speed interconnect
  – Arbitration logic
[Figure: die floorplan with four cores, memory controller, queue, shared L3 cache, QPI 0, QPI 1, and miscellaneous I/O.]
[Figure: total CPU power consumption. Cores (x N): clocks and logic, clock distribution, leakage. Outside the cores: I/O, uncore logic, uncore clock distribution, uncore leakage.]
QPI = Intel® QuickPath Interconnect (Intel® QPI)
86
Intel® Core™ Microarchitecture (Nehalem) Package C-State Support
• All cores in C6 state:
  – Core power to ~0
• Package to C6 state:
  – Uncore logic stops toggling
  – I/O to lower power state
  – Uncore clock grids stopped
[Figure: active CPU power bar; the components still shown at this step are I/O, uncore clock distribution, and uncore leakage.]
Substantial reduction in idle CPU power
90
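The relationship between core and package C-states can be sketched as follows. The slide states only that the package moves to C6 once all cores are in C6; generalising that to "the package can go no deeper than its shallowest core" is an assumption of this sketch, as are the helper names.

```python
# Ordering of states by depth; a larger number means a deeper state.
C_STATE_DEPTH = {"C0": 0, "C1": 1, "C3": 3, "C6": 6}


def package_state(core_states):
    """Assumed rule: the package is limited by its shallowest core."""
    return min(core_states, key=lambda s: C_STATE_DEPTH[s])


def uncore_actions(core_states):
    """Uncore power reductions that apply, taken from the slide's bullets."""
    if package_state(core_states) != "C6":
        return []
    return [
        "uncore logic stops toggling",
        "I/O to lower power state",
        "uncore clock grids stopped",
    ]


print(package_state(["C6", "C6", "C3", "C6"]), uncore_actions(["C6", "C6", "C3", "C6"]))
print(package_state(["C6"] * 4), uncore_actions(["C6"] * 4))
```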
Reducing Platform Idle Power
• Dramatic improvements in CPU idle power increase importance of platform improvements
• Memory power:
  – Memory clocks stopped between requests at low utilization
  – Memory to self refresh in package C3, C6
• Link power:
  – Intel® QuickPath Interconnect links to lower power states as CPU becomes less active
  – PCI Express* links on chipset have similar behavior
• Hint to VR to reduce phases during periods of low current demand
Intel® Core™ microarchitecture (Nehalem) reduces CPU and platform power
91
C-State Exit Latency
• PCU monitors interrupt rate, and overrides C-state choice when desirable
  – Operating system C-state choice depends on CPU utilization
  – On some workloads, CPU utilization is low but latency is important
  – Flexibility provided by PCU allows for complex algorithms to optimize behavior
Intel® Core™ microarchitecture (Nehalem) provides benefit of low idle power without hurting performance
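The override described here can be pictured as a demotion rule. The slides do not describe the actual PCU algorithm, so this is purely illustrative: the threshold, the choice of C3 as the demotion target, and the function name are all hypothetical.

```python
def pcu_resolve_c_state(os_request: str, interrupts_per_ms: float,
                        demote_threshold_per_ms: float = 5.0) -> str:
    """Demote a deep C-state request when interrupts arrive so often that
    the exit latency would be paid repeatedly.

    Threshold and demotion target are placeholders, not PCU behaviour.
    """
    if os_request == "C6" and interrupts_per_ms > demote_threshold_per_ms:
        return "C3"
    return os_request


# Low utilization but frequent interrupts: the OS asks for C6, the model demotes.
print(pcu_resolve_c_state("C6", interrupts_per_ms=20.0))
# Genuinely quiet system: the OS request is honoured.
print(pcu_resolve_c_state("C6", interrupts_per_ms=0.1))
```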
92
Agenda
• Intel® Core™ microarchitecture (Nehalem) power management overview
• Minimizing idle power consumption
• Performance when you need it
93
Managing Active Power
• Operating system changes frequency as needed to meet performance needs, minimize power
  – Enhanced Intel SpeedStep® Technology
  – Referred to as processor P-States
• PCU tunes voltage for given frequency, operating conditions, and silicon characteristics
• When core(s) enter deep C-states, voltage requested from VR is reduced, all else being equal
  – Results in lower voltage on the die on workloads of interest
  – Results in lower power consumption at a given frequency
PCU automatically optimizes operating voltage
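Why a lower voltage saves power at a fixed frequency can be illustrated with the textbook CMOS switching-power approximation P ≈ C·V²·f. That formula is not taken from the slides, and the capacitance, voltage, and frequency values below are placeholders.

```python
def dynamic_power_w(c_eff_farads: float, voltage_v: float, freq_hz: float) -> float:
    """Textbook switching-power approximation: P = C_eff * V^2 * f."""
    return c_eff_farads * voltage_v ** 2 * freq_hz


# Same frequency, slightly lower voltage (placeholder values).
base = dynamic_power_w(1e-9, 1.2, 2.0e9)
lower_v = dynamic_power_w(1e-9, 1.1, 2.0e9)
print(f"power at 1.2 V: {base:.3f} W")
print(f"power at 1.1 V: {lower_v:.3f} W")
print(f"saving at the same frequency: {100 * (1 - lower_v / base):.1f}%")
```

Because voltage enters squared, even a small reduction in requested voltage gives a noticeably larger relative power saving.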
94
Turbo Mode: Key to Scalability Goal
• Intel® Core™ microarchitecture (Nehalem) is a scalable architecture
  – High frequency core for performance in less constrained form factors
  – Retain ability to use that frequency in very small form factors
  – Retain ability to use that frequency when running lightly threaded or lower power workloads
• Turbo utilizes available frequency:
  – Maximizes both single-thread and multi-thread performance in the same part
Turbo Mode provides performance when you need it
95
Turbo Mode Before Intel® Core™ Microarchitecture (Nehalem)
• Clock Stopped: power reduction in inactive cores
• Turbo Mode: in response to workload, adds additional performance bins within headroom
[Figure sequence: frequency (F) bars for Core 0 and Core 1, comparing No Turbo with Turbo Mode for a lightly threaded workload.]
97
Intel® Core™ Microarchitecture (Nehalem) Turbo Mode
• Power Gating: zero power for inactive cores
• Active cores running workloads < TDP
• Turbo Mode: in response to workload, adds additional performance bins within headroom
[Figure sequence: frequency (F) bars for Core 0, Core 1, Core 2, and Core 3, comparing No Turbo with Turbo Mode for a workload that is lightly threaded or < TDP.]
Dynamically Delivering Optimal Performance and Energy Efficiency
103
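The idea of adding performance bins within headroom can be sketched as below. None of the numbers come from the slides: the base frequency, bin size, per-bin power cost, and the per-core-count bin limits are hypothetical, and the real decision is made by PCU firmware using sensor data.

```python
# Hypothetical upper limit on turbo bins for each count of active cores.
MAX_BINS_BY_ACTIVE_CORES = {1: 2, 2: 1, 3: 1, 4: 1}


def turbo_frequency_mhz(active_cores: int, power_w: float, tdp_w: float,
                        base_mhz: int = 2000, bin_mhz: int = 100,
                        watts_per_bin: float = 5.0) -> int:
    """Add frequency bins while estimated power stays under TDP.

    All parameters are placeholders for illustration.
    """
    headroom_bins = max(0, int((tdp_w - power_w) // watts_per_bin))
    bins = min(MAX_BINS_BY_ACTIVE_CORES[active_cores], headroom_bins)
    return base_mhz + bins * bin_mhz


# Lightly threaded: one active core, the others power gated.
print(turbo_frequency_mhz(active_cores=1, power_w=60.0, tdp_w=95.0))
# All cores active but below TDP.
print(turbo_frequency_mhz(active_cores=4, power_w=80.0, tdp_w=95.0))
# All cores active near TDP: no headroom, no turbo.
print(turbo_frequency_mhz(active_cores=4, power_w=93.0, tdp_w=95.0))
```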
Turbo Mode Enabling
• Turbo Mode exposed as additional Enhanced Intel SpeedStep® Technology operating point
  – Operating system treats as any other P-state, requesting Turbo Mode when it needs more performance
  – Performance benefit comes from higher operating frequency – no need to enable or tune software
• Turbo Mode is transparent to system
  – Frequency transitions handled completely in hardware
  – PCU keeps silicon within existing operating limits
  – Systems designed to same specs, with or without Turbo Mode
Performance benefits with existing applications and operating systems
104
Summary
• Intel® Core™ microarchitecture (Nehalem) – The 45nm Tock
• Designed for
  – Power Efficiency
  – Scalability
  – Performance
• Key Innovations:
  – Enhanced Processor Core
  – Brand New Platform Architecture
  – Sophisticated Power Management
High Performance When You Need It
Lower Power When You Don’t
105
Additional Sources of Information on This Topic:
• Other Sessions / Chalk Talks / Labs:
  – TCHS001: Next Generation Intel® Core™ Microarchitecture (Nehalem) Family of Processors: Screaming Performance, Efficient Power (8/19, 3:00 – 3:50)
  – DPTS001: High End Desktop Platform Design Overview for the Next Generation Intel® Microarchitecture (Nehalem) Processor (8/20, 2:40 – 3:30)
  – NGMS001: Next Generation Intel® Microarchitecture (Nehalem) Family: Architectural Insights and Power Management (8/19, 4:00 – 5:50)
  – NGMC001: Chalk Talk: Next Generation Intel® Microarchitecture (Nehalem) Family (8/19, 5:50 – 6:30)
  – NGMS002: Tuning Your Software for the Next Generation Intel® Microarchitecture (Nehalem) Family (8/20, 11:10 – 12:00)
  – PWRS003: Power Managing the Virtual Data Center with Windows Server* 2008 / Hyper-V and Next Generation Processor-based Intel® Servers Featuring Intel® Dynamic Power Technology (8/19, 3:00 – 3:50)
  – PWRS005: Platform Power Management Options for Intel® Next Generation Server Processor Technology (Tylersburg-EP) (8/21, 1:40 – 2:30)
  – SVRS002: Overview of the Intel® QuickPath Interconnect (8/21, 11:10 – 12:00)
106
Session Presentations - PDFs
The PDF for this Session presentation is available from our IDF Content Catalog at the end of the day at:
www.intel.com/idf
or
https://intel.wingateweb.com/US08/scheduler/public.jsp
107
Please Fill out the Session Evaluation Form
Place form in evaluation box at the back of session room
Thank you for your input, we use it to improve future Intel Developer Forum events
108
Q&A
109
Legal Disclaimer
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
• Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Merom, Penryn, Harpertown, Nehalem, Dothan, Westmere, Sandy Bridge, and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.
• Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
• Intel, Intel Inside, Intel Core, Pentium, Intel SpeedStep Technology, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
• *Other names and brands may be claimed as the property of others.
• Copyright © 2008 Intel Corporation.
110
Risk Factors
This presentation contains forward-looking statements that involve a number of risks and uncertainties. These statements do not reflect the potential impact of any mergers, acquisitions, divestitures, investments or other similar transactions that may be completed in the future. The information presented is accurate only as of today’s date and will not be updated. In addition to any factors discussed in the presentation, the important factors that could cause actual results to differ materially include the following: Demand could be different from Intel's expectations due to factors including changes in business and economic conditions, including conditions in the credit market that could affect consumer confidence; customer acceptance of Intel’s and competitors’ products; changes in customer order patterns, including order cancellations; and changes in the level of inventory at customers. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of new Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; Intel’s ability to respond quickly to technological developments and to incorporate new features into its products; and the availability of sufficient supply of components from suppliers to meet demand. The gross margin percentage could vary significantly from expectations based on changes in revenue levels; product mix and pricing; capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; excess or obsolete inventory; manufacturing yields; changes in unit costs; impairments of long-lived assets, including manufacturing, assembly/test and intangible assets; and the timing and execution of the manufacturing ramp and associated costs, including start-up costs. Expenses, particularly certain marketing and compensation expenses, vary depending on the level of demand for Intel's products, the level of revenue and profits, and impairments of long-lived assets. Intel is in the midst of a structure and efficiency program that is resulting in several actions that could have an impact on expected expense levels and gross margin. Intel's results could be impacted by adverse economic, social, political and physical/infrastructure conditions in the countries in which Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the report on Form 10-Q for the quarter ended June 28, 2008.
111
Backup Slides
112
Branch Prediction Reminder
• Goal: Keep powerful compute engine fed
• Options:
  – Stall pipeline while determining branch direction/target
  – Predict branch direction/target and correct if wrong
• Minimize amount of time wasted correcting from incorrect branch predictions
  – Performance:
    – Through higher branch prediction accuracy
    – Through faster correction when prediction is wrong
  – Power efficiency: Minimize number of speculative/incorrect micro-ops that are executed
Continued focus on branch prediction improvements
113
L2 Branch Predictor
• Problem: Software with a large code footprint not able to fit well in existing branch predictors
  – Example: Database applications
• Solution: Use multi-level branch prediction scheme
• Benefits:
  – Higher performance through improved branch prediction accuracy
  – Greater power efficiency through less mis-speculation
114
Advanced Renamed Return Stack Buffer (RSB)
• Instruction Reminder
– CALL: Entry into functions
– RET: Return from functions
• Classical Solution
– Return Stack Buffer (RSB) used to predict RET
– RSB can be corrupted by speculative path
• The Renamed RSB
– No RET mispredicts in the common case
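A minimal sketch of the classical return stack buffer may help show why a speculative path can corrupt it. The entry count and the names `rsb_t`, `rsb_on_call` and `rsb_on_ret` are hypothetical. The slides do not describe how the renamed RSB is implemented, so that part is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RSB_ENTRIES 16u /* hypothetical depth */

/* Classical RSB: a small circular stack of predicted return addresses. */
typedef struct {
    uint64_t addr[RSB_ENTRIES];
    unsigned depth; /* number of entries currently pushed */
} rsb_t;

void rsb_init(rsb_t *r)
{
    assert(r != NULL);
    for (size_t i = 0; i < RSB_ENTRIES; i++)
        r->addr[i] = 0;
    r->depth = 0;
}

/* CALL: remember where the matching RET should return to. */
void rsb_on_call(rsb_t *r, uint64_t return_addr)
{
    r->addr[r->depth % RSB_ENTRIES] = return_addr;
    r->depth++;
}

/* RET: pop the predicted return address (0 if the buffer is empty). */
uint64_t rsb_on_ret(rsb_t *r)
{
    if (r->depth == 0)
        return 0;
    r->depth--;
    return r->addr[r->depth % RSB_ENTRIES];
}
```

If a wrong-path RET followed by a wrong-path CALL runs before the misprediction is detected, the top entry is overwritten. Unless the buffer is restored, the next real RET is then predicted from the stale address.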
Inclusive vs. Exclusive Caches – Cache Miss
[Figure: exclusive L3 cache (left) and inclusive L3 cache (right), each shared by Core 0 through Core 3]
Data request from Core 0 misses Core 0’s L1 and L2
Request sent to the L3 cache
Inclusive vs. Exclusive Caches – Cache Miss
[Figure: the lookup reports MISS in both the exclusive and the inclusive L3 cache]
Core 0 looks up the L3 Cache
Data not in the L3 Cache
Inclusive vs. Exclusive Caches – Cache Miss
[Figure: the lookup reports MISS in both the exclusive and the inclusive L3 cache]
Exclusive: Must check other cores
Inclusive: Guaranteed data is not on-die
Greater scalability from inclusive approach
Inclusive vs. Exclusive Caches – Cache Hit
[Figure: the lookup reports HIT in both the exclusive and the inclusive L3 cache]
Exclusive: No need to check other cores
Inclusive: Data could be in another core, BUT Intel® Core™ microarchitecture (Nehalem) is smart…
Inclusive vs. Exclusive Caches – Cache Hit
Inclusive
• Maintain a set of “core valid” bits per cache line in the L3 cache
• Each bit represents a core
• If the L1/L2 of a core may contain the cache line, then core valid bit is set to “1”
• No snoops of cores are needed if no bits are set
• If more than 1 bit is set, line cannot be in Modified state in any core
[Figure: inclusive L3 cache reports HIT; core valid bits 0 0 0 0 for Core 0 through Core 3]
Core valid bits limit unnecessary snoops
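The core valid bit rules can be sketched as a small filter that decides which cores need a snoop on an L3 hit. The struct layout and the names `l3_line_t`, `l3_snoop_mask` and `l3_line_may_be_modified` are hypothetical. Excluding the requesting core from the snoop set is an assumption of this sketch, not something the slide states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES 4u

/* Hypothetical view of one L3 cache line's bookkeeping. */
typedef struct {
    uint64_t tag;
    uint8_t core_valid; /* bit i = 1: L1/L2 of core i may contain the line */
} l3_line_t;

/* Bit mask of cores that must be snooped on an L3 hit.
 * Assumption: the requester already missed its own L1/L2, so it is skipped.
 * A result of 0 means no snoops are needed. */
uint8_t l3_snoop_mask(const l3_line_t *line, unsigned requester)
{
    assert(line != NULL && requester < NUM_CORES);
    unsigned all_cores = (1u << NUM_CORES) - 1u;
    return (uint8_t)(line->core_valid & ~(1u << requester) & all_cores);
}

/* A line can be in Modified state in a core only if exactly one
 * core valid bit is set. */
bool l3_line_may_be_modified(const l3_line_t *line)
{
    assert(line != NULL);
    unsigned v = line->core_valid;
    return v != 0 && (v & (v - 1u)) == 0;
}
```

With core valid bits 0 0 1 0 for Core 0 through Core 3, a request from Core 0 yields a mask that selects only Core 2, which matches the "read from other core" case in the following slide.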
Inclusive vs. Exclusive Caches – Read from other core
[Figure: exclusive L3 cache reports MISS; inclusive L3 cache reports HIT with core valid bits 0 0 1 0 for Core 0 through Core 3]
Exclusive: Must check all other cores
Inclusive: Only need to check the core whose core valid bit is set
Local Memory Access
• CPU0 requests cache line X, not present in any CPU0 cache
– CPU0 requests data from its DRAM
– CPU0 snoops CPU1 to check if data is present
• Step 2:
– DRAM returns data
– CPU1 returns snoop response
• Local memory latency is the maximum latency of the two responses
• Intel® Core™ microarchitecture (Nehalem) optimized to keep key latencies close to each other
[Figure: CPU0 and CPU1 connected by Intel® QPI, each with its own DRAM]
Intel® QPI = Intel® QuickPath Interconnect
Remote Memory Access
• CPU0 requests cache line X, not present in any CPU0 cache
– CPU0 requests data from CPU1
– Request sent over Intel® QuickPath Interconnect (Intel® QPI) to CPU1
– CPU1’s IMC makes request to its DRAM
– CPU1 snoops internal caches
– Data returned to CPU0 over Intel QPI
• Remote memory latency a function of having a low latency interconnect
[Figure: CPU0 and CPU1 connected by Intel® QPI, each with its own DRAM]
Example Code For strlen()

Current Code

string equ [esp + 4]
        mov     ecx,string              ; ecx -> string
        test    ecx,3                   ; test if string is aligned on 32 bits
        je      short main_loop
str_misaligned:
        ; simple byte loop until string is aligned
        mov     al,byte ptr [ecx]
        add     ecx,1
        test    al,al
        je      short byte_3
        test    ecx,3
        jne     short str_misaligned
        add     eax,dword ptr 0         ; 5 byte nop to align label below
        align   16                      ; should be redundant
main_loop:
        mov     eax,dword ptr [ecx]     ; read 4 bytes
        mov     edx,7efefeffh
        add     edx,eax
        xor     eax,-1
        xor     eax,edx
        add     ecx,4
        test    eax,81010100h
        je      short main_loop
        ; found zero byte in the loop
        mov     eax,[ecx - 4]
        test    al,al                   ; is it byte 0
        je      short byte_0
        test    ah,ah                   ; is it byte 1
        je      short byte_1
        test    eax,00ff0000h           ; is it byte 2
        je      short byte_2
        test    eax,0ff000000h          ; is it byte 3
        je      short byte_3
        jmp     short main_loop         ; taken if bits 24-30 are clear and bit
                                        ; 31 is set
byte_3:
        lea     eax,[ecx - 1]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_2:
        lea     eax,[ecx - 2]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_1:
        lea     eax,[ecx - 3]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_0:
        lea     eax,[ecx - 4]
        mov     ecx,string
        sub     eax,ecx
        ret
strlen  endp
        end

STTNI Version

int sttni_strlen(const char * src)
{
    char eom_vals[32] = {1, 255, 0};
    __asm{
        mov     eax, src
        movdqu  xmm2, eom_vals
        xor     ecx, ecx
    topofloop:
        add     eax, ecx
        movdqu  xmm1, OWORD PTR[eax]
        pcmpistri xmm2, xmm1, imm8
        jnz     topofloop
    endofstring:
        add     eax, ecx
        sub     eax, src
        ret
    }
}

Current Code: Minimum of 11 instructions; Inner loop processes 4 bytes with 8 instructions
STTNI Code: Minimum of 10 instructions; A single inner loop processes 16 bytes with only 4 instructions
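The same idea can be sketched with the SSE4.2 compiler intrinsic `_mm_cmpistri` instead of inline assembly. Several points here are assumptions rather than content from the slide: the slide leaves `imm8` unspecified, so the mode below (unsigned bytes, range compare, negated result, lowest index) is one choice that makes the `{1, 255}` range locate the terminator; the sketch first steps to a 16-byte boundary so that aligned loads never cross a page, whereas the slide uses unaligned loads; and it relies on GCC/Clang extensions (`__attribute__((target))`, `__builtin_cpu_supports`). The aligned loads can still read bytes past the terminator, which is common practice on x86 but not strict ISO C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <nmmintrin.h> /* SSE4.2 intrinsics, including _mm_cmpistri */

/* Assumed imm8: unsigned bytes, range compare, negated result, lowest index. */
#define STTNI_MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_RANGES | \
                    _SIDD_NEGATIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT)

__attribute__((target("sse4.2")))
static size_t strlen_sse42(const char *s)
{
    /* One range, 1..255, as in the slide's eom_vals: it matches every
     * non-zero byte, so the negated result flags the terminator. */
    static const unsigned char eom_vals[16] = {1, 255, 0};
    const __m128i range = _mm_loadu_si128((const __m128i *)eom_vals);
    const char *p = s;

    /* Byte loop until p is 16-byte aligned. */
    while (((uintptr_t)p & 15u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }
    for (;;) {
        const __m128i chunk = _mm_load_si128((const __m128i *)p);
        /* Index of the first byte outside 1..255, or 16 if there is none. */
        const int idx = _mm_cmpistri(range, chunk, STTNI_MODE);
        if (idx < 16)
            return (size_t)(p - s) + (size_t)idx;
        p += 16;
    }
}
#endif

/* Dispatches to the SSE4.2 path when the CPU supports it. */
size_t sttni_strlen(const char *s)
{
    assert(s != NULL);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("sse4.2"))
        return strlen_sse42(s);
#endif
    return strlen(s); /* portable fallback */
}
```

Each loop iteration examines 16 bytes with a single compare, which is the property the slide's instruction counts highlight.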
CRC32 Preliminary Performance

Preliminary tests involved Kernel code implementing CRC algorithms commonly used by iSCSI drivers.
32-bit and 64-bit versions of the Kernel under test
32-bit version processes 4 bytes of data using 1 CRC32 instruction
64-bit version processes 8 bytes of data using 1 CRC32 instruction
Input strings of sizes 48 bytes and 4KB used for the test

                              32-bit     64-bit
Input Data Size = 48 bytes    6.53 X     9.85 X
Input Data Size = 4 KB        9.3 X      18.63 X

CRC32 optimized Code

crc32c_sse42_optimized_version(uint32 crc, unsigned char const *p, size_t len)
{ // Assuming len is a multiple of 0x10
    asm("pusha");
    asm("mov %0, %%eax" :: "m" (crc));
    asm("mov %0, %%ebx" :: "m" (p));
    asm("mov %0, %%ecx" :: "m" (len));
    asm("1:");
    // Processing four byte at a time: Unrolled four times:
    asm("crc32 %eax, 0x0(%ebx)");
    asm("crc32 %eax, 0x4(%ebx)");
    asm("crc32 %eax, 0x8(%ebx)");
    asm("crc32 %eax, 0xc(%ebx)");
    asm("add $0x10, %ebx");
    asm("sub $0x10, %ecx");
    asm("jecxz 2f");
    asm("jmp 1b");
    asm("2:");
    asm("mov %%eax, %0" : "=m" (crc));
    asm("popa");
    return crc;
}

Preliminary Results show CRC32 instruction out-performing the fastest CRC32C software algorithm by a big margin
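A hedged sketch of the same loop using the `_mm_crc32_u32` and `_mm_crc32_u8` intrinsics, with a bit-at-a-time software CRC32C as a reference for comparison. The function names `crc32c_sw`, `crc32c_hw` and `crc32c_update` are hypothetical, the dispatch relies on GCC/Clang extensions, and this is not the kernel code that was measured above. Unlike the slide's routine it accepts any length, not only multiples of 0x10.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit-at-a-time software CRC32C (reflected polynomial 0x82F63B78).
 * Slow, but useful as a reference for checking the hardware path. */
uint32_t crc32c_sw(uint32_t crc, const unsigned char *p, size_t len)
{
    assert(p != NULL || len == 0);
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

#if defined(__x86_64__) || defined(__i386__)
#include <nmmintrin.h> /* SSE4.2 intrinsics, including _mm_crc32_u32 */

__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(uint32_t crc, const unsigned char *p, size_t len)
{
    /* Four bytes per CRC32 instruction, as in the 32-bit version above. */
    while (len >= 4) {
        uint32_t word;
        memcpy(&word, p, sizeof word); /* safe unaligned load */
        crc = _mm_crc32_u32(crc, word);
        p += 4;
        len -= 4;
    }
    /* Remaining 0 to 3 bytes, one at a time. */
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
    return crc;
}
#endif

/* Updates a running CRC32C value; uses the CRC32 instruction if available.
 * No initial or final inversion is applied here, as in the slide's code. */
uint32_t crc32c_update(uint32_t crc, const unsigned char *p, size_t len)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw(crc, p, len);
#endif
    return crc32c_sw(crc, p, len);
}
```

Callers that want the conventional CRC32C value start from 0xFFFFFFFF and invert the result. With that convention the standard check string "123456789" gives 0xE3069283.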