A Look Inside Intel®: The Core (Nehalem) Microarchitecture
Beeman Strong, Intel® Core™ microarchitecture (Nehalem) Architect, Intel Corporation

  • Slide 1

A Look Inside Intel: The Core (Nehalem) Microarchitecture
Beeman Strong, Intel Core microarchitecture (Nehalem) Architect, Intel Corporation

  • Slide 2

Agenda
    Intel Core Microarchitecture (Nehalem) Design Overview
    Enhanced Processor Core
        Performance Features
        Intel Hyper-Threading Technology
    New Platform
        New Cache Hierarchy
        New Platform Architecture
    Performance Acceleration
        Virtualization
        New Instructions
    Power Management Overview
        Minimizing Idle Power Consumption
        Performance when it counts

  • Slide 3

Scalable Cores
    Same core for all segments
        Common feature set
        Common software optimization
    Intel Core Microarchitecture (Nehalem), 45nm
    Servers/Workstations: Energy Efficiency, Performance, Virtualization, Reliability, Capacity, Scalability
    Desktop: Performance, Graphics, Energy Efficiency, Idle Power, Security
    Mobile: Battery Life, Performance, Energy Efficiency, Graphics, Security
    Optimized cores to meet all market segments

  • Slide 4

The First Intel Core Microarchitecture (Nehalem) Processor
    A Modular Design for Flexibility
    [Die diagram labels: Misc IO, QPI 0, QPI 1, Memory Controller, Core, Queue, Shared L3 Cache]
    QPI: Intel QuickPath Interconnect (Intel QPI)

  • Slide 5

Agenda (same as Slide 2)

  • Slide 6

Intel Core Microarchitecture Recap
    Wide Dynamic Execution
        4-wide decode/rename/retire
    Advanced Digital Media Boost
        128-bit wide SSE execution units
    Intel HD Boost
        New SSE4.1 Instructions
    Smart Memory Access
        Memory Disambiguation
        Hardware Prefetching
    Advanced Smart Cache
        Low latency, high BW shared L2 cache
    Nehalem builds on the great Core microarchitecture

  • Slide 7

Designed for Performance
    [Core diagram labels: Execution Units; Out-of-Order Scheduling & Retirement; L2 Cache & Interrupt Servicing; Instruction Fetch & L1 Cache; Branch Prediction; Instruction Decode & Microcode; Paging; L1 Data Cache; Memory Ordering & Execution]
    Additional Caching Hierarchy
    New SSE4.2 Instructions
    Deeper Buffers
    Faster Virtualization
    Simultaneous Multi-Threading
    Better Branch Prediction
    Improved Lock Support
    Improved Loop Streaming

  • Slide 8

Macrofusion
    Introduced in Intel Core 2 microarchitecture
    TEST/CMP instruction followed by a conditional branch treated as a single instruction
        Decode/execute/retire as one instruction
    Higher performance & improved power efficiency
        Improves throughput/Reduces execution latency
        Less processing required to accomplish the same work
    Support all the cases in Intel Core 2 microarchitecture PLUS
        CMP+Jcc macrofusion added for the following branch conditions
            JL/JNGE
            JGE/JNL
            JLE/JNG
            JG/JNLE
        Intel Core microarchitecture (Nehalem) supports macrofusion in both 32-bit and 64-bit modes
            Intel Core 2 microarchitecture only supports macrofusion in 32-bit mode
    Increased macrofusion benefit on Intel Core microarchitecture (Nehalem)

  • Slide 9

Intel Core Microarchitecture (Nehalem) Loop Stream Detector
    Loop Stream Detector identifies software loops
        Stream from Loop Stream Detector instead of normal path
        Disable unneeded blocks of logic for power savings
        Higher performance by removing instruction fetch limitations
    Higher performance: Expand the size of the loops detected (vs Core 2)
    Improved power efficiency: Disable even more logic (vs Core 2)
    [Diagram labels: Branch Prediction, Fetch, Decode, Loop Stream Detector, 28 Micro-Ops]

  • Slide 10

Branch Prediction Improvements
    Focus on improving branch prediction accuracy each CPU generation
        Higher performance & lower power through more accurate prediction
    Example Intel Core microarchitecture (Nehalem) improvements
        L2 Branch Predictor
            Improve accuracy for applications with large code size (ex. database applications)
        Advanced Renamed Return Stack Buffer (RSB)
            Remove branch mispredicts on x86 RET instruction (function returns) in the common case
    Greater Performance through Branch Prediction

  • Slide 11

Execution Unit Overview
    Execute 6 operations/cycle
        3 Memory Operations
            1 Load
            1 Store Address
            1 Store Data
        3 Computational Operations
    Unified Reservation Station
        Schedules operations to Execution units
        Single Scheduler for all Execution Units
        Can be used by all integer, all FP, etc.
    [Diagram labels: Unified Reservation Station; Port 0, Port 1, Port 2, Port 3, Port 4, Port 5; Load; Store Address; Store Data; Integer ALU & Shift; Integer ALU & LEA; Integer ALU & Shift; Branch; FP Add; FP Multiply; Complex Integer; Divide; SSE Integer ALU, Integer Shuffles; SSE Integer Multiply; FP Shuffle; SSE Integer ALU, Integer Shuffles]

  • Slide 12

Increased Parallelism
    Goal: Keep powerful execution engine fed
    Nehalem increases size of out of order window by 33%
    Must also increase other corresponding structures

    Structure            | Intel Core microarchitecture (formerly Merom) | Intel Core microarchitecture (Nehalem) | Comment
    Reservation Station  | 32                                            | 36                                     | Dispatches operations to execution units
    Load Buffers         | 32                                            | 48                                     | Tracks all load operations allocated
    Store Buffers        | 20                                            | 32                                     | Tracks all store operations allocated

    Increased Resources for Higher Performance
    [Chart legend: Intel Pentium M processor (formerly Dothan); Intel Core microarchitecture (formerly Merom); Intel Core microarchitecture (Nehalem)]

  • Slide 13

Enhanced Memory Subsystem
    Responsible for:
        Handling of memory operations (loads/stores)
    Key Intel Core 2 Features
        Memory Disambiguation
        Hardware Prefetchers
        Advanced Smart Cache
    New Intel Core Microarchitecture (Nehalem) Features
        New TLB Hierarchy (new, low latency 2nd level unified TLB)
        Fast 16-Byte unaligned accesses
        Faster Synchronization Primitives

  • Slide 14

Intel Hyper-Threading Technology
    Also known as Simultaneous Multi-Threading (SMT)
        Run 2 threads at the same time per core
    Take advantage of 4-wide execution engine
        Keep it fed with multiple threads
        Hide latency of a single thread
    Most power efficient performance feature
        Very low die area cost
        Can provide significant performance benefit depending on application
        Much more efficient than adding an entire core
    Intel Core microarchitecture (Nehalem) advantages
        Larger caches
        Massive memory BW
    Simultaneous multi-threading enhances performance and energy efficiency
    [Diagram: Time (proc. cycles), w/o SMT vs. SMT. Note: Each box represents a processor execution unit]

  • Slide 15

SMT Performance Chart
    Floating Point is based on SPECfp_rate_base2006* estimate
    Integer is based on SPECint_rate_base2006* estimate
    Source: Intel. Configuration: pre-production Intel Core i7 processor with 3 channel DDR3 memory. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/
    SPEC, SPECint, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation. For more information on SPEC benchmarks, see: http://www.spec.org

  • Slide 16

Agenda (same as Slide 2)

  • Slide 17

Designed For Modularity
    Optimal price / performance / energy efficiency for server, desktop and mobile products
    Differentiation in the Uncore:
        # QPI Links
        # mem channels
        Size of cache
        # cores
        Power Management
        Type of Memory
        Integrated graphics
    2008 2009 Servers & Desktops
    [Diagram labels: DRAM; Intel QPI; Core; Uncore; CORE (x3); IMC; L3 Cache; Power & Clock]
    Intel QPI: Intel QuickPath Interconnect (Intel QPI)

  • Slide 18

Intel Smart Cache: 3rd Level Cache
    Shared across all cores
    Size depends on # of cores
        Quad-core: Up to 8MB (16-ways)
        Scalability:
            Built to vary size with varied core counts
            Built to easily increase L3 size in future parts
    Perceived latency depends on frequency ratio between core & uncore
    Inclusive cache policy for best performance
        Address residing in L1/L2 must be present in 3rd level cache
    [Diagram labels: L3 Cache; three cores, each with L2 Cache and L1 Caches]

  • Slide 19

Why Inclusive?
    Inclusive cache provides benefit of an on-die snoop filter
    Core Valid Bits
        1 bit per core per cache line
            If line may be in a core, set core valid bit
            Snoop only needed if line is in L3 and core valid bit is set
            Guaranteed that line is not modified if multiple bits set
    Scalability
        Addition of cores/sockets does not increase snoop traffic seen by cores
    Latency
        Minimize effective cache latency by eliminating cross-core snoops in the common case
        Minimize snoop response time for cross-socket cases

  • Slide 20

Intel Core Microarchitecture (Nehalem-EP) Platform Architecture
    Integrated Memory Controller
        3 DDR3 channels per socket
        Massive memory bandwidth
        Memory Bandwidth scales with # of processors
        Very low memory latency
    Intel QuickPath Interconnect (Intel QPI)
        New point-to-point interconnect
        Socket to socket connections
        Socket to chipset connections
        Build scalable solutions
        Up to 6.4 GT/sec (12.8 GB/sec)
        Bidirectional (=> 25.6 GB/sec)
    Significant performance leap from new platform
    [Diagram labels: Nehalem EP; Tylersburg EP; CPU; IOH; memory]
    Intel Core microarchitecture (Nehalem-EP)
    Intel Next Generation Server Processor Technology (Tylersburg-EP)

  • Slide 21

Non-Uniform Memory Access (NUMA)
    FSB architectu
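
The core valid bits described on Slide 19 can be pictured with a toy model. The sketch below is only an illustration under stated assumptions, not Intel's implementation: the class `InclusiveL3`, its method names, and the choice to skip the requesting core are hypothetical, and the model tracks only the per-line bitmask (1 bit per core), not coherence states or evictions.

```python
class InclusiveL3:
    """Toy model of an inclusive L3 acting as a snoop filter (hypothetical names)."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        # address -> core-valid bitmask; bit i set means the line MAY be in core i
        self.core_valid = {}

    def fill(self, addr, core):
        """A core pulls a line into its L1/L2; inclusion means L3 holds it too."""
        self.core_valid[addr] = self.core_valid.get(addr, 0) | (1 << core)

    def cores_to_snoop(self, addr, requester):
        """Return the cores that must be snooped for a request to addr."""
        mask = self.core_valid.get(addr)
        if mask is None:
            # Not in L3, so (by inclusion) not in any core's L1/L2: no snoops.
            return []
        # Assumption of this sketch: the requesting core is never snooped.
        return [c for c in range(self.num_cores)
                if c != requester and mask & (1 << c)]


l3 = InclusiveL3(num_cores=4)
l3.fill(0x40, core=1)
print(l3.cores_to_snoop(0x40, requester=0))  # only core 1 has its bit set
print(l3.cores_to_snoop(0x80, requester=0))  # L3 miss: nothing to snoop
```

In this model an L3 miss, or an L3 hit with no other core valid bit set, costs no cross-core snoop at all, which is the common case the slide refers to.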

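
The Intel QPI figures on Slide 20 (6.4 GT/sec, 12.8 GB/sec, 25.6 GB/sec bidirectional) are consistent with a link that carries 2 bytes of data per transfer in each direction. That payload width is an assumption here, since the slide gives only the resulting rates, and the helper name below is hypothetical. A minimal sketch of the arithmetic:

```python
def qpi_bandwidth_gb_per_s(transfers_gt_per_s, bytes_per_transfer=2.0, directions=1):
    """Bandwidth in GB/sec = transfer rate (GT/sec) x bytes per transfer x directions.

    bytes_per_transfer=2.0 is an assumed data payload width, chosen because it
    reproduces the slide's 12.8 GB/sec figure from 6.4 GT/sec.
    """
    return transfers_gt_per_s * bytes_per_transfer * directions


one_way = qpi_bandwidth_gb_per_s(6.4)                  # per-direction rate
both_ways = qpi_bandwidth_gb_per_s(6.4, directions=2)  # bidirectional total
print(one_way, both_ways)
```

The bidirectional figure is simply twice the per-direction rate, because a QPI link sends in both directions at the same time.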