A Look Inside Intel: The Core (Nehalem) Microarchitecture. Beeman Strong, Intel Core™ microarchitecture (Nehalem) Architect, Intel Corporation

  • Published on
    04-May-2018

Transcript

  • A Look Inside Intel: The Core (Nehalem) Microarchitecture

    Beeman Strong, Intel Core microarchitecture (Nehalem) Architect, Intel Corporation

  • 2

    Agenda

    Intel Core Microarchitecture (Nehalem) Design Overview

    Enhanced Processor Core: Performance Features; Intel Hyper-Threading Technology

    New Platform: New Cache Hierarchy; New Platform Architecture

    Performance Acceleration: Virtualization; New Instructions

    Power Management: Overview; Minimizing Idle Power Consumption; Performance when it counts

  • 3

    Scalable Cores

    Intel Core Microarchitecture (Nehalem), 45nm

    Common feature set

    Same core for all segments

    Common software optimization

    Servers/Workstations: Energy Efficiency, Performance, Virtualization, Reliability, Capacity, Scalability

    Desktop: Performance, Graphics, Energy Efficiency, Idle Power, Security

    Mobile: Battery Life, Performance, Energy Efficiency, Graphics, Security

    Optimized cores to meet all market segments

  • 4

    The First Intel Core Microarchitecture (Nehalem) Processor

    A Modular Design for Flexibility

    [Die diagram labels: Core (four), Queue, Shared L3 Cache, Memory Controller, QPI 0, QPI 1, Misc IO (two)]

    QPI: Intel QuickPath Interconnect (Intel QPI)

  • 6

    Intel Core Microarchitecture Recap

    Wide Dynamic Execution: 4-wide decode/rename/retire

    Advanced Digital Media Boost: 128-bit wide SSE execution units

    Intel HD Boost: New SSE4.1 Instructions

    Smart Memory Access: Memory Disambiguation; Hardware Prefetching

    Advanced Smart Cache: Low latency, high BW shared L2 cache

    Nehalem builds on the great Core microarchitecture

  • 7

    Designed for Performance

    [Core diagram blocks: Execution Units; Out-of-Order Scheduling & Retirement; L2 Cache & Interrupt Servicing; Instruction Fetch & L1 Cache; Branch Prediction; Instruction Decode & Microcode; Paging; L1 Data Cache; Memory Ordering & Execution]

    Enhancements called out on the diagram: Additional Caching Hierarchy; New SSE4.2 Instructions; Deeper Buffers; Faster Virtualization; Simultaneous Multi-Threading; Better Branch Prediction; Improved Lock Support; Improved Loop Streaming

  • 8

    Macrofusion

    Introduced in Intel Core 2 microarchitecture

    TEST/CMP instruction followed by a conditional branch treated as a single instruction: decode/execute/retire as one instruction

    Higher performance & improved power efficiency: improves throughput and reduces execution latency; less processing required to accomplish the same work

    Intel Core microarchitecture (Nehalem) supports all the cases in Intel Core 2 microarchitecture, plus CMP+Jcc macrofusion for the following branch conditions: JL/JNGE, JGE/JNL, JLE/JNG, JG/JNLE

    Intel Core microarchitecture (Nehalem) supports macrofusion in both 32-bit and 64-bit modes; Intel Core 2 microarchitecture only supports macrofusion in 32-bit mode

    Increased macrofusion benefit on Intel Core microarchitecture (Nehalem)
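
As a rough illustration of the macrofusion slide, the sketch below counts CMP+Jcc pairs that a toy front end would treat as one instruction. The four added condition pairs and the 32-bit/64-bit distinction come from the slide. The function name, the mnemonic-list input, and the `ASSUMED_CORE2_JCC` baseline are hypothetical: the slide does not enumerate the Core 2 cases, so that set is only a placeholder, and TEST fusion is left out for brevity.

```python
# Conditions the slide lists as newly fusible with CMP on Nehalem.
NEHALEM_ADDED_JCC = frozenset(
    {"jl", "jnge", "jge", "jnl", "jle", "jng", "jg", "jnle"}
)

# Placeholder only: the slide does not list the Core 2 cases.
ASSUMED_CORE2_JCC = frozenset({"je", "jne"})


def count_fused(mnemonics, fusible_jcc, mode_bits, fuses_in_64bit):
    """Count CMP+Jcc pairs this toy model treats as one fused instruction."""
    if mode_bits == 64 and not fuses_in_64bit:
        return 0  # per the slide, Core 2 fuses only in 32-bit mode
    fused = 0
    i = 0
    while i < len(mnemonics) - 1:
        if mnemonics[i] == "cmp" and mnemonics[i + 1] in fusible_jcc:
            fused += 1
            i += 2  # the pair is consumed as a single instruction
        else:
            i += 1
    return fused


stream = ["cmp", "jl", "add", "cmp", "je", "cmp", "jg"]
core2_32 = count_fused(stream, ASSUMED_CORE2_JCC, 32, fuses_in_64bit=False)
core2_64 = count_fused(stream, ASSUMED_CORE2_JCC, 64, fuses_in_64bit=False)
nehalem_64 = count_fused(
    stream, ASSUMED_CORE2_JCC | NEHALEM_ADDED_JCC, 64, fuses_in_64bit=True
)
print(core2_32, core2_64, nehalem_64)
```

With this made-up instruction stream, only the signed-comparison branches change the count between the two configurations, which is the point the slide makes about compiled code that uses signed compares.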

  • 9

    Intel Core Microarchitecture (Nehalem) Loop Stream Detector

    Loop Stream Detector identifies software loops: stream from Loop Stream Detector instead of normal path; disable unneeded blocks of logic for power savings; higher performance by removing instruction fetch limitations

    Higher performance: expand the size of the loops detected (vs Core 2)

    Improved power efficiency: disable even more logic (vs Core 2)

    [Diagram: Branch Prediction, Fetch, Decode, Loop Stream Detector (28 Micro-Ops)]
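
The effect described on the Loop Stream Detector slide can be sketched with a very small model. The 28 micro-op capacity is taken from the diagram. Everything else is an assumption made for illustration: the function name is hypothetical, and the model pretends a loop is decoded exactly once and then replayed, whereas real hardware needs some iterations to detect the loop.

```python
LSD_CAPACITY_UOPS = 28  # capacity shown on the slide's diagram


def front_end_work(loop_body_uops, iterations, capacity=LSD_CAPACITY_UOPS):
    """Return (uops_decoded, uops_streamed) for one loop in this toy model.

    Assumption: a loop that fits is decoded once, then every later
    iteration is streamed from the detector with fetch/decode idle.
    """
    if loop_body_uops <= 0 or iterations <= 0:
        raise ValueError("loop_body_uops and iterations must be positive")
    total = loop_body_uops * iterations
    if loop_body_uops > capacity:
        return total, 0  # too large: every iteration uses the normal path
    decoded = loop_body_uops
    return decoded, total - decoded


small_loop = front_end_work(20, 100)
large_loop = front_end_work(40, 100)
print(small_loop, large_loop)
```

The streamed count is a stand-in for the work that fetch and decode no longer do, which is where the slide's power saving comes from.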

  • 10

    Branch Prediction Improvements

    Focus on improving branch prediction accuracy each CPU generation: higher performance & lower power through more accurate prediction

    Example Intel Core microarchitecture (Nehalem) improvements:

    L2 Branch Predictor: improve accuracy for applications with large code size (ex. database applications)

    Advanced Renamed Return Stack Buffer (RSB): remove branch mispredicts on x86 RET instruction (function returns) in the common case

    Greater Performance through Branch Prediction
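
To show why a return stack buffer predicts RET targets well, here is a minimal sketch of a plain RSB: push the return address on a call, pop it as the prediction on a return. The slide does not describe how the "renamed" variant works, so none of that is modelled. The class name, the trace format, and the default depth of 16 are hypothetical.

```python
class ReturnStackBuffer:
    """Plain return-address stack; depth is an illustrative value."""

    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # buffer full: the oldest entry is lost
        self.stack.append(return_address)

    def predict_return(self):
        return self.stack.pop() if self.stack else None


def count_mispredicts(trace, depth=16):
    """trace holds ("call", return_address) and ("ret", actual_target)."""
    rsb = ReturnStackBuffer(depth)
    mispredicts = 0
    for kind, address in trace:
        if kind == "call":
            rsb.on_call(address)
        elif kind == "ret" and rsb.predict_return() != address:
            mispredicts += 1
    return mispredicts


nested = [("call", 0x10), ("call", 0x20), ("ret", 0x20), ("ret", 0x10)]
deep_enough = count_mispredicts(nested)
too_shallow = count_mispredicts(nested, depth=1)
print(deep_enough, too_shallow)
```

Properly nested calls and returns are the common case the slide refers to. The depth-1 run shows one way a simple buffer starts to mispredict.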

  • 11

    Execution Unit Overview

    Execute 6 operations/cycle

    3 Memory Operations: 1 Load, 1 Store Address, 1 Store Data

    3 Computational Operations

    Unified Reservation Station: schedules operations to execution units; single scheduler for all execution units; can be used by all integer, all FP, etc.

    [Diagram: Unified Reservation Station feeding Port 0 through Port 5. Execution unit labels: Load; Store Address; Store Data; Integer ALU & Shift (two); Integer ALU & LEA; Branch; FP Add; FP Multiply; Complex Integer; Divide; SSE Integer ALU and Integer Shuffles (two); SSE Integer Multiply; FP Shuffle]

  • 12

    Increased Parallelism

    Goal: keep powerful execution engine fed

    Nehalem increases size of out of order window by 33%

    Must also increase other corresponding structures

    [Chart: Concurrent uOps Possible (axis 0 to 128) for Dothan, Merom, and Nehalem (1)]

    Increased Resources for Higher Performance

    Structure             Merom (1)   Nehalem (1)   Comment
    Reservation Station   32          36            Dispatches operations to execution units
    Load Buffers          32          48            Tracks all load operations allocated
    Store Buffers         20          32            Tracks all store operations allocated

    (1) Dothan: Intel Pentium M processor (formerly Dothan). Merom: Intel Core microarchitecture (formerly Merom). Nehalem: Intel Core microarchitecture (Nehalem).
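
The buffer sizes in the table translate into relative growth as follows. The three structure sizes come from the slide. The 96 and 128 entry counts used to check the 33% figure do not appear in this transcript; they are assumed from commonly cited reorder buffer sizes and are labelled as such in the code.

```python
# (Merom, Nehalem) entry counts taken from the slide's table.
STRUCTURES = {
    "Reservation Station": (32, 36),
    "Load Buffers": (32, 48),
    "Store Buffers": (20, 32),
}

# Assumption, not from the transcript: out-of-order window sizes.
ASSUMED_WINDOW = (96, 128)


def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100


growth = {
    name: round(percent_increase(old, new), 1)
    for name, (old, new) in STRUCTURES.items()
}
window_growth = round(percent_increase(*ASSUMED_WINDOW), 1)
print(growth, window_growth)
```

Under these numbers the load and store buffers grow proportionally more than the window itself, while the reservation station grows least.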

  • 13

    Enhanced Memory Subsystem

    Responsible for: handling of memory operations (loads/stores)

    Key Intel Core 2 features: Memory Disambiguation; Hardware Prefetchers; Advanced Smart Cache

    New Intel Core Microarchitecture (Nehalem) features: New TLB Hierarchy (new, low latency 2nd level unified TLB); Fast 16-Byte unaligned accesses; Faster Synchronization Primitives

  • 14

    Intel Hyper-Threading Technology

    Also known as Simultaneous Multi-Threading (SMT): run 2 threads at the same time per core

    Take advantage of 4-wide execution engine: keep it fed with multiple threads; hide latency of a single thread

    Most power efficient performance feature: very low die area cost; can provide significant performance benefit depending on application; much more efficient than adding an entire core

    Intel Core microarchitecture (Nehalem) advantages: larger caches; massive memory BW

    Simultaneous multi-threading enhances performance and energy efficiency

    [Diagram: execution unit occupancy over time (proc. cycles), w/o SMT vs SMT. Note: each box represents a processor execution unit]
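
The occupancy diagram can be approximated with a toy model in which each thread is a list giving how many execution units it can use in each of its cycles (0 for a stall). This is only a sketch under stated assumptions: the sharing rule (both threads advance when their combined demand fits the width, otherwise they take turns) and all names are hypothetical, not Intel's scheduling policy. The width of 4 echoes the slide's "4-wide execution engine".

```python
def cycles_without_smt(threads):
    """Threads run back to back, one at a time."""
    return sum(len(t) for t in threads)


def cycles_with_smt(thread_a, thread_b, width=4):
    """Both threads share `width` execution units each cycle (toy rule)."""
    threads = (thread_a, thread_b)
    for t in threads:
        if any(d < 0 or d > width for d in t):
            raise ValueError("per-cycle demand must be between 0 and width")
    pos = [0, 0]   # next unissued cycle of each thread
    priority = 0   # thread that wins when demand exceeds width
    cycles = 0
    while pos[0] < len(thread_a) or pos[1] < len(thread_b):
        cycles += 1
        active = [i for i in (0, 1) if pos[i] < len(threads[i])]
        demand = sum(threads[i][pos[i]] for i in active)
        if demand <= width:
            for i in active:
                pos[i] += 1
        else:
            # Only possible when both threads are active: take turns.
            pos[priority] += 1
            priority = 1 - priority
    return cycles


a = [2, 0, 1, 2]
b = [1, 2, 0, 2]
serial = cycles_without_smt([a, b])
shared = cycles_with_smt(a, b)
print(serial, shared)
```

When both threads leave units idle, as in this example, sharing fills the gaps. When each thread already uses most of the units, the model shows no gain, which matches the slide's "depending on application" caveat.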

  • 15

    SMT Performance Chart

    Source: Intel. Configuration: pre-production Intel Core i7 processor with 3 channel DDR3 memory. Performance tests and ratings are measured using specific computer systems and / or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/

    SPEC, SPECint, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation. For more information on SPEC benchmarks, see: http://www.spec.org

    Performance Gain, SMT enabled vs disabled (Intel Core i7; chart axis 0% to 40%)

    Floating Point            7%
    3dsMax*                  10%
    Integer                  13%
    Cinebench* 10            16%
    POV-Ray* 3.7 beta 25     29%
    3DMark* Vantage* CPU     34%

    Floating Point is based on SPECfp_rate_base2006* estimate. Integer is based on SPECint_rate_base2006* estimate.

  • 17

    Designed For Modularity

    Optimal price / performance / energy efficiency for server, desktop and mobile products

    [Diagram labels: DRAM; Intel QPI; Core; Uncore; CORE (three shown); IMC; Intel QPI; Power & Clock; # QPI Links; # mem channels]

    Size ofSi
