Intel Nehalem Processor

Embed Size (px)

Citation preview

  • 8/11/2019 Intel Nehalem Processor

    1/65

    Ronak Singhal

    Nehalem Architect

    NGMS001

    Inside Intel Next GenerationNehalem Microarchitecture

  • 8/11/2019 Intel Nehalem Processor

    2/65

    2

    Nehalem Design Philosophy Enhanced Processor Core

    Performance Features

    Simultaneous Multi-Threading Feeding the Engine

    New Cache Hierarchy

    New Platform Architecture

    Performance AccelerationVirtualizationNew Instructions

    Agenda

    Nehalem: Next Generation Intel Microarchitecture

  • 8/11/2019 Intel Nehalem Processor

    3/65

    3

    NEHALEM

    TO K

    SANDY

    BRIDGE

    TO K

    WESTMERE

    2009-10

    32nm

    T K

    2005-06

    Intel

    Core2,Xeon

    processors

    Intel

    Pentium

    D,Xeon,

    Core

    processors

    T K TO K

    2007-08

    PENRYN

    45nm

    T K

    65nm

    Tick Tock Development Model

    NehalemNehalem --

    The IntelThe Intel

    45 nm Tock Processor45 nm Tock Processor

  • 8/11/2019 Intel Nehalem Processor

    4/654

    Nehalem Design Goals

    Existing AppsExisting Apps

    Emerging AppsEmerging Apps

    All UsagesAll Usages

    Single ThreadSingle Thread

    MultiMulti--threadsthreads

    Workstation / ServerDesktop / Mobile

    World class performance combined with superior energy efficiencyWorld class performance combined with superior energy efficiency

    Optimized for:Optimized for:

    A single, scalable, foundation optimized across each segment andA single, scalable, foundation optimized across each segment and

    power envelopepower envelope

    Dynamically scaledperformance when

    needed to maximizeenergy efficiency

    Nehalem: Next Generation Intel Microarchitecture

    A Dynamic and Design Scalable Microarchitecture

    Dynamic and Design Scalable Microarchitecture

  • 8/11/2019 Intel Nehalem Processor

    5/655

    Scalable Nehalem Core

    Common feature setCommon feature setSame core forSame core for

    all segmentsall segments

    Common softwareCommon software

    optimizationoptimization

    45nm45nm

    Servers/Workstations

    Energy Efficiency,Performance,

    Virtualization,Reliability, Capacity, Scalability

    Nehalem

    ehalem

    DesktopPerformance,

    Graphics, EnergyEfficiency, Idle Power, Security

    Mobile

    Battery Life, Performance,

    Energy Efficiency, Graphics,Security

    Optimized Nehalem core to meet all market segments

    ptimized Nehalem core to meet all market segments

    Nehalem: Next Generation Intel Microarchitecture

  • 8/11/2019 Intel Nehalem Processor

    6/656

    Intel Core Microarchitecture Recap

    Wide Dynamic Execution

    - 4-wide decode/rename/retire

    Advanced Digital Media Boost

    - 128-bit wide SSE execution units Intel HD Boost

    - New SSE4.1 Instructions

    Smart Memory Access

    - Memory Disambiguation- Hardware Prefetching

    Advanced Smart Cache

    - Low latency, high BW shared L2 cache

    N e h a le m b u i l d s o n t h e g r e a t I n t e l Co r e m i c r o a r ch i t e c t u r e

    Nehalem: Next Generation Intel

    Microarchitecture

  • 8/11/2019 Intel Nehalem Processor

    7/657

    Nehalem Design Philosophy Enhanced Processor Core

    Performance Features

    Simultaneous Multi-Threading Feeding the Engine

    New Cache HierarchyNew Platform Architecture

    Performance AccelerationVirtualizationNew Instructions

    Agenda

    Nehalem: Next Generation Intel

    Microarchitecture

  • 8/11/2019 Intel Nehalem Processor

    8/658

    Designed for Performance

    Execution

    Units

    Out-of-Order

    Scheduling &

    Retirement

    L2 Cache

    & Interrupt

    Servicing

    Instruction Fetch

    & L1 Cache

    Branch PredictionInstruction

    Decode &

    Microcode

    Paging

    L1 Data Cache

    Memory Ordering

    & Execution

    Additional Caching

    Hierarchy

    New SSE4.2

    Instructions

    DeeperBuffers

    Faster

    Virtualization

    Simultaneous

    Multi-ThreadingBetter Branch

    Prediction

    Improved Lock

    Support

    ImprovedLoop

    Streaming

  • 8/11/2019 Intel Nehalem Processor

    9/659

    Enhanced Processor Core

    Instruction Fetch andPre Decode

    Instruction Queue

    Decode

    ITLB

    Rename/Allocate

    Retirement Unit(ReOrder Buffer)

    Reservation Station

    Execution Units

    DTLB

    2nd Level TLB4

    4

    6

    32kBInstruction Cache

    32kB

    Data Cache

    256kB

    2nd Level Cache

    L3 and beyond

    Fr o n t En d E x e c u t i o n

    E n g i n e

    M e m o r y

  • 8/11/2019 Intel Nehalem Processor

    10/6510

    Front-end

    Responsible for feeding the compute engine- Decode instructions- Branch Prediction

    Key Intel Core2 processor features- 4-wide decode- Macrofusion- Loop Stream Detector

    Instruction Fetch and

    Pre Decode

    Instruction Queue

    Decode

    ITLB32kB

    Instruction Cache

  • 8/11/2019 Intel Nehalem Processor

    11/6511

    Macrofusion Recap

    Introduced in the Intel Core2 Processor Family

    TEST/CMP instruction followed by a conditionalbranch treated as a single instruction

    - Decode as one instruction- Execute as one instruction- Retire as one instruction

    Higher p e r f o r m a n ce - Improves throughput- Reduces execution latency

    Improved p o w e r e f f i c i e n c y - Less processing required to accomplish the same work

  • 8/11/2019 Intel Nehalem Processor

    12/65

    12

    Nehalem Macrofusion

    Goal: Identify more macrofusion opportunities for increasedp e r f o r m a n c e and p o w e r e f f i c i e n c y

    Support all the cases in Intel Core2 processor PLUS- CMP+Jcc macrofusion added for the following branch conditions

    JL/JNGE

    JGE/JNL

    JLE/JNG

    JG/JNLE

    Intel Core 2 processor only supports macrofusion in 32-bit mode

    - Nehalem supports macrofusion in both 32-bit and 64-bit modes

    I n c r e a s e d m a c r o f u s i o n b e n e f i t o n N e h a l e m

    Nehalem: Next Generation Intel

    Microarchitecture

  • 8/11/2019 Intel Nehalem Processor

    13/65

    13

    Loop Stream Detector Reminder

    Loops are very common in most software Take advantage of knowledge of loops in HW

    - Decoding the same instructions over and over- Making the same branch predictions over and over

    Loop Stream Detector identifies software loops- Stream from Loop Stream Detector instead of normal path- Disable unneeded blocks of logic for p o w e r sa v in g s - H i g h e r p e r f o r m a n ce by removing instruction fetch limitations

    I n t e l Co r e 2 p r o c e ss o r L o o p St r e a m D e t e c t o r

    Branch

    Prediction Fetch Decode

    Loop

    Stream

    Detector

    1 8

    I n s t r u c t i o n s

    Nehalem: Next Generation Intel

    Microarchitecture

  • 8/11/2019 Intel Nehalem Processor

    14/65

    14

    Nehalem Loop Stream Detector

    Same concept as in prior implementations

    H i g h e r p e r f o r m a n ce : Expand the size of theloops detected

    I m p r o v e d p o w e r e f f i c i e n c y : Disable even morelogic

    N e h a l e m Lo o p S t r e a m D e t e c t o r

    Branch

    Prediction Fetch Decode

    Loop

    Stream

    Detector

    2 8

    M i c r o - O p s

    Nehalem: Next Generation Intel Microarchitecture

  • 8/11/2019 Intel Nehalem Processor

    15/65

    15

    Branch Prediction Reminder

    Goal: Keep powerful compute engine fed

    Options:- Stall pipeline while determining branch direction/target- Predict branch direction/target and correct if wrong

    High branch prediction accuracy leads to:- Pe r f o r m a n ce : Minimize amount of time wasted correcting

    from incorrect branch predictions Through higher branch prediction accuracy

    Through faster correction when prediction is wrong

    - Po w e r e f f i c i e n c y : Minimize number ofspeculative/incorrect micro-ops that are executedCo n t i n u e d f o c u s o n b r a n ch

    p r e d i c t i o n i m p r o v e m e n t s

  • 8/11/2019 Intel Nehalem Processor

    16/65

    16

    L2 Branch Predictor

    Problem: Software with a large code footprint notable to fit well in existing branch predictors

    - Example: Database applications

    Solution: Use multi-level branch prediction scheme

    Benefits:- Higher p e r f o r m a n ce through improved branch prediction

    accuracy

    - Greater p o w e r e f f i c i e n c y through less mis-speculation

    d S k ff ( S )

  • 8/11/2019 Intel Nehalem Processor

    17/65

    17

    Renamed Return Stack Buffer (RSB)

    Instruction Reminder- CALL: Entry into functions- RET: Return from functions

    Classical Solution- Return Stack Buffer (RSB) used to predict RET- Problems:

    RSB can be corrupted by speculative path

    RSB can overflow, overwriting valid entries

    The Re n am e d RSB - No RET mispredicts in the common case

    - No stack overflow issue

    E i E i

  • 8/11/2019 Intel Nehalem Processor

    18/65

    18

    Execution Engine

    Start with powerful Intel Core2 processorexecution engine- Dynamic 4-wide Execution

    - Advanced Digital Media Boost 128-bit wide SSE- HD Boost (Penryn)

    SSE4.1 instructions

    - Super Shuffler (Penryn)- Add Nehalem enhancements- Additional parallelism for higher performance

    Nehalem: Next Generation Intel

    Microarchitecture

    E ti U it O i

  • 8/11/2019 Intel Nehalem Processor

    19/65

    19

    Execute 6 operations/cycle

    3 Memory Operations

    1 Load

    1 Store Address

    1 Store Data

    3 Computational Operations

    Execution Unit Overview

    Unified Reservation Station

    Port0

    Port1

    Port2

    Po

    rt3

    Port4

    Po

    rt5

    LoadStore

    Address

    Store

    Data

    Integer ALU &

    Shift

    Integer ALU &

    LEA

    Integer ALU &

    Shift

    BranchFP AddFP Multiply

    Complex

    IntegerDivide

    SSE Integer ALU

    Integer Shuffles

    SSE Integer

    Multiply

    FP Shuffle

    SSE Integer ALU

    Integer Shuffles

    Unified Reservation Station Schedules operations to Execution units

    Single Scheduler for all Execution Units

    Can be used by all integer, all FP, etc.

    I d P ll li

  • 8/11/2019 Intel Nehalem Processor

    20/65

    20

    Increased Parallelism

    Goal: Keep powerfulexecution engine fed

    Nehalem increases size of

    out of order window by 33% Must also increase other

    corresponding structures 016

    3248

    64

    80

    96

    112

    128

    Dothan Merom Nehalem

    Concurrent uOps Possible

    I n c r e a s e d R eso u r c e s f o r H i g h e r Pe r f o r m a n c e

    Structure Merom Nehalem CommentReservation Station 32 36 Dispatches operations

    to execution units

    Load Buffers 32 48 Tracks all loadoperations allocated

    Store Buffers 20 32 Tracks all storeoperations allocated

    Nehalem: Next Generation Intel

    Microarchitecture

    Source: Microarchitecture specifications

    E h d M S b t

  • 8/11/2019 Intel Nehalem Processor

    21/65

    21

    Enhanced Memory Subsystem

    Start with great Intel Core 2 Processor Features- Memory Disambiguation- Hardware Prefetchers

    - Advanced Smart Cache

    New Nehalem Features- New TLB Hierarchy

    - Fast 16-Byte unaligned accesses- Faster Synchronization Primitives

    New TLB Hierarchy

  • 8/11/2019 Intel Nehalem Processor

    22/65

    22

    New TLB Hierarchy

    Problem: Applications continue to grow in data size

    Need to increase TLB size to keep the pace for performance

    Nehalem adds new low-latency unified 2nd level TLB

    # of Entries

    1st Level Instruction TLBs

    Small Page (4k) 128

    Large Page (2M/4M) 7 per thread

    1st Level Data TLBs

    Small Page (4k) 64

    Large Page (2M/4M) 32

    New 2nd Level Unified TLB

    Small Page Only 512

    Fast Unaligned Cache Accesses

  • 8/11/2019 Intel Nehalem Processor

    23/65

    23

    Fast Unaligned Cache Accesses

    Two flavors of 16-byte SSE loads/stores exist- Aligned (MOVAPS/D, MOVDQA) -- Must be aligned on a 16-byte boundary- Unaligned (MOVUPS/D, MOVDQU) -- No alignment requirement

    Prior to Nehalem- Optimized for Aligned instructions

    - Unaligned instructions slower, lower throughput -- Even for aligned accesses! Required multiple uops (not energy efficient)- Compilers would largely avoid unaligned load

    2-instruction sequence (MOVSD+MOVHPD) was faster

    Nehalem optimizes Unaligned instructions- Same speed/throughput as Aligned instructions on aligned accesses

    - Optimizations for making accesses that cross 64-byte boundaries fast Lower latency/higher throughput than Intel Core 2 processor- Aligned instructions remain fast

    No reason to use aligned instructions on Nehalem! Benefits:

    - Compiler can now use unaligned instructions without fear- H ig h e r p e r f o r m a n c e on key media algorithms- More e n e r g y e f f i c i e n t than prior implementations

    Nehalem: Next Generation Intel Microarchitecture

    Faster Synchronization Primitives

  • 8/11/2019 Intel Nehalem Processor

    24/65

    24

    Faster Synchronization Primitives

    Multi-threaded softwarebecoming more prevalent

    S c a l a b i l i t y of multi-thread

    applications can be limitedby synchronization

    Synchronization primitives:LOCK prefix, XCHG

    Reduce synchronizationlatency for legacy software

    Greater thread s c a l a b i l i t y with Nehalem

    LOCK CMPXCHG Per fo rm ance

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    Pentium 4 Core 2 Nehalem

    RelativeLaten

    cy

    Nehalem: Next Generation Intel Microarchitecture

    Simultaneous Multi-Threading (SMT)

  • 8/11/2019 Intel Nehalem Processor

    25/65

    25

    Simultaneous Multi-Threading (SMT)

    SMT

    - Run 2 threads at the same time per core Take advantage of 4-wide execution

    engine

    - Keep it fed with multiple threads

    - Hide latency of a single thread Most p o w e r e f f i c i e n t performance

    feature

    - Very low die area cost

    - Can provide significant performancebenefit depending on application

    - Much more efficient than adding anentire core

    Nehalem advantages- Larger caches- Massive memory BW

    Sim u l t a n e o u s m u l t i - t h r e a d i n g en h a n ce sp e r f o r m a n c e a n d e n e r g y e f f i c i e n c y

    Time(proc.cy

    cles)

    w/o SMT SMT

    Note: Each box

    represents a

    processorexecution unit

    SMT Implementation Details

  • 8/11/2019 Intel Nehalem Processor

    26/65

    26

    SMT Implementation Details

    Multiple policies possible for implementation of SMT Replicated Duplicate state for SMT

    - Register state- Renamed RSB- Large page ITLB

    Partitioned Statically allocated between threads- Key buffers: Load, store, Reorder- Small page ITLB

    Competitively shared Depends on threads dynamic behavior

    - Reservation station- Caches- Data TLBs, 2nd level TLB

    Unaware- Execution units

    Agenda

  • 8/11/2019 Intel Nehalem Processor

    27/65

    27

    Nehalem Design Philosophy Enhanced Processor Core

    Performance Features

    Simultaneous Multi-Threading Feeding the EngineNew Cache HierarchyNew Platform Architecture

    Performance AccelerationVirtualizationNew Instructions

    Agenda

    Feeding the Execution Engine

  • 8/11/2019 Intel Nehalem Processor

    28/65

    28

    Feeding the Execution Engine

    Powerful 4-wide dynamic execution engine

    Need to keep providing fuel to the execution engine

    Nehalem Goals

    - Lo w l a t e n cy to retrieve data Keep execution engine fed w/o stalling

    - High data b a n d w i d t h Handle requests from multiple cores/threads seamlessly

    - S c a l a b i l i t y Design for increasing core counts

    Combination of great ca ch e h i e r a r c h y and n e w p l a t f o r m

    N e h a l e m d e s i g n e d t o f e e d t h e e x e cu t i o n e n g i n e

    Nehalem: Next Generation Intel Microarchitecture

    Designed For Modularity

  • 8/11/2019 Intel Nehalem Processor

    29/65

    29

    Designed For Modularity

    Optimal price / performance / energy efficiency

    ptimal price / performance / energy efficiency

    for server desktop and mobile products

    or server desktop and mobile products

    DRAM

    QPI

    PI

    Core

    Uncore

    C

    O

    R

    E

    C

    O

    R

    E

    C

    O

    R

    E

    IMC QPIPowerPower

    &&

    ClockClock

    #QPI#QPILinksLinks

    ## memmemchannelschannels

    Size ofSize ofcachecache# cores# cores

    PowerPowerManageManage--mentment

    Type ofType ofMemoryMemory

    IntegratedIntegratedgraphicsgraphics

    Differentiation in the Uncore:

    2008

    2009 Servers & Desktops

    QPI

    L3 Cache

    QPI:

    Intel

    QuickPathInterconnect

    Intel Smart Cache Core Caches

  • 8/11/2019 Intel Nehalem Processor

    30/65

    30

    Intel Smart Cache Core Caches

    New 3-level Cache Hierarchy 1st level caches

    - 32kB Instruction cache

    - 32kB Data Cache Support more L1 misses in parallel

    than Core 2

    2nd level Cache

    - New cache introduced in Nehalem- Unified (holds code and data)- 256 kB per core

    - Pe r f o r m a n ce : Very low latency- S c a l a b i l i t y : As core countincreases, reduce pressure onshared cache

    Co r e

    2 5 6 k B

    L 2 Ca c h e

    3 2 k B L 1

    Da t a Ca c h e

    3 2 k B L 1

    I n s t . Ca c h e

    Nehalem: Next Generation Intel Microarchitecture

    Intel Smart Cache -- 3rd Level Cache

  • 8/11/2019 Intel Nehalem Processor

    31/65

    31

    Intel Smart Cache 3 Level Cache

    New 3rd level cache

    Shared across all cores

    Size depends on # of cores- Quad-core: Up to 8MB- S c a l a b i l i t y :

    Built to vary size with varied

    core counts Built to easily increase L3 size

    in future parts

    Inclusive cache policy forbest p e r f o r m a n ce

    - Data residing in L1/L2 m u s tbe present in 3rd level cache

    L3 Cache

    Co r e

    L 2 Ca c h e

    L 1 Ca c h e s

    Co r e

    L 2 Ca c h e

    L 1 Ca c h e s

    Co r e

    L 2 Ca c h e

    L 1 Ca c h e s

    Inclusive vs. Exclusive Caches

  • 8/11/2019 Intel Nehalem Processor

    32/65

    32

    c us e s c us e Cac esCache Miss

    E x c l u s i v e I n c l u s i v e

    Core

    0

    Core

    1

    Core

    2

    Core

    3

    L3 Cache

    Core

    0

    Core

    1

    Core

    2

    Core

    3

    L3 Cache

    Data request from Core 0 misses Core 0s L1 and L2

    Request sent to the L3 cache

    Inclusive vs. Exclusive Caches

  • 8/11/2019 Intel Nehalem Processor

    33/65

    33

    Cache Miss

    E x c l u s i v e I n c l u s i v e

    Core

    0

    Core

    1

    Core

    2

    Core

    3

    L3 Cache

    Core

    0

    Core

    1

    Core

    2

    Core

    3

    L3 Cache

    Core 0 looks up the L3 Cache

    Data not in the L3 Cache

    MISS! MISS!

    Inclusive vs. Exclusive Caches

  • 8/11/2019 Intel Nehalem Processor

    34/65

    34

    Cache Miss

    E x c l u s i v e I n c l u s i v e

    Core

    0

    Core

    1

    Core

    2

    Core

    3

    L3 Cache

    Core

    0

    Core

    1

    Core

    2

    Core

    3

    L3 CacheMISS! MISS!

    Must check other cores Guaranteed data is not on-die

    Greater s c a l a b i l i t y from inclusive approach

    Inclusive vs. Exclusive Caches

  • 8/11/2019 Intel Nehalem Processor

    35/65

    35

    Cache Hit

    E x c l u s i v e I n c l u s i v e

    Core

    0

    Core

    1

    Core

    2

    Core

    3

    L3 Cache

    Core

    0

    Core

    1

    Core

    2

    Core

    3

    L3 CacheHIT! HIT!

    No need to check other cores Data could be in another core

    BUT Nehalem is smart

    Nehalem: Next Generation Intel Microarchitecture

    Inclusive vs. Exclusive Caches

  • 8/11/2019 Intel Nehalem Processor

    36/65

    36

    Cache Hit

    I n c l u s i v e

    Core

    0

    Core

    1

    Core

    2

    Core

    3

    L3 CacheHIT!

    Core valid bits limitunnecessary snoops

    Maintain a set of corevalid bits per cache line inthe L3 cache

    Each bit represents a core

    If the L1/L2 of a core maycontain the cache line, thencore valid bit is set to 1

    No snoops of cores areneeded if no bits are set

    If more than 1 bit is set,line cannot be in Modified

    state in any core

    0 0 0 0

    Inclusive vs. Exclusive Caches

  • 8/11/2019 Intel Nehalem Processor

    37/65

    37

    Read from other core

    E x c l u s i v e I n c l u s i v e

    Core

    0

    Core

    1

    Core

    2

    Core

    3

    L3 Cache

    Core

    0

    Core

    1

    Core

    2

    Core

    3

    L3 CacheMISS! HIT!

    Must check all other cores Only need to check the core

    whose core valid bit is set

    0 0 1 0

    Todays Platform Architecture

  • 8/11/2019 Intel Nehalem Processor

    38/65

    38

    y

    Front-Side Bus Evolution

    ICHICH

    CPUCPU

    CPUCPU

    CPUCPU

    CPUCPU

    MCHMCH

    memory

    ICHICH

    CPUCPU

    CPUCPU

    CPUCPU

    CPUCPU

    MCHMCH

    memory

    ICHICH

    CPUCPU

    CPUCPU

    CPUCPU

    CPUCPU

    MCHMCH

    memory

    Nehalem-EP Platform Architecture

  • 8/11/2019 Intel Nehalem Processor

    39/65

    39

    Integrated Memory Controller- 3 DDR3 channels per socket- Massive memory b a n d w i d t h

    - Memory Bandwidth scales with# of processors

    - Very l o w m e m o r y la t e n c y Intel QuickPath Interconnect

    (QPI)- New point-to-point interconnect- Socket to socket connections- Socket to chipset connections- Build s c a l a b l e solutions

    Nehalem

    EP

    Nehalem

    EP

    Tylersburg

    EP

    Si g n i f i c a n t p e r f o r m a n ce l e a p f r o m n e w p l a t f o r m

    Nehalem: Next Generation Intel Microarchitecture

    Intel QuickPath Interconnect

  • 8/11/2019 Intel Nehalem Processor

    40/65

    40

    Nehalem introduces newIntel QuickPathInterconnect (QPI)

    H ig h b a n d w i d t h , l o wl a t e n c y point to pointinterconnect

    Up to 6.4 GT/sec initially

    - 6.4 GT/sec -> 12.8 GB/sec- Bi-directional link -> 25.6

    GB/sec per link

    - Future implementations at

    even higher speeds Highly s c a l a b l e for systems

    with varying # of sockets

    Nehalem

    EPNehalem

    EP

    IOH

    memoryCPU CPU

    CPU CPU

    IOH

    memory

    memory

    memory

    Nehalem: Next Generation Intel Microarchitecture

    Integrated Memory Controller (IMC)

  • 8/11/2019 Intel Nehalem Processor

    41/65

    41

    Memory controller optimized per marketsegment Initial Nehalem products

    - Native DDR3 IMC- Up to 3 channels per socket- Speeds up to DDR3-1333

    Massive m e m o r y b a n d w i d t h

    - Designed for l o w l a t e n c y - Support RDIMM and UDIMM- RAS Features

    Future products

    - S c a l a b i l i t y Vary # of memory channels Increase memory speeds Buffered and Non-Buffered solutions

    - Market specific needs Higher memory capacity Integrated graphics

    Nehalem

    EPNehalem

    EP

    Tylersburg

    EP

    DDR3 DDR3

    Si g n i f i c a n t p e r f o r m a n ce t h r o u g h n e w I M C

    Nehalem: Next Generation Intel Microarchitecture

    IMC Memory Bandwidth (BW)

  • 8/11/2019 Intel Nehalem Processor

    42/65

    42

    3 memory channels per socket

    Up to DDR3-1333 at launch- Massive m em o r y BW - HEDT: 32 GB/sec peak- 2S server: 64 GB/sec peak

    S c a l a b i l i t y - Design IMC and core to takeadvantage of BW

    - Allow performance to scale withcores

    Core enhancements Support more cache misses per

    core Aggressive hardware prefetching

    w/ throttling enhancements

    Example IMC Features Independent memory channels Aggressive Request Reordering

    HarpertownHarpertown

    1600 FSB1600 FSB0

    1

    2

    3

    4

    NehalemNehalem

    EPEP

    3x DDR33x DDR3

    --13331333

    2 socket Streams Triad

    >4Xbandwidth

    Ma ss iv e m em o r y BW p r o v i d e s p e r f o r m a n ce a n d s ca la b i l i t y

    Nehalem: Next Generation Intel Microarchitecture

    Source: Internal measurements

    Non-Uniform Memory Access (NUMA)

  • 8/11/2019 Intel Nehalem Processor

    43/65

    43

    FSB architecture- All memory in one location Starting with Nehalem

    - Memory located in multiple

    places Latency to memorydependent on location

    Local memory- Highest BW- Lowest latency

    Remote Memory- Higher latency

    Nehalem

    EPNehalem

    EP

    Tylersburg

    EP

    En s u r e s o f t w a r e i s N UMA - o p t i m i z e d f o r b e s t p e r f o r m a n c e

    Nehalem: Next Generation Intel Microarchitecture

    Local Memory Access

  • 8/11/2019 Intel Nehalem Processor

    44/65

    44

    CPU0 requests cache line X, not present in any CPU0 cache

    - CPU0 requests data from its DRAM- CPU0 snoops CPU1 to check if data is present Step 2:

    - DRAM returns data- CPU1 returns snoop response

    Local memory latency is the maximum latency of the two responses

    Nehalem optimized to keep key latencies close to each other

    CPU0 CPU1QPI

    DRAMDRAM

    Nehalem: Next Generation Intel Microarchitecture

    QPI:

    Intel

    QuickPathInterconnect

    Remote Memory Access

  • 8/11/2019 Intel Nehalem Processor

    45/65

    45

    CPU0 requests cache line X, not present in any CPU0 cache

    - CPU0 requests data from CPU1- Request sent over QPI to CPU1- CPU1s IMC makes request to its DRAM- CPU1 snoops internal caches

    - Data returned to CPU0 over QPI Remote memory latency a function of having a low latencyinterconnect

    CPU0 CPU1QPI

    DRAMDRAM

    QPI:

    Intel

    QuickPathInterconnect

    Memory Latency Comparison

  • 8/11/2019 Intel Nehalem Processor

    46/65

    46

    Lo w m em o r y l a t en c y critical to high performance Design integrated memory controller for low latency Need to optimize both local and remote memory latency Nehalem delivers

    - Huge reduction in local memory latency- Even remote memory latency is fast

    Effective memory latency depends per application/OS- Percentage of local vs. remote accesses- NHM has lower latency regardless of mix

    Relativ e Memory Latency Comparison

    0.00

    0.20

    0.40

    0.60

    0.80

    1.00

    Harpertown (FSB 1600) Nehalem (DDR3-1333) Local Nehalem (DDR3-1333) Remote

    RelativeMemoryLatency

    Nehalem: Next Generation Intel

    Microarchitecture

    Agenda

  • 8/11/2019 Intel Nehalem Processor

    47/65

    47

    Nehalem Design Philosophy Enhanced Processor CorePerformance FeaturesSimultaneous Multi-Threading

    Feeding the EngineNew Cache HierarchyNew Platform Architecture

    Performance AccelerationVirtualizationNew Instructions

    Virtualization

  • 8/11/2019 Intel Nehalem Processor

    48/65

    48

    To get best virtualized p e r f o r m a n ce - Have best native performance- Reduce:

    # of transitions into/out of virtual machine

    Latency of transitions Nehalem virtualization features

    - Reduced latency for transitions- Virtual Processor ID (VPID) to reduce effective cost of transitions

    - Extended Page Table (EPT) to reduce # of transitions

    Great virtualization performance w/ Nehalem

    Nehalem: Next Generation Intel Microarchitecture

    Virtualization Enhancements

  • 8/11/2019 Intel Nehalem Processor

    49/65

    49

    0%

    20%

    40%

    60%

    80%

    100%

    Relative

    Latenc

    y

    Merom Penryn Nehalem

    >40%>40%FasterFaster

    Greatly improved Virtualization performance

    reatly improved Virtualization performance

    Virtualization Latency

    irtualization Latency

    Time to enter and exit virtual machine

    Virtual

    ProcessorID (VPID)

    ExtendedPage

    Tables (EPT)

    Reduce frequency of

    invalidation of TLB entries

    Hardware to translateguest to host physicaladdress

    Previously done in VMMsoftware

    Removes #1 cause ofexits from virtual machine

    Allows guests full controlover page tables

    Architecture:rchitecture:

    Round Trip Virtualization Latency

    Source: pre-silicon/post-silicon measurements

    Nehalem: Next Generation Intel Microarchitecture

    Extended Page Tables (EPT) Motivation

  • 8/11/2019 Intel Nehalem Processor

    50/65

    50

    Guest

    OS

    VM1

    VMM

    CR3

    Guest Page Table

    CR3

    Active Page Table

    A VMM needs to protectphysical memory Multiple Guest OSs share

    the same physicalmemory

    Protections areimplemented throughpage-table virtualization

    Page table virtualizationaccounts for asignificant portion ofvirtualization overheads VM Exits / Entries

    The goal of EPT is to

    reduce these overheads

    Guest page table changescause exits into the VMM

    VMM maintains the activepage table, which is used

    by the CPU

    EPT Solution

  • 8/11/2019 Intel Nehalem Processor

    51/65

    51

    Intel 64 Page Tables

    - Map Guest Linear Address to Guest Physical Address- Can be read and written by the guest OS

    New EPT Page Tables under VMM Control

    - Map Guest Physical Address to Host Physical Address

    - Referenced by new EPT base pointer No VM Exits due to Page Faults, INVLPG or CR3

    accesses

    Intel

    64

    PageTables

    Guest

    Linear

    Address

    EPT

    PageTables

    CR3

    Guest

    Physical

    Address

    EPT

    Base Pointer

    Host

    Physical

    Address

    Extending Performance and Energy Efficiency- SSE4 . 2 I n s t r u c t i o n Se t A r c h i t e c t u r e ( I SA ) Le a d e r s h i p i n 2 0 0 8

  • 8/11/2019 Intel Nehalem Processor

    52/65

    52

    SSE4.2SSE4.2(Nehalem Core)(Nehalem Core)

    STTNISTTNIe.g. XMLe.g. XML

    accelerationacceleration

    POPCNTPOPCNTe.g. Genomee.g. Genome

    MiningMining

    ATAATA(Application(Application

    TargetedTargetedAccelerators)Accelerators)

    SSE4.1SSE4.1((PenrynPenryn Core)Core)

    SSE4SSE4(45nm CPUs)(45nm CPUs)

    CRC32CRC32e.g.e.g. iSCSIiSCSIApplicationApplication

    New CommunicationsNew Communications

    CapabilitiesCapabilities

    Hardware based CRCinstructionAccelerated Network

    attached storageImproved power efficiencyfor Software I-SCSI, RDMA,and SCTP

    Accelerated SearchingAccelerated Searching

    & Pattern Recognition& Pattern Recognitionof Large Data Setsof Large Data Sets

    Improved performance forGenome Mining,

    Handwriting recognition.Fast Hamming distance /Population count

    AcceleratedAcceleratedString and TextString and Text

    ProcessingProcessing

    Faster XML parsing

    Faster search and patternmatchingNovel parallel data matchingand comparison operations

    STTNI

    ATA

    W h a t s h o u l d t h e a p p l i c a t i o n s , O S a n d VM M v e n d o r s d o ?:

    U n d e r s t a n d t h e b e n e f i t s & t a k e ad v a n t a g e o f n e w i n s t r u c t i o n s in 2 0 0 8 .

    P r o v i d e u s f e e d b a c k o n i n s t r u c t i o n s I SV w o u l d l i k e t o s ee f o r

    n e x t g e n e r a t i o n o f a p p l ic a t i o n s

    STTNI - STring & Text New InstructionsOp e r a t e s o n s t r i n g s o f b y t e s o r w o r d s ( 1 6 b )

  • 8/11/2019 Intel Nehalem Processor

    53/65

    53

    Equal Each Instruction

    True for each character in Src2 ifsame position in Src1 is equalSr c1: Test \ t daySr c2: t ad t seTMask: 01101111

    Equal Ordered Instruction

    Finds the start of a substring (Src1)within another string (Src2)Src1: ABCA0XYZSrc2: S0BACBABMask: 00000010

    Equal Any Instruction

    True for each character in Src2if any character in Src1 matchesSr c1: Exampl e\ n

    Sr c2: at ad t sTMask: 10100000

    Ranges Instruction

    True if a character in Src2 is inat least one of up to 8 rangesin Src1Sr c1: AZ 0 9zzzSr c2: t aD t seTMask: 00100001

    Projected 3.8x kernel speedup on XML parsing &2.7x savings on instruction cycles

    STTNI MODEL

    x x xx xxx Tx x xx Txx x

    x T xx xxx x

    x x xx xTx x

    x x Fx xxx x

    x x xx xxT x

    x x xT xxx xF x xx xxx x

    t da st Te

    T

    st

    e

    ad\t

    y

    Check

    each bit

    in the

    diagonal

    Source1

    (XMM)

    Source2 (XMM / M128)

    IntRes1

    0 1 01 111 1

    Bit 0

    Bit 0

    STTNI ModelSource2 (XMM / M128)Source2 (XMM / M128)

    EQUAL ANY EQUAL EACH

  • 8/11/2019 Intel Nehalem Processor

    54/65

    54

    AND the results

    along each

    diagonal

    x x xx xxx Tx x xx Txx x

    x T xx xxx x

    x x xx xTx x

    x x Fx xxx xx x xx xxT x

    x x xT xxx xF x xx xxx x

    t da st Te

    T

    st

    e

    ad\t

    y

    Checkeach bit inthediagonal

    S

    ource1

    (XMM)

    Source2 (XMM / M128)

    IntRes10 1 01 111 1

    Bit 0

    Bit 0

    F F FF FF FFF F FF FFF F

    F F FF FFF F

    T T FF FFF F

    F F FF FFF FF F FF FFF F

    F F FF FFF FF F FF FFF F

    a a dt t Ts

    E

    am

    x

    elp

    \n

    OR results downeach column

    Source1

    (XMM)

    Source2 (XMM / M128)

    IntRes11 1 00 000 0

    Bit 0

    Bit 0

    Source1(XM

    M)

    Bit 0fF F TfF TFF FfF T FfF FTF x

    fF F FfF xFT xfF F TfF xxF xfT fTfTfT xxx xfT fT xfT xxx xfT x xfT xxx xfT x xx xxx x

    Source2 (XMM / M128)

    S BA0 ABC B

    A

    CA

    B

    YX0

    Z

    IntRes10 0 00 100 0

    Bit 0

    F TFF FF TTF TTF FFF T

    T TTT TTT T

    T TFT TTT TF F FF FFF FF F TF FFF F

    F F FF FFF FT TTT TTT T

    t Da st Te

    A

    09

    Z

    zzz

    z

    Source1

    (X

    MM)

    Source2 (XMM / M128)

    IntRes10 100 000 1

    Bit 0

    First Compare

    does GE, next

    does LEAND GE/LE pairs of results

    OR those results

    Bit 0

    EQUAL ORDEREDRANGES

    AND the results

    along each

    diagonal

    Bit 0

    Source1(X

    MM)

    Example Code For strlen()

    STTNI Version

  • 8/11/2019 Intel Nehalem Processor

    55/65

    55

    int sttni_strlen(const char * src)

    {

    char eom_vals[32] = {1, 255, 0};

    __asm{

    mov eax, src

    movdqu xmm2, eom_vals

    xor ecx, ecx

    topofloop:

    add eax, ecx

    movdqu xmm1, OWORD PTR[eax]

    pcmpistri xmm2, xmm1, imm8

    jnz topofloop

    endofstring:

    add eax, ecx

    sub eax, srcret

    }

    }

    string equ [esp + 4]mov ecx,string ; ecx -> string

    test ecx,3 ; test if string is aligned on 32 bitsje short main_loopstr_misaligned:

    ; simple byte loop until string is alignedmov al,byte ptr [ecx]add ecx,1test al,al

    je short byte_3

    test ecx,3jne short str_misalignedadd eax,dword ptr 0 ; 5 byte nop to align label belowalign 16 ; should be redundant

    main_loop:mov eax,dword ptr [ecx] ; read 4 bytesmov edx,7efefeffhadd edx,eax

    xor eax,-1xor eax,edxadd ecx,4test eax,81010100h

    je short main_loop; found zero byte in the loopmov eax,[ecx - 4]test al,al ; is it byte 0

    je short byte_0test ah,ah ; is it byte 1

    je short byte_1test eax,00ff0000h ; is it byte 2

    je short byte_2test eax,0ff000000h

    ; is it byte 3je short byte_3jmp short main_loop

    ; taken if bits 24-30 are clear and bit; 31 is setbyte_3:

    lea eax,[ecx - 1]mov ecx,stringsub eax,ecxret

    byte_2:lea eax,[ecx - 2]mov ecx,stringsub eax,ecx

    retbyte_1:lea eax,[ecx - 3]mov ecx,stringsub eax,ecxret

    byte_0:lea eax,[ecx - 4]

    mov ecx,stringsub eax,ecxret

    strlen endpend

    STTNI Version

    Cu r r e n t Co d e : M i n im um o f 1 1 i n s t r u c t i o n s ; I n n e r l o o p p r o ce ss es 4 b y t e s w i t h 8 i n s t r u c t io n s

    STTN I Co d e : M i n im um o f 1 0 i n s t r u c t i o n s ; A s in g l e i n n e r l o op p r o ce ss es 1 6 b y t e s w i t h o n l y 4i n s t r u c t i o n s

    ATA - Application Targeted Accelerators

    CRC32 POPCNTPOPCNT d t i th b f

  • 8/11/2019 Intel Nehalem Processor

    56/65

    56

    One register maintains the running CRC value as asoftware loop iterates over data.Fixed CRC polynomial = 11EDC6F41h

    Replaces complex instruction sequences for CRC in

    Upper layer data protocols: iSCSI, RDMA, SCTP

    SRC Data 8/16/32/64 bit

    Old CRC

    63 3132 0

    0 New CRC

    63 3132 0

    DST

    DST

    0

    X

    Accumulates a CRC32 value using the iSCSI polynomial

    En a b l e s e n t e r p r i s e c l a s s d a t a a ss u r a n ce w i t h h i g h d a t a r a t e s

    i n n e t w o r k e d s t o r a g e i n a n y u s e r e n v i r o n m e n t .

    0 1 0 . . . 0 0 1 1

    63 1 0 Bit

    0x3

    RAX

    RBX

    0ZF=? 0

    POPCNT determines the number of nonzero

    bits in the source.

    POPCNT is useful for speeding up fast matching indata mining workloads including: DNA/Genome Matching Voice Recognition

    ZFlag set if result is zero. All other flags (C,S,O,A,P)reset

    CRC32 Preliminary Performance

  • 8/11/2019 Intel Nehalem Processor

    57/65

    57

    crc32c_sse42_optimized_version(uint32 crc, unsigned

    char const *p, size_t len)

    { // Assuming len is a multiple of 0x10

    asm("pusha");

    asm("mov %0, %%eax" :: "m" (crc));

    asm("mov %0, %%ebx" :: "m" (p));

    asm("mov %0, %%ecx" :: "m" (len));

    asm("1:");

    // Processing four byte at a time: Unrolled four times:

    asm("crc32 %eax, 0x0(%ebx)");

    asm("crc32 %eax, 0x4(%ebx)");

    asm("crc32 %eax, 0x8(%ebx)");

    asm("crc32 %eax, 0xc(%ebx)");

    asm("add $0x10, %ebx")2;

    asm("sub $0x10, %ecx");

    asm("jecxz 2f");

    asm("jmp 1b");

    asm("2:");

    asm("mov %%eax, %0" : "=m" (crc));

    asm("popa");

    return crc;

    }}

    Preliminary tests involved Kernel code implementing

    CRC algorithms commonly used by iSCSI drivers.

    32-bit and 64-bit versions of the Kernel under test

    32-bit version processes 4 bytes of data using

    1 CRC32 instruction

    64-bit version processes 8 bytes of data using1 CRC32 instruction

    Input strings of sizes 48 bytes and 4KB used for the

    test

    32 - bit 64 - bit

    InputDataSize =48 bytes

    6.53 X 9.85 X

    InputDataSize = 4KB

    9.3 X 18.63 X

    CRC32 optimized Code

    P r e l im i n a r y Re su l t s sh o w CRC3 2 i n s t r u c t i o n o u t p e r f o r m i n g t h e

    f a s t e s t CRC3 2 C so f t w a r e a lg o r i t h m b y a b i g m a r g i n

    Tools Support of New Instructions

    - Intel C++ Compiler 10 x supports the new instructions

  • 8/11/2019 Intel Nehalem Processor

    58/65

    58

    - Intel C++ Compiler 10.x supports the new instructions SSE4.2 supported via intrinsics

    Inline assembly supported on both IA-32 and Intel64 targets

    Necessary to include required header files in order to access intrinsics for Supplemental SSE3

    for SSE4.1

    for SSE4.2

    - Intel Library Support XML Parser Library using string instructions will beta Spring 08 and release product

    in Fall 08

    IPP is investigating possible usages of new instructions

    - Microsoft* Visual Studio 2008 VC++ SSE4.2 supported via intrinsics

    Inline assembly supported on IA-32 only

    Necessary to include required header files in order to access intrinsics

    for Supplemental SSE3 for SSE4.1

    for SSE4.2

    VC++ 2008 tools masm, msdis, and debuggers recognize the new instructions

    Software Optimization Guidelines

  • 8/11/2019 Intel Nehalem Processor

    59/65

    59

    Most optimizations for Intel

    Core

    microarchitecturestill hold

    Examples of new optimization guidelines:- 16-byte unaligned loads/stores- Enhanced macrofusion rules- NUMA optimizations

    Nehalem SW Optimization Guide will be published

    Intel C++ Compiler will support settings forNehalem optimizations

    Nehalem: Next Generation Intel

    Microarchitecture

    Summary

  • 8/11/2019 Intel Nehalem Processor

    60/65

    60

    Nehalem - The Intel 45 nm Tock Processor Designed for

    - Pow e r Ef f i c i e n c y

    - S c a l a b i l i t y - Pe r f o r m a n ce

    Enhanced Processor Core

    Brand New Platform Architecture Extending ISA Leadership

    Additional sources ofinformation on this topic:

  • 8/11/2019 Intel Nehalem Processor

    61/65

    61

    information on this topic:

    DPTS003-High End Desktop Platform Overview for the NextGeneration Intel Microarchitecture (Nehalem) Processor

    -April 3 11:10-12:00; Room 3C+3D, 3rd floor

    NGMS002-Upcoming Intel 64 Instruction Set ArchitectureExtensions- April 3 16:00 17:50; Auditorium, 3rd floor

    NGMC001-Next Generation Intel Microarchitecture (Nehalem) andNew Instruction Extensions - Chalk Talk- April 3 17:50 18:30; Auditorium, 3rd floor

    Demos in the Advance Technology Zone on the 3rd floor

    More web based info: http://www.intel.com/technology/architecture-silicon/next-gen/index.htm

    Session Presentations - PDFs

    http://www.intel.com/technology/architecture-silicon/next-gen/index.htmhttp://www.intel.com/technology/architecture-silicon/next-gen/index.htmhttp://www.intel.com/technology/architecture-silicon/next-gen/index.htmhttp://www.intel.com/technology/architecture-silicon/next-gen/index.htmhttp://www.intel.com/technology/architecture-silicon/next-gen/index.htm
  • 8/11/2019 Intel Nehalem Processor

    62/65

    62

    The PDF of this Session presentation is available from ourIDF Content Catalog:

    https://intel.wingateweb.com/SHchina/catalog/controller/catalog

    These can also be found from links on www.intel.com/idf

    Please Fill out theSession Evaluation Form

    https://intel.wingateweb.com/SHchina/catalog/controller/cataloghttp://www.intel.com/idfhttp://www.intel.com/idfhttps://intel.wingateweb.com/SHchina/catalog/controller/catalog
  • 8/11/2019 Intel Nehalem Processor

    63/65

    63

    Session Evaluation Form

    Thank You for your input, we use it to improve futureIntel Developer Forum events

    Put in your lucky draw coupon to win the prize at theend of the track!

    You must be present to win!

    Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,

    EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTEDBY THIS DOCUMENT EXCEPT AS PROVIDED IN INTELS TERMS AND CONDITIONS OF SALE FOR SUCH

  • 8/11/2019 Intel Nehalem Processor

    64/65

    64

    BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTELS TERMS AND CONDITIONS OF SALE FOR SUCH

    PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIEDWARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIESRELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANYPATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FORUSE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.

    Intel may make changes to specifications and product descriptions at any time, without notice.

    All products, dates, and figures specified are preliminary based on current expectations, and are subject tochange without notice.

    Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, whichmay cause the product to deviate from published specifications. Current characterized errata are availableon request.

    Nehalem, Harpertown, Tylersburg, Merom, Dothan, Penryn, Bloomfield and other code names featured are

    used internally within Intel to identify products that are in development and not yet publicly announced forrelease. Customers, licensees and other third parties are not authorized by Intel to use code names inadvertising, promotion or marketing of any product or services and any such use of Intel's internal codenames is at the sole risk of the user

    Performance tests and ratings are measured using specific computer systems and/or components andreflect the approximate performance of Intel products as measured by those tests. Any difference insystem hardware or software design or configuration may affect actual performance.

    Intel, Intel Inside, Xeon, Pentium, Core and the Intel logo are trademarks of Intel Corporation in the UnitedStates and other countries.

    *Other names and brands may be claimed as the property of others.

    Copyright 2008 Intel Corporation.

  • 8/11/2019 Intel Nehalem Processor

    65/65