Intel Processor Architecture-Core

  • 8/14/2019 Intel Processor Architecture-Core

    1/155

    Intel Core Microarchitecture

    Intel Software College

    Copyright 2006, Intel Corporation. All rights reserved.

    Intel Processor Micro-architecture - Core microarchitecture

    Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

    Objectives

    After completing this module, you will be able to describe:

    Components of an IA processor

    Working flow of the instruction pipeline

    Notable features of the architecture

    Agenda

    Introduction

    Knowledge preparation

    Notable features

    Micro-architecture tour

    Coding considerations

    Industrial Recognition

    PC Format, May 2006: Intel Strikes Back! Conroe is the name. Pistol-whipping Athlon 64s into burger meat is the game.

    Intel Regains Performance Crown, Anandtech: At 2.8 or 3.0GHz, a Conroe EE would offer even stronger performance than what we've seen here.

    Intel Reveals Conroe Architecture, Extremetech: And not only was the Intel system running at 2.66GHz, a slower clock rate than the top Pentium 4, it was outpacing an overclocked Athlon 64 FX-60. Wrap your brain around that idea for a bit.

    Conroe Benchmarks - Intel Showing Big Strength, HotHardware.com: Intel is poised to change the face of the desktop computing landscape.

    Intel Dishes the Knockout Punch to AMD with Conroe, GD Hardware.com: the results were far more than we could hope for and it'll be amusing to see AMD's response to this beat-down session.

    Intel's Next Generation Microarchitecture Unveiled, Real World Tech: Just as important as the technical innovations in Core MPUs, this microarchitecture will have a profound impact on the industry.

    Performance Summary

    Intel Core Microarchitecture dramatically boosts Intel platform performance

    Conroe & Woodcrest drive clear Desktop/Server performance leadership

    Merom extends Intel Mobile performance leadership

    Intel Core Microarchitecture-based platforms set the bar in Performance and Energy Efficiency for the Multi-Core era

    Intel's 3rd generation dual-core (while competition stuck on 1st generation)

    New Intel high-performance engine: Wider, Smarter, Faster, More Efficient

    The Core Effect: Intel Core Microarchitecture ramp fuels broad roadmap accelerations

    Best Processor on the Planet: Energy-Efficient Performance (1)

    20% (Merom), 40% (Conroe), 80% (Woodcrest) Performance Boosts (1)

    (1) Based on SPECint*_rate_base2000

    Agenda

    Introduction

    Knowledge preparation

    Architecture VS Microarchitecture

    CISC VS RISC

    Performance Measurements

    Pipeline Design

    Power and Energy

    Chip Multi-Processing

    Notable features

    Micro-architecture tour

    Coding considerations

    Architecture and Micro-architecture

    What is Computer Architecture?

    Architecture is the set of features which are externally visible:

    Instruction set

    Registers

    Addressing modes

    Bus protocols

    Intel Architectures (IA)

    IA32/X86 (8-bit, 16-bit and 32-bit Integer architecture)

    X87 (Floating Point extension)

    MMX (Multi-Media extension)

    SSE, SSE2, SSE3 (Streaming SIMD Extensions)

    Intel 64/EM64T (64-bit Integer extension of IA32)

    IA64 (Intel new 64-bit architecture)

    Itanium/Itanium2 processor family

    Intel Architecture History

    Architecture: Instruction set definition and compatibility

    Examples: EPIC* (Itanium), IA-32, IXA* (XScale)

    Microarchitecture: Hardware implementation maintaining instruction set compatibility with high-level architecture

    Examples: P5, P6, Intel NetBurst, Banias

    Processors: Productized implementation of Microarchitecture

    Examples: Pentium, Pentium Pro, Pentium II/III, Pentium 4, Pentium D, Xeon, Pentium M

    * IXA: Intel Internet Exchange Architecture; EPIC: Explicitly Parallel Instruction Computing

    Mobile Microarchitecture + Intel NetBurst + New Innovations

    Intel Core Microarchitecture

    Processors: Intel Core 2 Duo/Quad/Extreme processors

    RISC Approach to CPU design

    Optimize H/W for common basic operations

    Fixed instruction length

    Shorter Execution Pipeline

    Ease of Instruction Level Parallelism

    Large number of registers

    Fewer memory accesses

    Load/Store architecture

    Shorter Execution Pipeline

    Ease of advancing Loads

    Branch Hints

    Reduce pipeline flush events

    Exotic stuff to be implemented in S/W with minimal H/W support

    No complex H/W instructions

    Handle exceptional conditions in S/W

    Examples: MIPS, IBM Power and PowerPC, Sun Sparc

    Achieve maximum performance by right partitioning between H/W and S/W

    (RISC = Reduced Instruction Set Computers)

    CISC Approach to CPU design

    Rich architecture

    Variable length instructions.

    Complex addressing modes.

    On-chip HW / SW partitioning required

    H/W keeps executing simple stuff

    Complex instructions are emulated using u-code routines from ROM

    More instructions treated as simple as more H/W is available

    COMPATIBILITY has some major advantages:

    Large (and forever increasing) software base

    Code development tools

    Expertise

    H/W - S/W spiral

    Example: Intel IA32, Motorola 680X0

    Maximize information passed to the HW

    (CISC = Complex Instruction Set Computers)

    Performance Measurement

    Performance is the reciprocal of the Time of execution:

    \[ \text{Performance} = \frac{1}{\text{Time of Execution}} = \frac{1}{L \cdot CPI \cdot T_c} \]

    Where:

    L = Code Length (# of machine instructions)

    CPI = Clock cycles Per Instruction

    Tc = Clock period (nSecs)

    Substitute:

    IPC = Instructions Per Cycle = 1/CPI

    F = Frequency = 1/Tc

    \[ \text{Performance} = \frac{IPC \cdot F}{L} \]

    Improve ILP (raises IPC)

    Improve Timing (raises F)

    Arch Enhancements (reduce L)
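As a quick numeric check of the performance relation on this slide, the sketch below evaluates both forms (1 / (L * CPI * Tc) and IPC * F / L). The function names and the workload numbers are illustrative assumptions, not figures from the deck.

```python
def execution_time(code_length, cpi, clock_period_s):
    """Time of execution = L * CPI * Tc, in seconds."""
    return code_length * cpi * clock_period_s


def performance(code_length, ipc, frequency_hz):
    """Performance = IPC * F / L, i.e. program runs per second."""
    return ipc * frequency_hz / code_length


# Hypothetical workload: 1e9 machine instructions on a 2 GHz clock.
L = 1e9
baseline = performance(L, ipc=1.0, frequency_hz=2e9)
wider = performance(L, ipc=2.0, frequency_hz=2e9)   # better ILP, same clock
print(baseline, wider)
```

Doubling IPC at a fixed clock doubles performance, which is the lever the "Improve ILP" callout refers to.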

    Performance Measurement (cont.)

    Performance considerations:

    Which Code/Application to run?

    Which OS?

    Which other components in the platform?

    Under which thermal conditions?

    Multithreading? Multiprocessing?

    Benchmark examples

    Industry Standard

    SPEC (ISPEC, FSPEC)

    TPC

    Commercial

    SysMark

    MobileMark

    PCMark

    Sandra

    ScienceMark

    Applications

    Video (Windows Media encoder, DivX)

    Audio (Lame MP3)

    Compression (RAR)

    Content creation (3DSM, Photoshop, Premiere)

    Latest Games (Doom III, FarCry, but changes fast)

    Specific industries use specific benchmarks

    Linux compilation, POVRay, LinPack, lmbench

    Design Considerations for Different Market Segments

    Constraints:

    Desktop: thermally and area constrained

    Extreme: unconstrained

    Value: very area constrained

    Mobile: thermally, energy and area constrained

    Servers: thermally and energy constrained

    Micro-architecture is the Art of Tradeoffs between:

    Schedule

    Requirements / Standards

    Performance

    Features

    Power / Energy

    Area / Cost

    Design Metrics

    IPC = Instructions per Cycle

    The more the better

    Latency (same as Response Time)

    The time interval between when any request for data is made and when the data transfer completes

    The less the better

    Throughput

    The amount of work completed by the system per unit of time (ops/sec)

    The more the better

    CPU Pipeline

    Break the work into smaller pieces

    Four basic stages of instruction life

    Fetch - bring instruction to core

    Decode - read operands from register

    Execute - perform the operation

    Writeback - save result to register

    Execution timing of simple instructions (legend: op src1, src2 -> dst)

    add eax, ebx -> eax    F D E W

    sub ecx, edx -> ecx    F D E W

    Increased throughput means an increased number of completed instructions per cycle

    Pipeline Design - Explore Parallelism

    A new instruction does not always depend on the previous one

    Can start new instruction before previous one is finished

    ...if different stages use different H/W resources

    Run instructions in parallel (pipeline)

    add eax, ebx -> eax    F D E W
    sub ecx, edx -> ecx      F D E W
    or  edi, esi -> edi        F D E W

    Need to balance pipe stages

    Each stage should take same time for best throughput and utilization

    Clock cycle is determined by the longest path!

    [Figure: four overlapping instructions, each passing through Fetch, Decode, Exec and WB]
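The overlap described on this slide can be sketched with a few lines of arithmetic. This is a minimal model that assumes the four-stage F/D/E/W pipe from the previous slide, one instruction entering per cycle, and no stalls; all function names are hypothetical.

```python
STAGES = ["F", "D", "E", "W"]  # Fetch, Decode, Execute, Writeback


def unpipelined_cycles(n_instructions):
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * len(STAGES)


def pipelined_cycles(n_instructions):
    """One instruction enters per cycle once the pipe is full (no stalls assumed)."""
    if n_instructions == 0:
        return 0
    return len(STAGES) + n_instructions - 1


def clock_period(stage_delays_ns):
    """The clock must be slow enough for the slowest stage (the longest path)."""
    return max(stage_delays_ns)


def timing_diagram(instructions):
    """One row per instruction, shifted right by one stage per row."""
    return "\n".join(
        f"{text:<16}" + "  " * i + " ".join(STAGES)
        for i, text in enumerate(instructions)
    )


program = ["add eax, ebx", "sub ecx, edx", "or  edi, esi"]
print(timing_diagram(program))
print(unpipelined_cycles(len(program)), "cycles unpipelined")
print(pipelined_cycles(len(program)), "cycles pipelined")
```

With unbalanced stage delays such as 1.0, 1.5, 2.0 and 1.0 ns, `clock_period` returns 2.0 ns, so the three faster stages sit idle for part of every cycle. That is why the slide asks for balanced stages.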

    Pipeline Design - Fighting Stalls

    Data flow dependency (instructions output/input)

    Solved by bypasses, renaming etc

    Control flow dependencies

    Solved by branch prediction

    Others (Cache misses, long latency instructions)

    Solved by other dynamic scheduling techniques

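A data flow dependency of the kind listed on this slide (one instruction's output feeding a later instruction's input) can be detected with a simple scan. The sketch below is illustrative only: the tuple encoding `(dst, srcs)` and the function name are assumptions, and real hardware does this check in the rename and scheduling logic rather than in software.

```python
def raw_hazards(program):
    """Return (producer_index, consumer_index, register) for each read-after-write pair."""
    last_writer = {}
    hazards = []
    for i, (dst, srcs) in enumerate(program):
        for reg in srcs:
            if reg in last_writer:
                hazards.append((last_writer[reg], i, reg))
        last_writer[dst] = i
    return hazards


# (destination, sources) for: add eax,ebx ; sub ecx,eax ; or edi,esi
program = [
    ("eax", ("eax", "ebx")),
    ("ecx", ("ecx", "eax")),   # reads eax produced by instruction 0
    ("edi", ("edi", "esi")),   # independent of both
]
print(raw_hazards(program))
```

Only the second instruction has to wait for (or be bypassed from) the first; the third can proceed independently.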

    Race of CISC vs. RISC

    In modern CPUs, advanced micro-architecture techniques minimize the advantages of RISC over CISC

    Branch Prediction

    Reduces the effect of extra pipeline stages

    Register Renaming

    Effectively increases the number of registers

    Out Of Order

    Reduces the number of stalls caused by shortage of registers

    Speculative Execution

    Further reduces the number of stalls

    Power saving features

    Reduce the overhead when not needed
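To illustrate how register renaming "effectively increases the number of registers", here is a minimal sketch that gives every destination a fresh physical register. The `(dst, srcs)` encoding and the `p0, p1, ...` naming are assumptions for illustration; an actual register alias table also recycles physical registers at retirement, which this sketch omits.

```python
def rename(program, arch_regs):
    """Rewrite (dst, srcs) tuples so each write targets a fresh physical register."""
    mapping = {reg: f"p{i}" for i, reg in enumerate(arch_regs)}
    next_phys = len(arch_regs)
    renamed = []
    for dst, srcs in program:
        new_srcs = tuple(mapping[s] for s in srcs)   # read current mappings first
        mapping[dst] = f"p{next_phys}"               # then allocate a new target
        next_phys += 1
        renamed.append((mapping[dst], new_srcs))
    return renamed


program = [
    ("eax", ("ebx", "ecx")),
    ("ebx", ("edx",)),         # overwrites ebx, which instruction 0 still reads
    ("eax", ("edx", "esi")),   # overwrites eax, which instruction 0 also writes
]
print(rename(program, ["eax", "ebx", "ecx", "edx", "esi"]))
```

After renaming, the three instructions write three different physical registers, so the name conflicts on `eax` and `ebx` no longer force them to execute in order.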

    uop: Intel's Take on the CISC/RISC Race

    (CISC) instructions are translated into one or more (RISC) uops (micro-operations)

    Fixed format

    Wide and simple

    Temp registers

    Usually one uop per instruction

    Complex instructions can be thousands of uops

    Stores are divided into two uops (STA, store address, and STD, store data)

    Fusion plays games here
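The translation from instructions to uops can be pictured with a toy decoder. This is only a sketch: the uop names other than STA and STD, the `tmp0` temporary, and the bracket test for memory operands are all hypothetical, and fusion is deliberately ignored.

```python
def is_memory(operand):
    """Treat bracketed operands such as [ebx] as memory references."""
    return operand.startswith("[")


def decode(instruction):
    """Translate one (op, dst, src) instruction into a list of uop tuples."""
    op, dst, src = instruction
    if op == "mov" and is_memory(dst):
        # A store splits into store-address and store-data uops.
        return [("STA", dst), ("STD", src)]
    if op == "mov" and is_memory(src):
        return [("LOAD", dst, src)]
    if is_memory(src):
        # Load-and-operate: load into a temp register, then operate.
        return [("LOAD", "tmp0", src), (op.upper(), dst, "tmp0")]
    return [(op.upper(), dst, src)]


for ins in [("add", "eax", "ebx"), ("add", "eax", "[ecx]"), ("mov", "[ebx]", "eax")]:
    print(ins, "->", decode(ins))
```

The register-only `add` stays a single uop, while the memory forms expand into two, matching the "usually one uop per instruction" rule of thumb on the slide.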

    Power and Energy

    Maximum power (TDP):

    Cooling requirements

    Cooling solution

    Computer form factor and acoustic noise

    Average power

    Battery life

    Electricity bill

    General calculation:

    P = frequency * voltage^2 * activity factor * capacitance + leakage

    Reducing TDP

    Fewer transistors and wires

    Smaller transistors and wires

    Power features: less activity

    Low leakage transistors

    Reducing average power

    Energy efficiency

    Power states

    Lower leakage
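The general power calculation on this slide can be evaluated directly. The sketch below plugs in made-up numbers (they are assumptions chosen for round results, not measurements of any processor) to show why voltage is the strongest lever: the dynamic term scales with its square.

```python
def power_watts(frequency_hz, voltage_v, activity, capacitance_f, leakage_w):
    """P = frequency * voltage^2 * activity factor * capacitance + leakage."""
    return frequency_hz * voltage_v ** 2 * activity * capacitance_f + leakage_w


# Hypothetical operating point: 2 GHz, 1.2 V, 20% activity, 100 nF, 10 W leakage.
full = power_watts(2e9, 1.2, 0.2, 1e-7, 10.0)

# Same chip with frequency and voltage both reduced by 20%.
scaled = power_watts(1.6e9, 0.96, 0.2, 1e-7, 10.0)
print(round(full, 1), round(scaled, 1))
```

Halving the voltage alone would cut the dynamic term to a quarter, while halving the frequency alone only halves it.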

    Dual/Multi Core and SMT

    Put more than one core per package

    Architectural change:

    Software must be multi-threaded or multi-process

    but backward compatible with multiprocessor systems (MP)

    Several ways of implementing it

    All of them being used

    [Figure: alternative package layouts built from Core, LLC and I/O blocks]

    SMT: Run two (or more) threads on the same core, simultaneously

    Intel Approach

    While single core performance has increased due to clock speed, increased cache and improved ILP, the biggest performance increases have come from thread level parallelism.

    Q4 2000: Intel Pentium, 1 thread

    Q2 2003: Intel Pentium with HT, 2 threads

    Q2 2005: Intel Pentium D Processor, 2 threads

    Q3 2006: Intel Core 2 Duo, 2 threads

    Q4 2006: Intel QX6700*, 4 threads

    80 threads?

    [Figure legend: State, Execution Units, Cache, Bus]

    An Acronym Cheat Sheet of Parallel Computing

    CMP: Chip Multi Processor (two or more cores per package)

    Dual Core: two cores in same package

    Quad Core: four cores in same package

    DP: Dual Processor (two packages)

    MP: Multi Processor (four or more packages)

    SMT: Simultaneous Multi Threading (virtual multi core: HyperThreading)

    Agenda

    Introduction

    Knowledge preparation

    Notable features

    Wide Dynamic Execution

    Smart Memory Access

    Advanced Smart Cache

    Advanced Digital Media Boost

    Intelligent Power Capability

    Micro-architecture tour

    Coding considerations

    Intel Core Micro-architecture Notable Features

    Intel Wide Dynamic Execution

    14-stage efficient pipeline

    Wider execution path

    Advanced branch prediction

    Macro-fusion

    Roughly 15% of all instructions are conditional branches

    Macro-fusion fuses a comparison and jump to reduce micro-ops running down the pipeline

    Micro-fusion

    Merges the load and operation micro-ops into one micro-op

    64-Bit Support

    Merom, Conroe, and Woodcrest support EM64T

    [Figure: block diagram. Instruction Fetch and PreDecode; Instruction Queue; Decode with uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; three execution ports (ALU / Branch / MMX/SSE / FPmove, ALU / FAdd / MMX/SSE / FPmove, ALU / FMul / MMX/SSE / FPmove); Load; Store; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB up to 10.4 Gb/s. Width annotations: 4, 4, 5.]
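The effect of macro-fusion on micro-op count can be sketched as a pass over a list of mnemonics. The sketch assumes that every instruction is otherwise one micro-op and that only `cmp` or `test` followed directly by a conditional jump fuses; the real fusion conditions are more restrictive, and the function name is hypothetical.

```python
FUSIBLE_FIRST = {"cmp", "test"}


def is_conditional_jump(mnemonic):
    """jne, jz, jl, ... but not the unconditional jmp."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def count_uops(mnemonics):
    """Count micro-ops, treating each fusible compare+jump pair as one."""
    count = 0
    i = 0
    while i < len(mnemonics):
        fused = (
            mnemonics[i] in FUSIBLE_FIRST
            and i + 1 < len(mnemonics)
            and is_conditional_jump(mnemonics[i + 1])
        )
        i += 2 if fused else 1
        count += 1
    return count


loop_body = ["mov", "cmp", "jne", "add", "test", "jz"]
print(len(loop_body), "instructions ->", count_uops(loop_body), "micro-ops")
```

Since conditional branches are a sizeable share of the instruction stream, removing one micro-op per compare-and-branch pair frees decode and issue slots throughout the pipeline.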

    Intel Core Micro-architecture Notable Features (cont.)

    Intel Smart Memory Access

    Improved prefetching

    Memory disambiguation

    Advance load before a possible data dependency (pointer conflict)

    Earlier loads hide memory latencies
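Memory disambiguation can be pictured as "load first, verify later". The sketch below is a software analogy only, under the assumption that a load is issued ahead of an older store whose address is not yet known, and is re-executed if the two addresses turn out to match. The function name and the dictionary-as-memory model are hypothetical.

```python
def speculative_load(memory, store_addr, store_value, load_addr):
    """Return (loaded_value, replayed) for a load hoisted above an older store."""
    speculative = memory.get(load_addr, 0)   # load issued before the store resolves
    memory[store_addr] = store_value          # older store now executes
    if store_addr == load_addr:
        # Pointer conflict: the early load read stale data, so redo it.
        return memory[load_addr], True
    return speculative, False


mem = {0x10: 1, 0x20: 2}
print(speculative_load(mem, 0x10, 99, 0x20))   # no conflict: early load was safe
print(speculative_load(mem, 0x20, 77, 0x20))   # conflict: load is replayed
```

When the addresses differ, which is the common case, the load's latency overlaps with the store instead of waiting behind it.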

    Intel Core Micro-architecture Notable Features (cont.)

    Intel Advanced Smart Cache

    Multi-core optimization

    Shared between the two cores

    Advanced Transfer Cache architecture

    Reduced bus traffic

    Both cores have full access to the entire cache

    Dynamic Cache sizing

    Intel Core Micro-architecture Notable Features (cont.)

    Advantages of Shared Cache

    [Figure: CPU1 and CPU2 connected to Memory over the Front Side Bus (FSB), with a cache line shipped between them]

    Shipping an L2 cache line costs about half a memory access

    Intel Core Micro-architecture Notable Features (cont.)

    Advantages of Shared Cache (cont.)

    [Figure: CPU1 and CPU2 connected to Memory over the Front Side Bus (FSB), with the cache line held in the shared L2]

    L2 is shared:

    No need to ship cache line

    Intel Core Micro-architecture Notable Features (cont.)

    Intel Advanced Digital Media Boost

    Single Cycle SIMD Operation

    8 Single Precision Flops/cycle

    4 Double Precision Flops/cycle

    Wide Operations

    128-bit packed Add

    128-bit packed Multiply

    128-bit packed Load

    128-bit packed Store

    Support for Intel EM64T instructions

    [Figure: SIMD Operation (SSE/SSE2/SSE3/SSSE3). SOURCE operands X4..X1 and Y4..Y1 (bits 127..0) produce DEST X4opY4..X1opY1. Previous: an SSE/2/3 op spans clock cycle 1 and clock cycle 2. Core arch: the whole op completes in clock cycle 1.]
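The flops-per-cycle figures on this slide follow from the vector width. The short calculation below assumes, as an interpretation of the slide rather than something it states, that one 128-bit packed add and one 128-bit packed multiply can both complete each cycle.

```python
def flops_per_cycle(vector_bits, element_bits, packed_ops_per_cycle=2):
    """Lanes per vector times the number of packed operations issued per cycle."""
    lanes = vector_bits // element_bits
    return lanes * packed_ops_per_cycle


print(flops_per_cycle(128, 32))   # single precision
print(flops_per_cycle(128, 64))   # double precision
```

Four 32-bit lanes times two operations gives the 8 single-precision flops per cycle quoted above, and two 64-bit lanes times two gives the 4 double-precision flops.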

    Intel Core Micro-architecture Notable Features

    Intel Advanced Digital Media Boost

    Additional Media Instructions - Supplemental Streaming SIMD Extensions 3 (SSSE3)

    16 new packed integer instructions

    Targeting video encode/decode

    Significantly improved strings

    REP MOVS and REP STOS: ~8 bytes / cycle throughput

    mileage may vary

    Intel Core Micro-architecture Notable Features

    Intel Advanced Digital Media Boost

    Supplemental SSE-3 (SSSE-3)

    Packed SIGN: PSIGNB/W/D

    Packed Shuffle Bytes: PSHUFB

    Packed Multiply High with Round and Scale: PMULHRSW

    Multiply and Add Packed Signed/Unsigned bytes: PMADDUBSW

    Packed Align Right: PALIGNR

    Packed Absolute Values: PABSB, PABSW, PABSD

    Horizontal Addition/Subtraction: PHADDW, PHADDSW, PHADDD, PHSUBW, PHSUBSW, PHSUBD
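To make a few of these instructions concrete, the sketch below models the 128-bit forms of PSHUFB, PABSB and PHADDW on plain Python lists. It is a behavioral illustration only: the function names mirror the mnemonics, registers are represented as lists of 16 bytes or 8 words (lowest element first), and nothing here touches real SIMD hardware.

```python
def pshufb(src, mask):
    """Byte shuffle: mask bit 7 zeroes the lane, else low 4 bits select a source byte."""
    return [0 if m & 0x80 else src[m & 0x0F] for m in mask]


def pabsb(src):
    """Absolute value of each signed byte; -128 becomes 128 (0x80 as unsigned)."""
    return [abs(x) for x in src]


def phaddw(dst, src):
    """Add adjacent 16-bit pairs: dst pairs fill the low half, src pairs the high half."""
    def pair_sums(words):
        return [(words[i] + words[i + 1]) & 0xFFFF for i in range(0, len(words), 2)]
    return pair_sums(dst) + pair_sums(src)


reverse_mask = list(range(15, -1, -1))
print(pshufb(list(range(16)), reverse_mask))
print(pabsb([-1, 2, -128]))
print(phaddw([1, 2, 3, 4, 5, 6, 7, 8], [10, 20, 30, 40, 50, 60, 70, 0xFFFF]))
```

A single PSHUFB can reverse, broadcast or zero bytes depending on the mask, which is why it is useful for the video encode/decode kernels these extensions target.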

    Intel Core Micro-architecture Notable Features (cont.)

    Intelligent Power Capability

    Advanced power gating & Dynamic power coordination

    Multi-point demand-based switching

    Voltage-Frequency switching separation

    Supports transitions to deeper sleep modes

    Event blocking

    Clock partitioning and recovery

    Dynamic Bus Parking

    During periods of high performance execution, many parts of the chip core can be shut off

    Agenda

    Introduction

    Knowledge preparation

    Notable features

    Micro-architecture tour

    Front End

    Out-Of-Order Execution Core

    Memory Sub-system

    Coding considerations


    Intel Core Micro-architecture Drill-down

    [Block diagram. Front end: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS. Execution core: register alias table, ALLOC, Re-Order Buffer, Reservation Station, execution units (integer, FP, SIMD, 3x), load, store address, store data. Memory sub-system: memory order buffer, data cache unit, page miss handler.]


    Agenda

    Introduction

    Knowledge refreshment

    Notable features

    Micro-architecture tour

    Front End

    Out-Of-Order Execution Core

    Memory Sub-system

    Coding considerations


    Core Micro-architecture Front End

    Prepares instructions before they are executed

    Instruction Fetch Unit

    Instruction Queue

    Instruction Decode Unit

    Branch Prediction Unit

    [Block diagram: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS]


    Instruction Queue

    Buffer between instruction pre-decode unit and decoder

    Up to six predecoded instructions written per cycle

    18 instructions contained in IQ

    Up to 5 instructions read from IQ

    Potential loop cache

    Loop Stream Detector (LSD) support

    Re-use of decoded instructions

    Potential power saving


    Macro-Fusion

    Roughly 15% of all instructions are conditional branches.

    Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.

    Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.

    Not supported in EM64T long mode

    [Diagram: the fused micro-op "cmpjae eax, [mem], label" passes through the Scheduler to Execution and Branch Eval; flags and target go to Write back]


    Macro-Fusion Absent

    Instruction Queue:

        addps xmm0, [EAX+16]
        mulps xmm0, xmm0
        movps [EAX+240], xmm0
        cmp eax, 100000
        jge label

    Cycle 1: decoders dec0 to dec3 read four instructions from the Instruction Queue

    Cycle 2: jge label is decoded

    Each instruction gets decoded into separate uops

    Enabling Example

    for (int i=0; i


    Macro-Fusion Present

    Instruction Queue:

        addps xmm0, [EAX+16]
        mulps xmm0, xmm0
        movps [EAX+240], xmm0
        cmpjae eax, 100000, label

    Cycle 1: read five instructions from the Instruction Queue

    Send the fusable pair to a single decoder (dec0 to dec3)

    A single uop represents two instructions

    Enabling Example

    for (unsigned int i=0;i
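
    The two enabling examples differ only in the signedness of the loop counter: the signed loop compiles to cmp + jge, which is shown unfused, while the unsigned loop compiles to cmp + jae, which is shown fused. The sketch below encodes that rule as a small predicate. The slides do not spell out the full rule set; the sets of branch conditions here are an assumption based on Intel's published optimization guidance for this generation (CMP fuses only with branches that test the carry and/or zero flag, TEST fuses with any conditional branch) and should be checked against that guidance. The name `can_macro_fuse` is hypothetical.

```python
# Assumed: conditional branches that test only CF and/or ZF
# (unsigned comparisons and equality tests).
CF_ZF_JCC = {
    "ja", "jnbe", "jae", "jnb", "jnc",
    "jb", "jnae", "jc", "jbe", "jna",
    "je", "jz", "jne", "jnz",
}

# Assumed: the remaining conditional branches (signed, overflow, sign, parity).
OTHER_JCC = {
    "jl", "jnge", "jge", "jnl", "jle", "jng", "jg", "jnle",
    "jo", "jno", "js", "jns", "jp", "jpe", "jnp", "jpo",
}


def can_macro_fuse(first: str, branch: str, long_mode: bool = False) -> bool:
    """Return True if the (first, branch) pair is a macro-fusion candidate."""
    if long_mode:
        return False                       # slide: not supported in EM64T long mode
    first, branch = first.lower(), branch.lower()
    if first == "cmp":
        return branch in CF_ZF_JCC         # unsigned / equality conditions only
    if first == "test":
        return branch in CF_ZF_JCC or branch in OTHER_JCC
    return False


# Usage: the two loop forms from the enabling examples.
signed_loop_fuses = can_macro_fuse("cmp", "jge")     # signed counter
unsigned_loop_fuses = can_macro_fuse("cmp", "jae")   # unsigned counter
```

    Under these assumptions, declaring a loop counter unsigned where the algorithm permits it is enough to turn the loop-closing compare and branch into a fusable pair.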


    Instruction Decode / Micro-Op Fusion

    Frequent pairs of micro-operations derived from the same macro instruction can be fused into a single micro-operation

    Micro-op fusion effectively widens the pipeline


    Instruction Decode / Micro-Fusion (cont.)

    u-ops of a store: movps [EAX+240], xmm0

    Without micro-fusion: sta eax+240 and std xmm0, [eax+240]

    With micro-fusion: st xmm0, [eax+240]


    Branch Prediction Improvements

    Intel Pentium 4 Processor branch prediction PLUS the following two improvements:

    Indirect Branch Predictor

    Loop Detector

    Branch mispredictions reduced by >20%


    Agenda

    Introduction

    Knowledge preparation

    Notable features

    Micro-architecture tour

    Front End

    Out-Of-Order Execution Core

    Memory Sub-system

    Coding considerations


    Core Micro-architecture Execution Core

    Accepts decoded u-ops, assigns resources, executes and retires u-ops

    Renamer

    Reservation station (RS)

    Issue ports

    Execution Unit

    [Block diagram: register alias table, ALLOC, Re-Order Buffer, Reservation Station, execution units (integer, FP, SIMD, 3x), load, store address, store data]


    Execution Core Building Blocks

    Blocks: Renamer, RS, ROB, Execution Unit, Memory Sub-system

    Execution Unit                   Ports (number)
    Floating Point                   0, 1, 5
    Integer                          0, 1, 5
    SIMD Integer, SIMD/Integer MUL   0, 1, 5

    Memory Sub-system                Ports (number)
    Load                             2
    Store                            3, 4


    Issue Ports and Execution Units

    6 dispatch ports from RS

        3 execution ports (shared for integer / fp / simd)

        load

        store (address)

        store (data)

    128-bit SSE implementation

        Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)

        Port 1 has packed add (3 cycles, all precisions)
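
    The latencies above matter most when operations depend on one another. The following back-of-the-envelope sketch contrasts a dependent chain of packed multiplies with independent multiplies flowing through the pipelined unit. It assumes that one new multiply can start every cycle (`issue_interval=1`); the slide says the unit is pipelined but does not state the issue rate, so treat that parameter as an assumption. Both function names are hypothetical.

```python
def dependent_chain_cycles(n_ops: int, latency: int) -> int:
    """Each operation must wait for the previous result: latencies add up."""
    return n_ops * latency


def pipelined_cycles(n_ops: int, latency: int, issue_interval: int = 1) -> int:
    """Independent operations overlap: pay the latency once, then one issue slot each."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * issue_interval


MULPS_LATENCY = 4   # single-precision packed multiply latency, from the slide

# Usage: eight multiplies, dependent versus independent.
chain = dependent_chain_cycles(8, MULPS_LATENCY)
overlapped = pipelined_cycles(8, MULPS_LATENCY)
```

    With these numbers a chain of eight dependent multiplies needs 32 cycles, while eight independent ones need 11 under the stated assumption, which is the usual argument for splitting a long reduction into several independent accumulators.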

    Retirement Unit

    ReOrder Buffer (ROB)

    Holds micro-ops in various stages of completion

    Buffers completed micro-ops

    Updates the architectural state in order

    Manages ordering of exceptions

    [Block diagram: register alias table, ALLOC, Re-Order Buffer, Reservation Station]

    Agenda

    Introduction

    Knowledge preparation

    Notable features

    Micro-architecture tour

    Front End

    Out-Of-Order Execution Core

    Memory Sub-system

    Coding considerations

    Core Micro-architecture Memory Sub-System

    Memory Ordering Buffer

    Store Address Buffer

        Stores the address of each store not actually performed

        Loads compare their address to any store older than themselves

        If it finds a hole

    Store Data Buffer

        Stores the data of each store not actually performed

        If a load hits in the SAB, the data is forwarded from here

    Load Buffer

        Stores the address of non-retired loads

        For snoops and re-dispatch

    One 128-bit load and one 128-bit store per cycle to different memory locations

    Out-of-order memory operations

    Core Micro-architecture Memory Sub-System (cont.)

    32 KB D-Cache (8-way, 64-byte line size)

    Shared second level (L2) 2MB 8-way or 4MB 16-way instruction and data cache

    Cache to cache transfer

    improves producer / consumer style MP

    Wider interface to L2

    reduced interference

    processor line fill is 2 cycles

    Higher bandwidth from the L2 cache to the core

    ~14 clock latency and 2 clock throughput

    Load & Store Access order

        1. L1 cache of immediate core

        2. L1 cache of the other core

        3. L2 cache

        4. Memory

    [Diagram: Core1 and Core2 share a 2 MB L2 Cache connected to the Bus]

    Advanced Memory Access / Enhanced Data Pre-fetch Logic (cont.)

    L1D cache prefetching

        Data Cache Unit Prefetcher

            Known as the streaming prefetcher

            Recognizes ascending access patterns in recently loaded data

            Prefetches the next line into the processor's cache

        Instruction Based Stride Prefetcher

            Prefetches based upon a load having a regular stride

            Can prefetch forward or backward 2 Kbytes (1/2 default page size)

    L2 cache prefetching: Data Prefetch Logic (DPL)

        Prefetches data to the 2nd level cache before the DCU requests the data

        Maintains 2 tables for tracking loads

            Upstream: 16 entries

            Downstream: 4 entries

        Every load is either found in the DPL or generates a new entry

        Upon recognition of the 2nd load of a stream the DPL will prefetch the next load
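
    The idea behind an instruction-based stride prefetcher can be sketched as a small software model: track, per load instruction, the last address and the last stride, and predict the next address once the same stride repeats. This is a toy illustration and not a description of the hardware tables; the class name, the dictionary-based history, and the "same stride seen twice" trigger are all assumptions. Only the 2 Kbyte forward/backward limit comes from the slide.

```python
MAX_STRIDE_BYTES = 2048   # from the slide: forward or backward 2 Kbytes


class StridePrefetcherModel:
    """Toy model: one history entry per load instruction address (ip)."""

    def __init__(self):
        self._history = {}   # ip -> (last_address, last_stride or None)

    def observe(self, ip: int, address: int):
        """Record a load; return the predicted next address, or None."""
        entry = self._history.get(ip)
        if entry is None:
            self._history[ip] = (address, None)
            return None
        last_address, last_stride = entry
        stride = address - last_address
        self._history[ip] = (address, stride)
        # Predict only for a repeated, non-zero stride within the 2 KB window.
        if stride != 0 and stride == last_stride and abs(stride) <= MAX_STRIDE_BYTES:
            return address + stride
        return None


# Usage: a load walking an array of 64-byte records.
model = StridePrefetcherModel()
predictions = [model.observe(0x401000, 0x10000 + 64 * i) for i in range(4)]
```

    In this model the first two accesses only establish the stride; from the third access on, each load yields a prediction one stride ahead, and a negative stride works the same way for descending walks.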

    Advanced Memory Access / Memory Disambiguation

    Memory Disambiguation predictor

    Loads that are predicted NOT to forward from a preceding store are allowed to schedule as early as possible, increasing the performance of OOO memory pipelines

    Disambiguated loads checked at retirement

    Extension to existing coherency mechanism

    Invisible to software and system

    Advanced Memory Access / Memory Disambiguation Absent

    Load4 must WAIT until previous stores complete

    [Diagram: instruction sequence Store1 Y, Load2 Y, Store3 W, Load4 X operating on memory locations holding Data W, Data X, Data Y, Data Z]

    Advanced Memory Access / Stores Forwarding

    If a load follows a store and reloads the data that the store writes to memory, the micro-architecture can forward the data directly from the store to the load

    [Diagram: Store1 Y, Load2 Y, Data Y, Internal Buffers, Memory]

    Advanced Memory Access / Stores Forwarding: Aligned Store Cases

    [Diagram: loads contained in an aligned store]

    Store size   Loads shown within the stored data
    128 bit      16 x ld 8, 8 x load 16, 4 x load 32 bit, 2 x load 64 bit, 1 x load 128 bit
    64 bit       8 x ld 8, 4 x load 16, 2 x load 32 bit, 1 x load 64 bit
    32 bit       4 x ld 8, 2 x load 16, 1 x load 32 bit
    16           2 x ld 8, 1 x load 16

    Advanced Memory Access / Stores Forwarding: Unaligned Cases

    Note that unaligned store forward does not occur when the load crosses a cache line boundary

    [Diagram: loads of 8, 16, 32, and 64 bits shown against unaligned stores of 64, 32, and 16 bits. Legend: store forwarded to load; no forwarding if the load crosses a cache line boundary]

    Note: Unaligned 128-bit stores are issued as two 64-bit stores. This provides two alignments for store forwarding
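
    The cache-line condition in the note can be checked with simple address arithmetic. The sketch below tests two necessary conditions for a load to be a forwarding candidate: the load's bytes lie entirely inside the bytes written by the store, and the load does not cross a 64-byte cache line (the line size given earlier for the D-Cache). It deliberately ignores the finer alignment cases shown in the diagrams, so a True result means "not ruled out", not "guaranteed to forward". All function names are hypothetical.

```python
CACHE_LINE_BYTES = 64   # from the D-Cache description: 64-byte line size


def crosses_cache_line(address: int, size: int) -> bool:
    """True if the byte range [address, address + size) spans two cache lines."""
    first_line = address // CACHE_LINE_BYTES
    last_line = (address + size - 1) // CACHE_LINE_BYTES
    return first_line != last_line


def load_within_store(store_addr: int, store_size: int,
                      load_addr: int, load_size: int) -> bool:
    """True if every byte the load reads was written by the store."""
    return (store_addr <= load_addr
            and load_addr + load_size <= store_addr + store_size)


def forwarding_not_ruled_out(store_addr: int, store_size: int,
                             load_addr: int, load_size: int) -> bool:
    """Necessary (not sufficient) conditions for store-to-load forwarding."""
    return (load_within_store(store_addr, store_size, load_addr, load_size)
            and not crosses_cache_line(load_addr, load_size))


# Usage: an 8-byte load at offset 60 straddles the line boundary at 64.
straddles = crosses_cache_line(60, 8)
```

    A check like this is cheap to run over a trace of store and load addresses when looking for data layouts that defeat forwarding.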

    Agenda

    Introduction

    Knowledge preparation

    Notable features

    Micro-architecture tour

    Coding considerations

    Optimizing for Instruction Fetch and PreDecode

    Avoid Length Changing Prefixes (LCPs)

        Affects instructions with immediate data or offset

        Operand Size Override (66H)

        Address Size Override (67H) [obsolete]

    LCPs change the length decoding algorithm, increasing the processing time from one cycle to six cycles (or eleven cycles when the instruction spans a 16-byte boundary)

    The REX (EM64T) prefix (4xH) is not an LCP

        The REX prefix does lengthen the instruction by one byte, so use of the first eight general registers in EM64T is preferred
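
    The cycle counts quoted for LCPs can be turned into a tiny cost function, which is handy when scanning a disassembly for expensive instructions. The sketch uses only the numbers from the slide (one, six, or eleven cycles); the function name and the choice to describe an instruction by its start address and byte length are assumptions made for illustration.

```python
FETCH_BLOCK_BYTES = 16   # the slide's 16-byte boundary


def length_decode_cycles(has_lcp: bool, start_address: int, length: int) -> int:
    """Pre-decode cost of one instruction, using the slide's cycle counts."""
    if not has_lcp:
        return 1
    first_block = start_address // FETCH_BLOCK_BYTES
    last_block = (start_address + length - 1) // FETCH_BLOCK_BYTES
    spans_boundary = first_block != last_block
    return 11 if spans_boundary else 6


# Usage: a 4-byte instruction carrying a 66H prefix and a 16-bit immediate,
# placed at two different offsets.
inside_block = length_decode_cycles(True, 0x1000, 4)
across_blocks = length_decode_cycles(True, 0x100E, 4)
```

    The same instruction costs six cycles when it sits inside one 16-byte block and eleven when it straddles two, so both avoiding the prefix and controlling code alignment matter in hot loops.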

    Optimizing for Instruction Queue

    Includes a Loop Stream Detector (LSD)

        Potentially very high bandwidth instruction streaming

    A number of requirements to make use of the LSD

        Maximum of 18 instructions in up to four 16-byte packets

        No RET instructions (hence, little practical use for CALLs)

        Up to four taken branches allowed

        Most effective at 70+ iterations

    LSD is after PreDecode so there is no added cost for LCPs

    Trade-off LSD with conventional loop unrolling
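
    The LSD requirements listed above can be collected into a single checklist function, for example to flag loops in a profile that are too large to stream. This is a sketch of the slide's stated limits only; the function names and the way a loop is described (instruction count, byte range, taken branches, presence of RET) are assumptions for illustration, and the 70+ iteration guideline is reported separately because it is a profitability hint, not a hard requirement.

```python
PACKET_BYTES = 16


def fetch_packets(start_address: int, end_address: int) -> int:
    """Number of aligned 16-byte packets touched by bytes [start, end)."""
    if end_address <= start_address:
        return 0
    first = start_address // PACKET_BYTES
    last = (end_address - 1) // PACKET_BYTES
    return last - first + 1


def lsd_eligible(n_instructions: int, n_packets: int,
                 n_taken_branches: int, has_ret: bool) -> bool:
    """Check a loop body against the limits given on the slide."""
    return (n_instructions <= 18
            and n_packets <= 4
            and n_taken_branches <= 4
            and not has_ret)


def lsd_worthwhile(iterations: int) -> bool:
    """Slide guideline: most effective at 70 or more iterations."""
    return iterations >= 70


# Usage: a 12-instruction loop occupying bytes 0x2008..0x2037.
packets = fetch_packets(0x2008, 0x2038)
eligible = lsd_eligible(12, packets, n_taken_branches=1, has_ret=False)
```

    Note that the packet count depends on alignment: the 48-byte loop in the usage example touches four packets because it starts mid-packet, whereas the same loop aligned to 16 bytes would touch three.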

    Optimizing for Decode

    Decoder issues up to 4 uOps for renaming/allocation per clock

        This creates a trade-off between more complex instruction uOps and multiple simple instruction uOps

        For example, a single four uOp instruction is all that can be renamed/allocated in a single clock

        In some cases, multiple simple instructions may be a better choice than a single complex instruction

    Single uOp instructions allow more decoder flexibility

        For example, 4-1-1-1 can be decoded in one clock

        However, 2-2-2-1 takes three clocks to decode
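
    The 4-1-1-1 and 2-2-2-1 examples follow from a simple greedy model of the four decoders: in each clock the first pending instruction goes to the one decoder that can emit up to four uOps, and up to three following instructions are decoded in the same clock only if each is a single uOp. The sketch below implements that model. It is an assumption-laden simplification (in-order assignment, no microcode sequencer, no rename/allocation limit), and `decode_clocks` is a hypothetical name.

```python
def decode_clocks(uop_counts) -> int:
    """Clocks needed to decode a sequence of instructions given their uOp counts."""
    clocks = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        if not 1 <= uop_counts[i] <= 4:
            raise ValueError("this model only covers instructions of 1 to 4 uOps")
        clocks += 1
        i += 1                      # complex-capable decoder takes the first instruction
        simple_used = 0
        # The three remaining decoders accept single-uOp instructions only.
        while i < n and simple_used < 3 and uop_counts[i] == 1:
            i += 1
            simple_used += 1
    return clocks


# Usage: the two patterns from the slide.
pattern_4111 = decode_clocks([4, 1, 1, 1])
pattern_2221 = decode_clocks([2, 2, 2, 1])
```

    In the 2-2-2-1 case every two-uOp instruction has to wait for the first decoder, so three of the four decoders sit idle for most of the sequence; reordering or replacing multi-uOp instructions so that they are separated by single-uOp ones improves the packing.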

    Optimizing for Execution

    Up to six uOps can be dispatched per clock

        Store Data and Store Address dispatch ports are combined on the block diagram

    Up to four results can be written back per clock

    Single clock latency operations are best

        Differing latency operations can create writeback conflicts

        Separate multiple-clock uOps with several single uOp instructions

        Typical instructions here: ADC/SBB, RMW, CMOVcc

        In some cases, separating an RMW instruction into its pieces might be faster (decode and scheduling flexibility)

    When equivalent, PS preferred to PD (LCP)

        For example, MOVAPS over MOVAPD, XORPS over XORPD

    Optimizing for Execution (cont.)

    Bypass register access preferred to register reads

    Partial register accesses often lead to stalls

        Register size access that conflicts with recent previous register write

        Partial XMM updates subject to dependency delays

        Partial flag stall can occur, too (much higher cost)

        Use TEST instruction between shift and conditional to prevent

    Common zeroing instructions (e.g., XOR reg,reg) don't stall

    Avoid bypass between execution domains

        For example: FP (ADDPS) and logical ops (PAND) on XMMn

    Vectorization: careful packing/unpacking sequence

    Use MXCSR's FZ and DAZ controls as appropriate

    Optimizing for Memory

    Software prefetch instructions

    Can reach beyond a page boundary (including page walk)

    Prefetches only when it completes without an exception

    General techniques to help these prefetchers

    Organize data in consecutive lines

    In general, increasing addresses are more easily prefetched

    Summary

    What has been covered

    Notable features of Core Micro-architecture

    Wide Dynamic Execution

    Advanced Memory Access

    Advanced Smart Cache

    Advanced Digital Media Boost

    Power Efficient Support

    Core Micro-architecture components

    Front End

    OOO execution core

    Memory sub-system

    Platform

    Intel provides most of the silicon on any computer

    Classical platform partition

        CPU: Computation

        MCH: high speed IO

        ICH: low speed IO

    Graphics speed and memory latencies will require a different partition

    This presentation focuses on the core microarchitecture

    [Platform block diagram. Labels: CPU (Core, Core, LLC), FSB, MCH, Graphics, MEM, DDR, PEG, PCIe, Display, TV out, Analog, ME, HD video, DMI, ICH, PCI (IO), SATA, USB, KBRD, others, Wireless, Legacy & Debug I/O]

    Intel 64 = Extending IA-32 to 64 Bit

    Added to Intel Xeon and Pentium 4 Processor in 2004; today available in all mainstream Intel IA-32 processors, in particular in all processors based on Intel Core Architecture

    Additional Registers: 8 SSE & 8 General Purpose

    + Double Precision (64-bit) Integer Support

    + Extended Memory Addressability: 64-Bit Pointers, Registers

    = With 64-Bit Extension Technology

    Intel 64 - New Modes of Operation

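    Mode                           OS required                  Recompile  Default    Default       GPR    New   RIP   64-bit
                                                                required   addr size  operand size  width  regs  rel.  IP
    Legacy Mode (IA32 Mode)        Legacy 32-bit or 16-bit OS   No         32 (16)    32 (16)       32     No    No    No
    Long Mode: Compatibility Mode  New 64-bit OS                No         32 (16)    32 (16)       32     No    No    Yes
    Long Mode: 64-bit Mode         New 64-bit OS                Yes        64         32            64     Yes   Yes   Yes

    (New regs, RIP rel. and 64-bit IP are the "New Features" columns; address and operand size are the "Defaults" columns.)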

    Registers: Extensions and Additions

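    General-purpose registers, extended from 32 bits (bits 31..0) to 64 bits (bits 63..0):

    EAX -> RAX, EBX -> RBX, ECX -> RCX, EDX -> RDX, EBP -> RBP, ESI -> RSI, EDI -> RDI, ESP -> RSP

    New general-purpose registers: R8, R9, R10, R11, R12, R13, R14, R15

    Instruction pointer: EIP -> RIP

    SSE registers (128 bits, bits 127..0): XMM0 to XMM15 (XMM8 to XMM15 are new)

    X87/MMX registers (80 bits, bits 79..0)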

    Registers: Availability in different modes

    64-bit Mode of Operation

    Default data size is 32 bits

    Override to 64 bits using the new REX prefix

    All registers are 64-bit, 32-bit, 16-bit and 8-bit addressable

    REX prefixes

    A family of 16 prefixes, encoded 0x40-0x4F

    Allow the use of general purpose registers as 64 bits

    Allow the use of new registers (like r8-r15)

    Instructions that set a 32-bit register automatically zero-extend the upper 32 bits

    REX Prefix

    A new instruction-prefix byte used in 64-bit mode

    Specify the new GPRs and SSE registers

    Specify a 64-bit operand size.

    Specify extended control registers (used by system software)

    An instruction can have only one REX prefix, which, if used, must immediately precede the opcode or the two-byte opcode escape prefix.

    The legacy instruction-size limit of 15 bytes still applies to instructions that contain a REX prefix.
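    As an illustration of the 0x40-0x4F encoding, the sketch below tests whether a byte is a REX prefix and splits its low four bits. The W/R/X/B bit layout (0100WRXB) comes from the Intel architecture manuals rather than from the slide, and the names RexBits, is_rex_prefix and decode_rex are hypothetical helpers chosen for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical holder for the four low bits of a REX prefix (0100WRXB). */
typedef struct {
    bool w; /* 1 = 64-bit operand size */
    bool r; /* extends the ModRM reg field (reaches r8-r15, xmm8-xmm15) */
    bool x; /* extends the SIB index field */
    bool b; /* extends the ModRM r/m, SIB base or opcode register field */
} RexBits;

/* A byte is a REX prefix in 64-bit mode when its high nibble is 0x4. */
bool is_rex_prefix(uint8_t byte) {
    return (byte & 0xF0) == 0x40;
}

/* Split the low nibble of a REX byte into its four flag bits. */
RexBits decode_rex(uint8_t byte) {
    RexBits bits;
    bits.w = (byte >> 3) & 1;
    bits.r = (byte >> 2) & 1;
    bits.x = (byte >> 1) & 1;
    bits.b = byte & 1;
    return bits;
}
```

    For example, 0x48 is the prefix commonly seen in front of 64-bit register-to-register moves: only W is set.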

    Physical and Linear Addressing

    Linear Addressing

    Initial Intel 64 implementations support 48 bits of virtual addressing.

    Addresses are required to be in canonical form: bits 47 through 63 must all be 1 or all be 0.

    Physical Addressing

    Initial NetBurst Intel 64 implementations support 36 bits; today all current processors support at least 40 bits

    Entries in page tables are expanded for up to 52 bits of physical address.

    Intel 64 - Large Memory Considerations

    Canonical addressing for 64 bit addresses

    Although the architecture now allows calculating flat addresses to 64 bits, today's processors limit virtual addressing to 48 bits

    Canonical address definition: an address that has address bits 63 through 47 set to either all ones or all zeros

    Canonical addresses are a requirement

    Values for addresses that are not canonical will cause faults when put into locations expecting a valid address, such as segment registers
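    The canonical-form rule can be checked in a few lines. This is a minimal sketch assuming the 48-bit virtual address width described here; is_canonical48 is a hypothetical helper name, not part of any Intel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when bits 63..47 of addr are all zeros or all ones,
 * i.e. the address is canonical for a 48-bit virtual address space. */
bool is_canonical48(uint64_t addr) {
    uint64_t upper = addr >> 47;            /* the 17 bits 63..47 */
    return upper == 0 || upper == 0x1FFFF;  /* all clear or all set */
}
```

    The highest canonical address in the lower half is 0x00007FFFFFFFFFFF, and the lowest in the upper half is 0xFFFF800000000000; everything in between is non-canonical.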

    Introducing SIMD: Single Instruction Multiple Data

    Scalar processing

    traditional mode

    one operation produces one result: X + Y

    SIMD processing

    with SSE / SSE2

    one operation produces multiple results: (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0)
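    The contrast can be sketched in C with the SSE intrinsics from xmmintrin.h. The function names add4_scalar and add4_sse are hypothetical, and the example assumes an x86 compiler with SSE enabled (the default on x86-64).

```c
#include <xmmintrin.h> /* SSE intrinsics */

/* Scalar processing: four separate additions, one result each. */
void add4_scalar(const float *x, const float *y, float *out) {
    for (int i = 0; i < 4; ++i) {
        out[i] = x[i] + y[i];
    }
}

/* SIMD processing: one packed add produces all four results. */
void add4_sse(const float *x, const float *y, float *out) {
    __m128 vx = _mm_loadu_ps(x);            /* x3 x2 x1 x0 */
    __m128 vy = _mm_loadu_ps(y);            /* y3 y2 y1 y0 */
    _mm_storeu_ps(out, _mm_add_ps(vx, vy)); /* x3+y3 ... x0+y0 */
}
```

    The unaligned load/store variants are used so the sketch works on any float array; aligned data could use _mm_load_ps and _mm_store_ps instead.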

    X86 Register Sets

    SSE registers introduced first in Pentium 3

    SSE Registers (xmm0 ... xmm7, 128 bits)

    Eight 128-bit registers

    Hold data only: 4 x single FP numbers, 2 x double FP numbers, 128-bit packed integers

    Direct access to the registers

    Use simultaneously with FP / MMX Technology

    MMX Technology / IA-FP Registers (st0 ... st7, mm0 ... mm7, 80/64 bits)

    Eight 80/64-bit registers

    Hold data only

    Stack access to FP0..FP7

    Direct access to MM0..MM7

    No MMX Technology / FP interoperability

    IA-INT Registers (eax ... edi, 32 bits)

    Fourteen 32-bit registers

    Scalar data & addresses

    Direct access to regs

    Instruction Set Extensions

    Beginning in 2008: ~50 new instructions in 13 groups

    All function in 32-bit and 64-bit modes

    Improvements in commercial data integrity (iSCSI), video processing, string and text processing, 2D & 3D imaging, vectorizing compiler performance

    [Chart: New Instructions Added to Intel Processors]

    Date     Extension                                         New instructions   Process (nm)
    Jan-97   MMX                                               56                 350
    Feb-99   Streaming SIMD Extensions (SSE)                   70                 250
    Dec-00   Streaming SIMD Extensions 2 (SSE2)                144                180
    Feb-04   Streaming SIMD Extensions 3 (SSE3)                13                 90
    Jul-06   Supplemental SSE3 (SSSE3)                         32                 65
    2008+    Future Intel instruction set extensions (SSE-4)   ~50                45

    Chart annotation: a further future process step at ~32 nm

    SSE and SSE-2 Data Types

    SSE: 4x floats

    SSE-2: 16x bytes, 8x 16-bit shorts, 4x 32-bit integers, 2x 64-bit integers, 1x 128-bit(!) integer, 2x doubles

    SSE Instruction Set Extensions

    2001 PTE Engineering Enabling Conference

    Introduced by Pentium 3 in 1999; now frequently called SSE-1

    Only new data type supported: 4 x 32-bit (single precision) floating point data

    Some 70 instructions

    Arithmetic, compare, convert operations on SSE SP FP data (packed and unpacked)

    Data load/store

    Prefetch

    Extension of MMX

    Streaming Store (store without using cache in between)

    SSE Sample: Branch Removal

    R = (A
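    The slide's expression is cut off in this transcript, so the following is only a sketch of the general technique, under the assumption that the sample is a conditional select of the form R = (A < B) ? C : D. A packed compare produces an all-ones or all-zeros mask per element, and the mask then picks C or D with bitwise operations instead of a branch. The function name select_lt4 is hypothetical.

```c
#include <xmmintrin.h> /* SSE intrinsics */

/* Assumed form: r[i] = (a[i] < b[i]) ? c[i] : d[i] for four floats, no branch. */
void select_lt4(const float *a, const float *b,
                const float *c, const float *d, float *r) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);
    __m128 vd = _mm_loadu_ps(d);
    __m128 mask = _mm_cmplt_ps(va, vb);        /* all ones where a < b */
    __m128 take_c = _mm_and_ps(mask, vc);      /* c where mask is set */
    __m128 take_d = _mm_andnot_ps(mask, vd);   /* d where mask is clear */
    _mm_storeu_ps(r, _mm_or_ps(take_c, take_d));
}
```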


    SSE-2 was introduced by the Intel Pentium 4 processor in 2000

    Some 140 new instructions

    Added double precision floating point data (2 x 64-bit) and all related instructions including conversion

    Again some extensions to MMX

    Added all possible combinations of integer data to SSE (1x128, 2x64, 4x32, 8x16, 16x8) and related operations

    SIMD Single vs. SIMD Double

    4 x Single Precision: SSE-1

    SIMD SP FP Operand = 4 Elements: X3, X2, X1, X0 (bits 127..0)

    Element = SP FP Number: S (bit 31), Exponent (bits 30..23), Significand (bits 22..0)

    2 x Double Precision: SSE-2

    SIMD DP FP Operand = 2 Elements: X1, X0 (bits 127..0)

    Element = DP FP Number: S (bit 63), Exponent (bits 62..52), Significand (bits 51..0)
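    The bit positions above can be made concrete by splitting a float and a double into their three fields. This is a small sketch assuming IEEE 754 storage for float and double (true on the processors discussed here); SpFields, DpFields, split_sp and split_dp are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t sign, exponent, significand; } SpFields;
typedef struct { uint64_t sign, exponent, significand; } DpFields;

/* Single precision: S = bit 31, exponent = bits 30..23, significand = bits 22..0. */
SpFields split_sp(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* reinterpret without aliasing issues */
    SpFields r = { bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF };
    return r;
}

/* Double precision: S = bit 63, exponent = bits 62..52, significand = bits 51..0. */
DpFields split_dp(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    DpFields r = { bits >> 63, (bits >> 52) & 0x7FF, bits & 0xFFFFFFFFFFFFFULL };
    return r;
}
```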

    Sample for SSE-2: SIMD Double <-> SIMD Int Conversion

    SIMD Double -> SIMD Int: conversion to two lower ints, two higher ints cleared

    x:   [ x1 | x0 ]
    ix:  [ 0 | 0 | (int)x1 | (int)x0 ]

    __m128d x;
    __m128i ix;
    ix = _mm_cvtpd_epi32(x);

    SIMD Int -> SIMD Double: conversion from two lower ints

    ix:  [ ? | ? | ix1 | ix0 ]
    x:   [ (double)ix1 | (double)ix0 ]

    x = _mm_cvtepi32_pd(ix);
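    The two conversions can be wrapped into a complete, compilable sketch. The helper names doubles_to_ints and ints_to_doubles are hypothetical; the intrinsics are the ones shown on the slide plus unaligned loads and stores from emmintrin.h. Note that _mm_cvtpd_epi32 rounds according to the current MXCSR rounding mode (round-to-nearest by default), which differs from a C (int) cast for non-integral values; _mm_cvttpd_epi32 is the truncating variant.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Convert two doubles to ints; the two upper int lanes come back as zero. */
void doubles_to_ints(const double in[2], int out[4]) {
    __m128d x = _mm_loadu_pd(in);           /* in[0] is the low element x0 */
    __m128i ix = _mm_cvtpd_epi32(x);
    _mm_storeu_si128((__m128i *)out, ix);
}

/* Convert the two lower ints to doubles; the two upper ints are ignored. */
void ints_to_doubles(const int in[4], double out[2]) {
    __m128i ix = _mm_loadu_si128((const __m128i *)in);
    __m128d x = _mm_cvtepi32_pd(ix);
    _mm_storeu_pd(out, x);
}
```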

    SSE3: No new Data Types but new Instructions

    Usage area                    Instructions
    SIMD FP using AOS format*     HADDPD, HSUBPD, HADDPS, HSUBPS
    Thread synchronization        MONITOR, MWAIT
    Video encoding                LDDQU
    Complex arithmetic            ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
    FP to integer conversions     FISTTP

    * Also benefits Complex and Vectorization

    Streaming SIMD Extensions 3: 13 new instructions

    Three have limited use for application performance improvement

    FISTTP - X87 to integer conversion (requires long double switch)

    MONITOR/MWAIT - thread synchronization

    Available today in Ring 0 only; being used by newer Windows* and Linux* thread packages

    The other ten have some potential for specific application domains

    SSE-3 Sample Complex Arithmetic: ADDSUBPS

    ADDSUBPS OperandA, OperandB

    OperandA (xmm register; 4 data elements): a3, a2, a1, a0

    OperandB (xmm register or memory address; 4 data elements): b3, b2, b1, b0

    Result (stored in OperandA): a3+b3, a2-b2, a1+b1, a0-b0

    __m128 _mm_addsub_ps(__m128 a, __m128 b)
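    To show why the alternating subtract/add pattern suits complex arithmetic, here is a plain-C reference model of the operation (not the intrinsic itself, so it compiles without SSE3 support). The names addsub_ps_model and complex_mul_model are hypothetical. For (a + bi)(c + di), lane 0 needs ac - bd and lane 1 needs ad + bc, which is exactly the subtract-then-add pattern of the two low lanes.

```c
/* Scalar model of ADDSUBPS: even lanes subtract, odd lanes add.
 * Index 0 is the lowest element (a0). The result overwrites a. */
void addsub_ps_model(float a[4], const float b[4]) {
    a[0] = a[0] - b[0];
    a[1] = a[1] + b[1];
    a[2] = a[2] - b[2];
    a[3] = a[3] + b[3];
}

/* Complex product using the model: x = x[0] + i*x[1], y = y[0] + i*y[1]. */
void complex_mul_model(const float x[2], const float y[2], float out[2]) {
    float lhs[4] = { x[0] * y[0], x[0] * y[1], 0.0f, 0.0f }; /* ac, ad */
    float rhs[4] = { x[1] * y[1], x[1] * y[0], 0.0f, 0.0f }; /* bd, bc */
    addsub_ps_model(lhs, rhs); /* lane 0: ac - bd, lane 1: ad + bc */
    out[0] = lhs[0]; /* real part */
    out[1] = lhs[1]; /* imaginary part */
}
```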

    Sample SSSE-3 Inst.: Byte Permute

    PSHUFB mm, mm/m64
    PSHUFB xmm, xmm/m128

    A complete byte-granularity permutation

    The source operand is used as the control field (variable control)

    The destination operand gets permuted

    Each byte of the source field selects the origin of the corresponding destination byte

    Also includes force-byte-to-zero flag (bit 7)

    Example (64-bit form, most significant byte first):

    dest (before):  0x04 0x01 0x07 0x03 0x02 0x02 0xFF 0x01
    src (control):  0x07 0x07 0xFF 0x80 0x01 0x00 0x00 0x00
    dest (after):   0x04 0x04 0x00 0x00 0xFF 0x01 0x01 0x01
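    The permutation rule can be written as a short scalar reference model of the 64-bit (mm) form. This is a sketch rather than the instruction itself; pshufb64_model is a hypothetical name, and array index 0 holds the least significant byte, so the slide's rows appear reversed in array order.

```c
#include <stdint.h>

/* Scalar model of PSHUFB mm, mm/m64.
 * For each byte i: if bit 7 of src[i] is set, dest[i] becomes 0;
 * otherwise dest[i] takes the old dest byte selected by src[i] bits 2..0. */
void pshufb64_model(uint8_t dest[8], const uint8_t src[8]) {
    uint8_t old[8];
    for (int i = 0; i < 8; ++i) {
        old[i] = dest[i]; /* snapshot, since dest is overwritten in place */
    }
    for (int i = 0; i < 8; ++i) {
        dest[i] = (src[i] & 0x80) ? 0 : old[src[i] & 0x07];
    }
}
```

    The 128-bit (xmm) form works the same way on 16 bytes, with the low four bits of each control byte used as the index.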

    Ways to SSE/SIMD programming

    Coding using SSE/SSE2/3/4 assembler instructions

    Very tedious (manual scheduling); discouraged: don't do it!

    E.g.: How do you exploit the benefits of now having 16 instead of 8 SSE registers for Intel 64 without maintaining two versions?

    Intel compilers' C/C++ SIMD intrinsics

    No need to take care of register allocation, scheduling etc.

    Intel compilers' C++ Vector Class Library

    Use this if you are heavy into C++ classes

    Vectorizer of Intel C++ and Fortran Compilers

    Recommended for most cases: easy and efficient

    Use ready-to-go vectorized code from a library like Intel Math Kernel Library (MKL)
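    As an illustration of the compiler-vectorizer route, the loop below is the kind of code an auto-vectorizer can typically handle: unit stride, no loop-carried dependency, and restrict-qualified pointers so the compiler may assume the arrays do not overlap. Whether a given compiler actually emits SSE code for it depends on the compiler and its switches; the function name saxpy is just the conventional name for this operation.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for n elements.
 * restrict tells the compiler x and y do not alias. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

    The source stays plain C, so the same code can be rebuilt for processors with 8 or 16 SSE registers without maintaining two versions.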

    Compiler Based Vectorization: Processor Specific

    Linux* switch    Generate code and optimize for

    -xP, -axP        Intel processors with SSE3 capability, including Pentium 4 (both 32 and 64 bit mode), including code generation for MMX, SSE, SSE2 and SSE-3

    -xN, -axN        Pentium 4 processors in 32-bit mode, including code generation for MMX, SSE and SSE2; deprecated switch: use -xW instead

    -xK, -axK        Pentium 3 compatible and Athlon XP processors, including code generation for MMX and SSE

    -xW, -axW        Pentium 4 compatible, Athlon 64, Opteron processors in 32 and 64 bit mode, including code generation for MMX, SSE and SSE2

    -xT, -axT        Intel processors with MNI capability: Intel Core 2 Duo processors (Conroe, Merom, Woodcrest), including code generation for MMX, SSE, SSE2, SSE-3 and MNI

    -xB, -axB        Pentium M processors, including code generation for MMX, SSE and SSE-2

    Intel Core Micro-architecture Notable Features (cont.): New Instructions

    Instruction name, followed by its description:

    PALIGNR mm, mm/m64, imm8
    PALIGNR xmm, xmm/m128, imm8
        Extract any contiguous 16 (8 in the 64-bit case) bytes from the pair [dst, src] and store them to the dst register.

    PSHUFB mm, mm/m64
    PSHUFB xmm, xmm/m128
        A complete byte-granularity permutation, including force-to-zero flag.

    PMULHRSW mm, mm/m64
    PMULHRSW xmm, xmm/m128
        Signed 16-bit multiply, return high bits (rounded and scaled).

    PMADDUBSW mm, mm/m64
    PMADDUBSW xmm, xmm/m128
        Multiply signed & unsigned bytes. Accumulate result to signed words. (Multiply Accumulate)

    phsubw/d/sw mm, mm/m64
    phsubw/d/sw xmm, xmm/m128
        Pairwise integer horizontal subtract + pack.

    phaddw/d/sw mm, mm/m64
    phaddw/d/sw xmm, xmm/m128
        Pairwise integer horizontal addition + pack.

    pabsb/w/d mm, mm/m64
    pabsb/w/d xmm, xmm/m128
        Per element, overwrite destination with absolute value of source.

    psignb/w/d mm, mm/m64
    psignb/w/d xmm, xmm/m128
        Per element, if the source operand is negative, multiply the destination operand by -1.
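    As one worked example from the list, PALIGNR can be modeled in scalar C for the 64-bit (mm) form. This is a reference sketch, not the instruction: palignr64_model is a hypothetical name, array index 0 is the least significant byte, and the treatment of shift counts past the end of the pair (zero fill) follows the Intel instruction reference rather than the slide.

```c
#include <stdint.h>

/* Scalar model of PALIGNR mm, mm/m64, imm8.
 * The 16-byte pair is [dst : src] with src in the low half; the pair is
 * shifted right by imm8 bytes and the low 8 bytes are written to dst. */
void palignr64_model(uint8_t dst[8], const uint8_t src[8], unsigned imm8) {
    uint8_t pair[16];
    for (int i = 0; i < 8; ++i) {
        pair[i] = src[i];     /* low half */
        pair[8 + i] = dst[i]; /* high half */
    }
    for (unsigned i = 0; i < 8; ++i) {
        unsigned k = i + imm8;
        dst[i] = (k < 16) ? pair[k] : 0; /* bytes shifted in from beyond the pair are zero */
    }
}
```

    A typical use is reading an unaligned window that straddles two aligned loads: with the first load as src and the second as dst, imm8 selects the byte offset of the window.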

    Dependencies and Bypasses

    Read-after-Write dependency: 1 clock stall, assuming the register file can be written through

    add eax, ecx     (writes eax)   F D E W
    sub ebx, eax     (writes ebx)     F D D E W

    E to D bypass: saves the clock penalty

    add eax, ecx     (writes eax)   F D E W
    sub ebx, eax     (writes ebx)     F D E W

    Long-latency operations

    load [ecx+edi]   (writes eax)   F D E E E W
    add ebx, eax     (writes ebx)     F D D D E W
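    The three timing diagrams can be reproduced with a toy model of the four-stage F/D/E/W pipeline. This is a simplified sketch under the assumptions visible in the diagrams: the producer is fetched in cycle 1, the dependent instruction in cycle 2, a written-through register file lets the consumer decode in the producer's W cycle, and a bypass lets it execute right after the producer's last E cycle. The names Timing, schedule, earliest_dependent_e and stall_cycles are hypothetical.

```c
/* Cycle numbers are 1-based. An instruction fetched in cycle f decodes in
 * f + 1 and would ideally execute in f + 2. */
typedef struct { int first_e, last_e, w; } Timing;

/* Place an instruction's E and W stages, honoring the earliest allowed E cycle. */
Timing schedule(int fetch_cycle, int exec_cycles, int earliest_e) {
    Timing t;
    int ideal_e = fetch_cycle + 2;
    t.first_e = (ideal_e > earliest_e) ? ideal_e : earliest_e;
    t.last_e = t.first_e + exec_cycles - 1;
    t.w = t.last_e + 1;
    return t;
}

/* Earliest E cycle for an instruction that reads the producer's result. */
int earliest_dependent_e(Timing producer, int has_bypass) {
    /* With bypass: right after the producer's last E.
     * Without: the last D overlaps the producer's W, so E follows W. */
    return has_bypass ? producer.last_e + 1 : producer.w + 1;
}

/* Stall cycles seen by a dependent one-cycle instruction fetched one cycle later. */
int stall_cycles(int producer_exec_cycles, int has_bypass) {
    Timing producer = schedule(1, producer_exec_cycles, 0);
    Timing consumer = schedule(2, 1, earliest_dependent_e(producer, has_bypass));
    return consumer.first_e - (2 + 2); /* actual E minus ideal E */
}
```

    In this model a one-cycle producer costs one stall without the bypass and none with it, while the three-cycle load still costs two stalls even with the bypass, matching the extra D stages in the diagrams.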

    Fighting Stalls: Branch Handling

    Given the code

  • 8/14/2019 Intel Processor Ar