Intel Cornelius



    Intel Itanium Architecture

    28-Jan-2003

    Herbert Cornelius, Technical Marketing Manager

    Intel EMEA, [email protected]


    EMEA HPTC Virtual Team

    Intel Itanium Architecture

    Copyright 2002-2003 Intel Corporation. *Other brands and names are the property of their respective owners.

    Useful URLs

    Intel Itanium 2 Processor:- www.intel.com/products/server/processors/server/itanium2/index.htm

    Intel Software Products:- www.intel.com/products/software/

    Intel Developer Services:- www.intel.com/ids/

    Intel Technology Journal:- www.intel.com/technology/itj/index.htm

    High-Performance Computing:- www.intel.com/ebusiness/trends/hpc.htm


    Agenda

    - Intel Itanium Architecture
    - Intel Itanium Processor
    - Intel Itanium 2 Processor
    - Platforms
    - Software Tools
    - Some Tuning Tips


    Extending Intel Architecture

    [Roadmap figure. Horizontal axis: years 00 to 03. Vertical axis: performance, scalability, mission critical.]

    - Itanium processor family: extends IA for the most demanding applications. Upcoming: Madison** (performance), Deerfield** (price/performance).
    - IA-32: outstanding performance for volume applications. Upcoming: Gallatin**.

    **Codename. All dates specified are target dates provided for planning purposes only and are subject to change.


    Intel Itanium Processor

    First implementation of the Intel Itanium Architecture, using innovative EPIC** technology

    **Explicitly Parallel Instruction Computing


    Intel Itanium 2 Processor

    Second generation of the Intel Itanium Architecture, using an enhanced micro-architecture


    Product Features

    Feature            Description
    Available speeds   1 GHz, 900 MHz
    Cache              Level 3: integrated 3 MB or 1.5 MB
                       Level 2: 256 KB
                       Level 1: 32 KB
    Features           Based on EPIC architecture
                       Enhanced Machine Check Architecture (MCA) with extensive Error Correcting Code (ECC)
                       Operating system support: HP-UX*, Linux*, Windows*
    Chipset            Intel E8870 chipset
                       OEM custom chipsets
    System bus         400 MHz, 128-bit wide
                       6.4 GB/s bandwidth
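The system bus figure can be checked with a little arithmetic. The sketch below assumes that the "400 MHz" in the feature list means 400 million transfers per second and that 1 GB is 10**9 bytes; the function name is illustrative only.

```python
def bus_bandwidth_gb_per_s(width_bits, transfers_per_s):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes) of a parallel bus."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * transfers_per_s / 1e9


# Itanium 2 system bus as listed: 128 bits wide, 400 million transfers/s (assumed).
itanium2_bus = bus_bandwidth_gb_per_s(128, 400e6)
print(f"{itanium2_bus:.1f} GB/s")
```

Under these assumptions, 16 bytes per transfer times 400 million transfers per second reproduces the quoted 6.4 GB/s.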


    Itanium 2 Block Diagram

    Schematic overview


    Itanium 2 Systems

    High-end Itanium 2-based systems: >2X more than Itanium!

    - Racksaver DP/1U: 1H 2003
    - Intel 4P/4U, 2P/2U: Q4 2002 / Q2 2003
    - Unisys 16P: Q4 2002
    - NEC 32P: shipping
    - SGI 64/512P: early 2003
    - IBM 4P/8P/16P: early 2003
    - HP DP/2U: shipping
    - HP 2P WS: shipping


    Initial Itanium 2 Application Areas

    Enterprise solutions deployed on Itanium 2-based systems focus on the following:

    - Applications for Business Intelligence
    - Mechanical Computer-Aided Engineering (MCAE)
    - Electronic Design Automation (EDA)
    - Compute-intensive custom applications
    - Enterprise Resource Planning (ERP)
    - Supply Chain Management (SCM)
    - High Performance Computing (HPC)
    - Large databases
    - Security transactions


    Itanium Application Areas

    - Large Memory Needs (>4 GB direct memory access)
    - Large SMP Systems
    - Complex high-end F.P. Apps
    - 64-bit Integer Applications
    - Customized Applications
    - Vector and Parallel Applications
    - Enterprise Unix* Needs


    Itanium 2 Processor Micro-Architecture Enhancements

    Itanium 2 processor builds on Itanium processor features:

    - Increased Clock Frequency
    - Shorter Pipeline
    - Expanded Functional Units
    - Faster Floating Point
    - Improved Cache
    - Greater addressability
    - Enhanced TLB and ALAT
    - Improved System Bus
    - Long Branch Instruction
    - Enhanced Thermal Management


    Performance Data

    Benchmark                                  Performance Number
    SPECint*_base2000                          810
    SPECfp*_base2000                           1356
    Stream TRIAD                               3700 MB/s
    Linpack-1000 (single processor)            3534 MFLOPS
    SAP 2-tier SD, 4-way server                600 SD users
    4-way server TPC-C transactions            80,495 tpmC at $4.83/tpmC
    SPECweb99*_SSL                             1520 simultaneous connections
    Linpack-HPC (32-way system)                101770 MFLOPS
    2-way server TPC-C transactions            40,621 tpmC at $5.72/tpmC
    Linpack-10K (4-way system)                 13940 MFLOPS
    32-way server TPC transactions             308,620 tpmC at $14.96/tpmC

    Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference http://www.intel.com/procs/perf/limits.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104.
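Two derived quantities help when reading the table: the system cost implied by a TPC-C price/performance figure (price per tpmC is total system cost divided by tpmC) and the per-processor share of a multi-processor Linpack result. The sketch below is only arithmetic on the numbers quoted above; the helper names are illustrative, and because the published $/tpmC values are rounded, the implied costs are approximate.

```python
def implied_system_cost(tpmc, dollars_per_tpmc):
    """Approximate total system cost implied by a TPC-C price/performance figure."""
    return tpmc * dollars_per_tpmc


def per_processor(total, processors):
    """Average share of a system-wide result per processor."""
    return total / processors


cost_4way = implied_system_cost(80_495, 4.83)        # 4-way TPC-C result
linpack_hpc_per_cpu = per_processor(101_770, 32)     # MFLOPS per processor, 32-way
linpack_10k_per_cpu = per_processor(13_940, 4)       # MFLOPS per processor, 4-way

print(f"4-way TPC-C implied cost: ${cost_4way:,.2f}")
print(f"Linpack-HPC per processor: {linpack_hpc_per_cpu:.0f} MFLOPS")
print(f"Linpack-10K per processor: {linpack_10k_per_cpu:.0f} MFLOPS")
```

Both multi-processor Linpack runs stay within roughly 10% of the 3534 MFLOPS single-processor Linpack-1000 figure per CPU, although the problem sizes differ, so this is an indication rather than a strict scaling measurement.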


    Itanium 2 Processor Record-Setting Performance

    Benchmark                                  Scale     Result        Record
    SAP (2 Tier) Sales and Distribution [1]    4-way     600 users     World record
    TPC-C Transaction Processing [2]           4-way     80.4K tpmC    World record
    Linpack High Performance Computing [3]     32-way    101 GFLOPS    World record
    TPC-C Transaction Processing [4]           32-way    308K tpmC     IA SMP record
    Stream Platform Bandwidth [5]              64-way    120 GB/sec    World record

    1 Source: Itanium 2 processor results measured on an HP Server rx5670 using 4 Itanium 2 processors at 1 GHz with integrated 3 MB L3 cache, 24 GB of memory, 528 GB disk space, HP-UX 11.23, SAP rev 4.6D, Oracle 9i V.2.

    2 Source www.tpc.org: Itanium 2 processor measurements done on an HP Server rx5670 using 4 Itanium 2 processors at 1 GHz with integrated 3 MB L3 cache, 48 GB memory, HP-UX 11.23, Oracle 9i V.2, at $4.83 per tpmC.

    3 Source: Itanium 2 processor measurements done on an NEC Server TX7/i9510 using 32 Itanium 2 processors at 1 GHz with integrated 3 MB L3 cache, 128 GB memory, Linux OS.

    4 Source: Itanium 2 processor measurements done on an NEC TX7/i9510 Server using 32 Itanium 2 processors with integrated 3 MB L3 cache, 256 GB memory, Windows .NET Server 2003, Datacenter Edition, Microsoft SQL Server 2000 Enterprise Edition (64-bit) beta version. Availability date 12/31/02.

    5 Source: Itanium 2 processor measurements done on an SGI Scalable Linux System using 64 Itanium 2 processors, 128 GB memory, Linux OS.


    Intel Itanium Processor Family

    Year   Clock      L3 cache         Chipset                       Process   Notes
    2001   800 MHz    4 MB L3          460GX chipset, OEM chipsets   180 nm
    2002   1 GHz      3 MB iL3         E8870 chipset, OEM chipsets   180 nm
    2003   1.5 GHz    6 MB iL3         E8870 chipset, OEM chipsets   130 nm    Madison**
    2004   >1.5 GHz   9 MB iL3         E8870 chipset, OEM chipsets   130 nm    Enhanced core
    2005   >1.5 GHz   larger L3        E8870 chipset, OEM chipsets   90 nm     Montecito**, enhanced dual-core

    The E8870-based generations share a common platform.

    **Codename. All dates specified are target dates, are provided for planning purposes only and are subject to change.


    A new Architecture for Business Computing

    [Figure: the Itanium architecture shown alongside RISC technology and CISC technology]

    - New architectural features: EPIC, predication, speculation
    - Enhanced floating-point performance
    - Massive resources
    - 64-bit instruction set, registers & addressing
    - Enhanced reliability features
    - IA-32
    - Enterprise-class OS


    64-Bit

    Is it new? Is it good or bad?

    IA-32 already has 64-bit and more:
    - 64-bit buses
    - 64-bit F.P. with 80-bit registers
    - 64-bit integer
    - 64/128-bit MMX/XMM registers
    - but only 32-bit address registers

    Itanium has 64-bit address HW
    - It is one of many features

    How fast can you transfer/store data, and how much of it?
    - 32-bit data items
    - 64-bit data items


    64-Bit Addressing

    32-bit addressing
    - 1 cm
    - one CD cover height

    64-bit addressing
    - 429496 km
    - distance between Earth and Moon
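The analogy rests on the fact that a 64-bit address space is 2**32 times larger than a 32-bit one. The sketch below scales a base length by that ratio; the function name is illustrative. Taken at face value, a 1 cm base gives about 42,950 km, while the 429,496 km on the slide corresponds to a base of 10 cm, which is closer both to the height of a CD cover and to the mean Earth-Moon distance of roughly 384,000 km. Which base length the original slide intended is an open question here.

```python
RATIO = 2**64 // 2**32  # a 64-bit address space is 2**32 times larger


def scaled_length_km(base_cm):
    """Length that stands for 64-bit addressing if base_cm stands for 32-bit."""
    return base_cm * RATIO / 100 / 1000  # cm -> m -> km


for base_cm in (1, 10):
    print(f"{base_cm} cm -> {scaled_length_km(base_cm):,.1f} km")
```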


    Itanium Processor Architecture: Selected Features

    - 64-bit Addressing
    - Flat Memory Model
    - Instruction Level Parallelism (6-way)
    - Large Register Files
    - Automatic Register Stack Engine
    - Predication
    - Software Pipelining Support
    - Register Rotation
    - Loop Control Hardware
    - Sophisticated Branch Architecture
    - Control & Data Speculation
    - Powerful 64-bit Integer Architecture
    - Advanced 82-bit Floating Point Architecture
    - Multimedia Support (MMX Technology)


    User Benefits

    More Capacity and Capability
    - Big in-memory data structures and DB
    - Large file system and data files
    - Efficient large integer calculations
    - Fast 64-bit F.P. calculations
    - Fast security processing
    - More and faster transactions
    - More services
    - Higher throughput
    - Improved availability and manageability


    Broad Industry Investment

    - OEMs: ~20 OEMs worldwide shipping Itanium-based systems today, with 2X growth in high-end systems (8-32P+) expected with the Itanium 2 processor. Names legible in the logo panel: Founder, Langchao.
    - OSVs: 7 operating system versions available today, from Windows* to HP-UX* and Linux, with more versions coming in 03/04. Names legible in the logo panel: OpenVMS, NonStop Kernel.
    - ISVs: more than 100 applications/tools available today, with 100s more in development for high-end enterprise and technical computing.

    Itanium Architecture has established broad industry investment, providing solution choice for high-end computing


    Linux* Supercomputer

    With 1,400 next-generation Intel Itanium Family processors that are code-named McKinley and Madison, the new HP supercomputer will have an expected total peak performance of more than 8.3 teraflops.

    April 16, 2002

    http://www.pnl.gov/news/2002/computer.htm


    Performance Scaling: Itanium 2 running Itanium processor binaries

    [Bar chart: performance scaling from Itanium 800 MHz/4 MB to Itanium 2 1 GHz/3 MB. Vertical axis from 1.00 to 2.25. Workloads: Linpack 1000, Security1, Linpack 10000 (4P), SpecInt2000, SpecFp2000, CAE, ERP, Security2, SpecJBB2000, IMDB, GAMESS. Source: Intel Labs]

    Itanium 2 delivers an average of 1.5-2X performance improvement


    Itanium Processor Family Value Proposition
    Intel Itanium 2 Processor / Intel E8870 Platform Advancements

    Performance
    - Up to ~1.5-2X performance increase over the Itanium processor
    - 3X increase in FSB bandwidth
    - 2X improvement in cache latencies

    Scalability
    - E8870 chipset scalability port for 8P+ systems
    - Cache line size increased to 128 from 64
    - Support for larger page size (4 GB), addressing (1024 TB)

    Availability
    - Hot-plug processor boards, memory, I/O
    - Fail-over redundancy
    - Extensive error detection, correction and logging

    Investment Protection
    - Platform compatible with future Itanium processors
    - Compatible with Itanium-based OS/software
    - Common set of S/W tools for the Itanium processor family

    Choice
    - Major OEMs worldwide shipping Itanium-based systems
    - Support from a broad list of leading OSVs
    - S/W application and platform reach expands over time


    Intel Itanium Architecture


    Fundamental Architecture Challenges

    - Sequentiality inherent in traditional architectures
    - Complex hardware needed to (re)extract ILP
    - Limited ILP available within basic blocks
    - Branches make extracting ILP difficult
    - Memory dependencies further limit ILP
    - Increasing latency exacerbates ILP need
    - Limited resources: a fundamental constraint
    - Shared resources create more overhead
    - Loop ILP extraction costs code size
    - And the challenges continue ...

    Itanium architecture overcomes these fundamental challenges!


    Itanium Architecture Performance Features

    - Parallelism: inherent in Itanium's EPIC architecture
    - Frees up hardware for parallel execution
    - Predication reduces branches, enhances ILP
    - Control speculation breaks the branch barrier, enhances ILP
    - Data speculation breaks data dependence, increases ILP
    - Control and data speculation address memory latency
    - Itanium architecture has abundant register & memory resources
    - Stack/RSE reduces call overhead and management
    - Loop support yields performance w/o overhead
    - And the performance features continue ...

    Itanium Architecture: Beyond RISC


    Itanium Processor Block Diagram

    (schematic overview)


    Itanium Architecture: Explicitly Parallel

    128-bit bundle layout:

    Instruction 2 (41 bits) | Instruction 1 (41 bits) | Instruction 0 (41 bits) | Template (5 bits)

    Example: an MMI template holds Memory (M), Memory (M), Integer (I).
    Slot types: M = Memory, F = Floating-point, I = Integer, L = Long Immediate, B = Branch

    Basis for increased parallelism

    - Template specifies instruction types: MFI, MMI, MII, MLX, MIB, MMF, MFB, MMB, MBB, BBB
    - Stops specify group breaks (dependencies): intra-bundle (M;;MI or MI;;I) and inter-bundle stop
    - Most common template combinations covered
    - Headroom for additional templates
    - Simplifies hardware requirements
    - Scales compatibly to future generations
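The bundle layout can be illustrated with a few lines of bit manipulation. This is a minimal sketch, assuming the template occupies the low-order five bits with instruction slots 0 to 2 above it; the function names and the template and slot values are hypothetical placeholders, not real instruction encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & TEMPLATE_MASK
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    )
    return template, slots


# Placeholder values, chosen only to exercise the round trip.
example = pack_bundle(0x08, (0x111, 0x222, SLOT_MASK))
print(f"{example:032x}", unpack_bundle(example))
```

The sizes add up exactly: 5 + 3 x 41 = 128 bits, which is why every bundle is a fixed 16 bytes regardless of which template it uses.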


    EPIC (Explicitly Parallel Instruction Computing)

    [Figure: Source Code -> Compiler -> Instructions -> Instruction Bundles (3 instructions each, 128 bits wide) -> Instruction Groups (series of bundles)]

    Up to 6 instructions executed per clock

    Michael S. Schlansker, B. Ramakrishna Rau: EPIC: Explicitly Parallel Instruction Computing; IEEE Computer, February 2000, pp. 37-45


    Breakthrough Parallelism

    MFI + MFI bundles
    - M slots: load 4 DP (8 SP) operands via 2 ld-pair, plus 2 ALU ops (post-increment)
    - F slots: 4 DP FLOPs (8 SP FLOPs)
    - I slots: 2 ALU ops
    - 6 instructions provide 12 parallel ops/clock (SP: 20 parallel ops/clock) for digital content creation & scientific computing

    MIB + MIB bundles
    - M slots: 2 loads plus 2 ALU ops (post-increment)
    - I slots: 2 ALU ops
    - B slots: 1 branch hint plus 1 branch instruction
    - 6 instructions provide 8 parallel ops/clock for enterprise & Internet applications

    Itanium processor delivers greater ILP than any contemporary processor
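The ops/clock totals follow from adding up what each slot contributes. The tally below simply restates the slide's own counts; the dictionary keys are descriptive labels, not architectural terms.

```python
# Operations contributed per clock by two bundles issued together.
MFI_MFI_DP = {"loaded operands (2 ld-pair)": 4, "post-increment ALU ops": 2,
              "FLOPs": 4, "integer ALU ops": 2}
MFI_MFI_SP = {"loaded operands (2 ld-pair)": 8, "post-increment ALU ops": 2,
              "FLOPs": 8, "integer ALU ops": 2}
MIB_MIB = {"loads": 2, "post-increment ALU ops": 2,
           "integer ALU ops": 2, "branch hint + branch": 2}

totals = {
    "MFI+MFI (DP)": sum(MFI_MFI_DP.values()),
    "MFI+MFI (SP)": sum(MFI_MFI_SP.values()),
    "MIB+MIB": sum(MIB_MIB.values()),
}
for name, ops in totals.items():
    print(f"{name}: {ops} parallel ops/clock from 6 instructions")
```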


    Floating-Point Architecture

    Floating-point: high performance and high precision

    - Fused Multiply Add operation: an efficient core computation unit
    - Abundant register resources: 128 registers (32 static, 96 rotating)
    - High-precision data computations: 82-bit unified internal format for all data types
    - Software divide/square-root: high throughput achieved via pipelining
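One reason a fused multiply-add matters for precision is that a*b + c is rounded once instead of twice. The sketch below models this in ordinary double precision, using exact rational arithmetic to stand in for the fused operation; it does not model the 82-bit register format, and the function names are illustrative.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """a*b + c computed exactly, then rounded once to double precision."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def unfused_multiply_add(a, b, c):
    """a*b rounded to double, then the sum rounded again."""
    return a * b + c


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print("unfused:", unfused_multiply_add(a, b, c))
print("fused:  ", fused_multiply_add(a, b, c))
```

The exact product is 1 - 2**-60, which rounds to 1.0 in double precision, so the unfused version loses the whole result. The single-rounding version keeps -2**-60. Software divide and square-root sequences rely on this kind of exact residual.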


    Floating Point Features

    - Native 82-bit hardware provides support for multiple numeric models
    - 2 extended-precision pipelined FMACs deliver 4 EP/DP FLOPs/cycle
    - Performance for security, efficient use of hardware: integer mul-add, s/w divide
    - Balanced with plenty of operand bandwidth from registers / memory

    [Figure: floating-point data paths. Labels: 128-entry 82-bit register file (RF); 6 x 82-bit operands; 2 x 82-bit results; L2 cache; 4 MB L3 cache; 4 DP ops/clk (2 x Fld-pair, even/odd); 2 DP ops/clk; 2 stores/clk]
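Peak floating-point throughput follows directly from the FMAC count: each fused multiply-add counts as two FLOPs. The sketch below applies that to the 800 MHz Itanium clock quoted elsewhere in this deck. The efficiency figure additionally assumes that the 1 GHz Itanium 2 has the same two FMAC units, which these slides do not state explicitly.

```python
def peak_gflops(clock_ghz, fmac_units=2, flops_per_fma=2):
    """Peak DP GFLOPS: clock x FMAC units x 2 FLOPs per fused multiply-add."""
    return clock_ghz * fmac_units * flops_per_fma


itanium_peak = peak_gflops(0.8)       # 800 MHz Itanium
itanium2_peak = peak_gflops(1.0)      # 1 GHz Itanium 2, assuming 2 FMACs

# Linpack-1000 single-processor result from the Performance Data slide.
linpack_1000_efficiency = 3534 / (itanium2_peak * 1000)

print(f"Itanium 800 MHz peak: {itanium_peak:.1f} GFLOPS")
print(f"Itanium 2 1 GHz peak: {itanium2_peak:.1f} GFLOPS")
print(f"Linpack-1000 efficiency: {linpack_1000_efficiency:.1%}")
```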


    Itanium Processor Pipeline

    Parallel, deep, and dynamic pipeline designed for maximum throughput

    6-wide EPIC hardware under compiler control
    - Parallel hardware and control for predication & speculation
    - Efficient mechanism for enabling register stacking & rotation
    - Software-enhanced branch prediction

    10-stage in-order pipeline designed for:
    - Single-cycle ALU (4 ALUs globally bypassed)
    - Low latency from data cache

    Dynamic support for run-time optimization
    - Decoupled front end with prefetch to hide fetch latency
    - Aggressive branch prediction to reduce branch penalty
    - Non-blocking caches, register scoreboard to hide load latency


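    Predication: Control Flow to Data Flow

    [Figure: an if/then/else in a traditional architecture (cmp, br, then block, br, else block) versus the Itanium architecture, where cmp writes predicates p1 and p2, one block executes under p1 and the other under p2, and no branches remain]

    Removes/reduces branches and enables parallel execution

    - 64 predicate registers
    - Can be combined with logical ops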
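The idea of turning control flow into data flow can be mimicked in a few lines. In this sketch, which is a software model and not Itanium code, a single compare produces a predicate and its complement, both arms are evaluated unconditionally, and a hypothetical `commit` helper stands in for the hardware writing a result only when its qualifying predicate is true.

```python
def branchy_abs_diff(a, b):
    """Traditional form: the branch decides which arm runs."""
    if a > b:
        return a - b
    return b - a


def commit(predicate, old_value, new_value):
    """Model of a predicated write: the result lands only if the predicate is true."""
    return new_value if predicate else old_value


def predicated_abs_diff(a, b):
    """If-converted form: one compare, two predicates, both arms issued."""
    p1 = a > b          # cmp writes p1 ...
    p2 = not p1         # ... and its complement p2
    then_result = a - b  # issued regardless of the outcome
    else_result = b - a  # issued regardless of the outcome
    r = None
    r = commit(p1, r, then_result)   # (p1) r = a - b
    r = commit(p2, r, else_result)   # (p2) r = b - a
    return r


for a, b in [(7, 3), (3, 7), (5, 5)]:
    print(a, b, branchy_abs_diff(a, b), predicated_abs_diff(a, b))
```

Because the two arms no longer depend on a branch outcome, they can be scheduled in the same instruction group, at the cost of executing work whose result is discarded.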


    Software Pipelining Support

    Loop support: ILP+++, Overhead---

    High-performance loops without code size overhead
    - No prologue/epilogue
    - Register rotation (rrb)
    - Predication
    - Loop control registers (LC, EC)
    - Loop branches (br.ctop, br.wtop)

    Especially valuable for integer loops with small trip counts

    Whole loop computation in parallel


    Software Pipelining (cont.)

    Traditional architectures use loop unrolling
    - Results in code expansion and increased cache misses

    Itanium processor software pipelining uses rotating registers
    - Allows overlapping execution of multiple loop instances
    - Predication controls the pipeline stages

    [Figure: a sequential loop versus a software-pipelined loop over time; each iteration consists of load, compute, store]
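The overlap shown in the figure can be modelled with a toy three-stage loop. In this sketch (a scheduling model in Python, not generated Itanium code), each kernel cycle runs the store of iteration c-2, the compute of iteration c-1 and the load of iteration c, and a range check plays the role of the stage predicates that switch stages off during fill and drain. The one-cycle-per-stage cost is an assumption for illustration.

```python
def sequential_cycles(n, stages=3):
    """Cycles when every iteration finishes before the next one starts."""
    return n * stages


def software_pipelined(src, fn):
    """Run load/compute/store as an overlapped 3-stage pipeline over src."""
    n = len(src)
    dst = [None] * n
    loaded, computed = {}, {}
    cycles = 0
    for cycle in range(n + 2):  # n kernel cycles plus 2 to drain the pipeline
        store_it, compute_it, load_it = cycle - 2, cycle - 1, cycle
        # Each stage is guarded by its own "predicate": is that iteration live?
        if 0 <= store_it < n:
            dst[store_it] = computed.pop(store_it)
        if 0 <= compute_it < n:
            computed[compute_it] = fn(loaded.pop(compute_it))
        if 0 <= load_it < n:
            loaded[load_it] = src[load_it]
        cycles += 1
    return dst, cycles


result, cycles = software_pipelined([1, 2, 3, 4], lambda x: x * 10)
print(result, cycles, "cycles vs", sequential_cycles(4), "sequential")
```

The same kernel body serves the fill, steady-state and drain phases, which is the sense in which no separate prologue or epilogue code is needed.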


    Register Rotation

    - GR32-127 and FR32-127 can rotate (specified range)
    - Separate rotating register base for each set (GR, FR)
    - Loop branches decrement all rotating register bases (RRB)
    - Instructions contain a virtual register number:
      physical register # = RRB + virtual register #
    - Predicate register range also rotates

    [Figure: iterations i=0 to i=7; the same physical register appears under a different virtual number in each iteration]
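The renaming formula can be tried out with a small model. This sketch assumes the whole GR32-GR127 range rotates (96 registers) and adds the wrap-around within that range explicitly, which the one-line formula leaves implicit; the function and variable names are illustrative.

```python
ROT_BASE = 32   # first rotating register
ROT_SIZE = 96   # GR32-GR127, assuming the whole range rotates


def physical(virtual, rrb):
    """Map a virtual register number to a physical one for a given RRB."""
    if virtual < ROT_BASE:
        return virtual  # static registers are not renamed
    return ROT_BASE + (virtual - ROT_BASE + rrb) % ROT_SIZE


rrb = 0
registers = {}
seen_as_r33 = []
for i in range(4):
    registers[physical(32, rrb)] = f"value from iteration {i}"  # write virtual r32
    if i > 0:
        # What iteration i-1 wrote to r32 is now visible under the name r33.
        seen_as_r33.append(registers[physical(33, rrb)])
    rrb -= 1  # the loop branch decrements the rotating register base

print(seen_as_r33)
```

This renaming is what lets a software-pipelined kernel pass a value from one stage to the next without copy instructions: the producer always writes r32 and the consumer always reads r33.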


    Control & Data Speculation

    - Control speculation moves loads above branches / calls
    - Data speculation moves loads above possibly conflicting stores

    [Figure: in each case the sequence instr. 1, instr. 2, barrier (branch or st[?]), ld r1=, use = r1 is reordered so that ld r1= moves above the barrier]

    Speculation reduces the impact of memory latency


    Control Speculation

    - Control speculation moves loads above branches
    - A detected exception is indicated using the NaT bit / NaTVal
    - The check raises detected exceptions
    - Branch barrier broken to minimize memory latency

    Traditional Arch.:  instr. 1; instr. 2; branch (barrier); ld r1=; use = r1
    Itanium:            ld.s r1= (detect exception); instr. 1; instr. 2; branch; chk.s r1 (deliver exception); use = r1

    The exception is propagated from ld.s to chk.s.


    Hoisting Uses

    Traditional Arch.:  instr. 1; instr. 2; branch (barrier); ld r1=; use = r1
    Itanium:            ld.s r1=; use = r1 (speculative use); instr. 1; instr. 2; branch; chk.s r1
    Recovery code (entered from chk.s on failure):  ld r1=; use = r1; branch

    - All computation instructions propagate NaTs, which reduces the number of checks and allows a single check on the results
    - Compares also propagate NaTs when writing predicates
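The detect, propagate and deliver sequence can be modelled in software. In this sketch a sentinel object plays the role of the NaT bit, a dictionary lookup failure plays the role of a faulting load, and the helper names (`ld_s`, `add`, `chk_s`, `hoisted`) are hypothetical stand-ins rather than real instructions.

```python
NAT = object()  # stands in for the NaT ("Not a Thing") marker


def ld(memory, addr):
    """Non-speculative load: raises KeyError immediately on a bad address."""
    return memory[addr]


def ld_s(memory, addr):
    """Speculative load: a fault is deferred by returning NaT instead of raising."""
    return memory.get(addr, NAT)


def add(x, y):
    """Computation propagates NaT from either operand."""
    return NAT if x is NAT or y is NAT else x + y


def chk_s(value, recovery):
    """Check: run the recovery code if the value is NaT, otherwise keep it."""
    return recovery() if value is NAT else value


def hoisted(memory, addr, take_branch):
    r1 = ld_s(memory, addr)   # load hoisted above the branch
    r2 = add(r1, 1)           # its use is hoisted as well
    if take_branch:
        return None           # this path never needed the load; any NaT is dropped
    # Recovery re-executes the load and its use non-speculatively.
    return chk_s(r2, lambda: add(ld(memory, addr), 1))


print(hoisted({8: 41}, 8, take_branch=False))  # speculation succeeded
print(hoisted({}, 8, take_branch=True))        # fault deferred and never delivered
```

Only the path that actually consumes the value pays for the fault: with an empty memory and the branch not taken, the recovery code repeats the load and the exception surfaces there.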


    Data Speculation

    - Data speculation moves loads above possibly conflicting stores
    - Keeps track of load addresses used in advance (ALAT)
    - Advanced-loaded data can be used speculatively

    Traditional Arch.:  instr. 1; instr. 2; st[?] (barrier); ld r1=; use = r1
    Itanium:            ld.a r1=; instr. 1; instr. 2; st[?]; ld.c r1; use = r1


    Advanced Load Address Table: ALAT

    - ld.a inserts entries
    - Conflicting stores remove entries (also ld.c.clr, chk.a.clr)
    - Presence of an entry indicates success
    - chk.a branches when no entry is found

    [Figure: the ALAT as a table of (reg#, addr) entries; ld.a reg# = ... inserts an entry, st[addr] is compared against the addr column, and chk.a reg# looks up the reg# column]
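The table's behaviour can be captured in a small model. This sketch simplifies the hardware in ways that are assumptions of the example: entries are keyed by register name, a store conflicts only on an exact address match (real hardware compares address ranges), and capacity limits are ignored. The class and method names are illustrative.

```python
class ALAT:
    """Toy Advanced Load Address Table: register name -> advanced-load address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, memory, reg, addr):
        """Advanced load: read memory and record (reg, addr) in the table."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """Store: write memory and remove every entry with a conflicting address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def chk_a(self, reg):
        """Check: True if the entry survived, i.e. the speculation succeeded."""
        return reg in self.entries


def load_hoisted_above_store(memory, load_addr, store_addr, value):
    alat = ALAT()
    r1 = alat.ld_a(memory, "r1", load_addr)   # load moved above the store
    alat.store(memory, store_addr, value)     # the possibly conflicting store
    if not alat.chk_a("r1"):
        r1 = memory[load_addr]                # recovery: redo the load
    return r1


print(load_hoisted_above_store({0: 1, 8: 2}, 0, 8, 99))  # no conflict
print(load_hoisted_above_store({0: 1, 8: 2}, 0, 0, 99))  # conflict, reloaded
```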


    Hoisting Uses

    Traditional Arch.:  instr. 1; instr. 2; st[?] (barrier); ld r1=; use = r1
    Itanium:            ld.a r1=; use = r1 (speculative use); instr. 1; instr. 2; st[?]; chk.a r1
    Recovery code (entered from chk.a on failure):  ld r1=; use = r1; branch

    Data and control speculation can be combined


    Intel Itanium 2 Processor Architecture


    Intel Itanium 2 Processor

    - Codename McKinley
    - Target for 2H2002
    - Enhanced Itanium design
    - 100% Itanium binary compatible
    - 1.0 GHz clock rate
    - 6 integer units
    - 256 KB L2 cache
    - 1.5 MB or 3 MB iL3 cache
    - 6.4 GB/s system bus
    - 1.5-2x performance increase over Itanium-based systems


    Itanium 2 Optimizations

    Improved dynamic properties
    - Production frequency is 1 GHz
    - Reduced L1, L2, L3 latencies
    - L3 cache has been incorporated on die
    - Improved L2 cache capacity
    - Improved FSB bandwidth
    - Lower branch prediction penalties

    Itanium 2 provides significant speed-ups on existing Itanium processor binaries


    Itanium 2 Optimizations

    Reduced execution paths
    - More parallelism/resources
        - More integer, multi-media units and memory ports
    - Short latencies
        - Fully bypassed functional units
        - Very low L1D/L2/L3 cache latencies
        - Low latency FP execution
    - Many more ways to issue/execute 6 insts/clk

    Itanium 2 provides performance headroom for re-optimized binaries


                      Itanium Processor                Itanium 2 Processor
    Core              800 MHz                          1 GHz
    System Bus        64 bits wide                     128 bits wide
                      133MHz/266 MT/s                  200MHz/400 MT/s
                      2.1 GB/s                         6.4 GB/s
    Width             2 bundles per clock              2 bundles per clock
                      4 integer units                  6 integer units
                      2 load or stores per clock       2 loads and 2 stores per clock
                      9 issue ports                    11 issue ports
    Caches            L1 - 2X16KB - 2 clock latency    L1 - 2X16KB - 1 clock latency
                      L2 - 96K - 9 clock latency       L2 - 256K - 5 clock latency
                      L3 - 4MB external - 21 clk       L3 - 3MB - 12 clk
    L3 Cache          12.8 GB/s bandwidth (BSB)        32 GB/s bandwidth
    Addressing        44 bit physical addressing       50 bit physical addressing
                      50 bit virtual addressing        64 bit virtual addressing
                      Maximum page size of 256MB       Maximum page size of 4GB

    Improvement factors called out on the slide: 2X, 3X, 1.5X, 2X
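As a quick check of the system bus figures above, peak bandwidth is simply bus width times transfer rate. The helper below is a hypothetical illustration, not something from the deck.

```python
def peak_bandwidth_gb_per_s(width_bits, megatransfers_per_s):
    """Peak bus bandwidth in GB/s = bytes per transfer * transfers per second."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * megatransfers_per_s / 1000.0


itanium_bus = peak_bandwidth_gb_per_s(64, 266)     # 8 B * 266 MT/s, about 2.1 GB/s
itanium2_bus = peak_bandwidth_gb_per_s(128, 400)   # 16 B * 400 MT/s = 6.4 GB/s
speedup = itanium2_bus / itanium_bus               # roughly 3X
```

Doubling the width and raising the transfer rate from 266 to 400 MT/s together give the roughly threefold bus bandwidth increase quoted for Itanium 2.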


    Itanium 2 Processor Block Diagram

    (schematic overview)


    Architectural Changes

    Beneficial to compilers
    - Improved data/control speculation support
        - ALAT - fully associative = minimize thrashing
        - Processor directly vectors to recovery code for reduced speculation costs
    - 64-bit Long Branch Instruction

    Beneficial to OS and System designs
    - Full 64-bit virtual addressing
    - Full 2**24 virtual address spaces
    - 4GB virtual pages = reduced TLB pressure
    - 50-bit Physical addressing = very large memory/IO spaces

    More flexibility for compiler, OS and system designs


    Memory Cache Hierarchy

                          Itanium 2 Processor (1GHz)              Itanium Processor (800MHz)
    L1D                   16KB, 64B CL, 1 CLK                     16KB, 32B CL, 2 CLK
    L1I                   16KB, 64B CL, 1 CLK                     16KB, 32B CL, 2 CLK
    L2-Cache              256KB, 128B CL, 8-way, 5-7 CLKS         96KB, 64B CL, 6-way, 6-9 CLKS
    L3-Cache              1.5/3MB, 128B CL, 12-way, 12-15 CLKS    2/4MB, 64B CL, 4-way, 20 CLKS
    Memory (Controller)   6.4 GB/s                                2.1 GB/s

    [Figure: block diagram of both hierarchies. Link bandwidth labels: 25.6 GB/s, 25.6 GB/s, 32 GB/s, 32 GB/s, 32 GB/s, 12.8 GB/s. Memory latency label: 210 CLKS.]


    Itanium 2 Cache Hierarchy

    3 level caching on Itanium 2 processor
    - 1st level cache optimized for latency
    - 2nd level cache optimized for bandwidth
    - 3rd level cache optimized for size


    Large Register Set

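    - Integer Registers: GR0-GR127, 64 bits (63..0) plus NaT bit
        - GR0-GR31: 32 static (GR0 = 0)
        - GR32-GR127: 96 framed, rotating
    - F.P. Registers: FR0-FR127, 82 bits (81..0)
        - FR0-FR31: 32 static (FR0 = +0.0, FR1 = +1.0)
        - FR32-FR127: 96 rotating
    - Predicate Registers: PR0-PR63, 1 bit
        - PR0-PR15: 16 static (PR0 = 1)
        - PR16-PR63: 48 rotating
    - Branch Registers: BR0-BR7, 64 bits (63..0)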


    Functional Units

    [Figure: functional units (Integer, F.P., Multimedia, Load/Store, Branch) compared for Itanium and Itanium 2. Issue ports/units shown: 2x F.P. MAC, 2x ALU/INT/MM, 4x ALU/MM/MEM, 3x BRANCH.]


    Itanium 2 Dispersal Matrix

    [Figure: matrix of first-bundle versus second-bundle templates (MII, MLI, MMI, MFI, MMF, MIB*, MBB, BBB, MMB*, MFB*). Legend: possible Itanium 2 full issue; possible Itanium processor and Itanium 2 full issue; * hint in first bundle.]

    Itanium 2 allows more compiler dispersal options


    A simple Example

    ..
    double precision, dimension(10000) :: a,b,c,d
    do i=1,10000
      a(i)=a(i)*b(i)+c(i)*d(i)
    enddo
    ..

    DAXPY-like loop over floating-point vectors can be optimized differently for Itanium and Itanium 2


    Itanium vs. Itanium 2 Assembly Code

    3 clockticks on Itanium

    .b1_2:
    { .mmf
      (p16) ldfd f37=[r8],8
      (p16) ldfd f45=[r3],8
      (p19) fma.d f52=f40,f48,f0 ;;
    }
    { .mmi
      (p16) ldfd f32=[r33]
      (p16) ldfd f40=[r2],8
            nop.i 0 ;;
    }
    { .mfi
      (p23) stfd [r40]=f51
      (p20) fma.d f48=f36,f44,f53
            nop.i 0
    }
    { .mib
      (p16) add r32=8,r33
            nop.i 0
            br.ctop.sptk .b1_2 ;;
    }

    2 clockticks on Itanium 2 !

    .b1_2:
    { .mfi
      (p16) ldfd f43=[r8],8
      (p19) fma.d f51=f46,f50,f0
            nop.i 0
    }
    { .mmf
      (p16) ldfd f47=[r3],8
      (p23) stfd [r32]=f56
      (p21) fma.d f54=f37,f42,f53 ;;
    }
    { .mii
      (p16) ldfd f32=[r33]
            nop.i 0
            nop.i 0
    }
    { .mmb
      (p16) ldfd f37=[r2],8
      (p16) add r32=8,r33
            br.ctop.sptk .b1_2 ;;
    }
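One plausible reading of the 3 versus 2 clockticks is a simple resource bound: each iteration issues 4 loads, 1 store and 2 fma operations, and the two processors differ in memory ports. The sketch below is an assumption-laden estimate (hypothetical names, bundle template and latency constraints ignored), not the compiler's actual scheduling algorithm.

```python
import math

# Per-clock resources taken from the comparison slides in this deck.
ITANIUM = {"shared_memory_ports": True, "memory_ports": 2, "fmacs": 2}
ITANIUM2 = {"shared_memory_ports": False, "load_ports": 2, "store_ports": 2, "fmacs": 2}


def min_cycles_per_iteration(loads, stores, fmas, machine):
    """Lower bound on cycles per loop iteration from port counts alone."""
    if machine["shared_memory_ports"]:
        # "2 load or stores per clock": loads and stores compete for the same ports.
        mem = math.ceil((loads + stores) / machine["memory_ports"])
    else:
        # "2 loads and 2 stores per clock": separate load and store ports.
        mem = max(math.ceil(loads / machine["load_ports"]),
                  math.ceil(stores / machine["store_ports"]))
    fp = math.ceil(fmas / machine["fmacs"])
    return max(mem, fp)


# a(i) = a(i)*b(i) + c(i)*d(i): 4 loads, 1 store, 2 fma per iteration
itanium_cycles = min_cycles_per_iteration(4, 1, 2, ITANIUM)
itanium2_cycles = min_cycles_per_iteration(4, 1, 2, ITANIUM2)
```

Under these assumptions the bound is 3 cycles on Itanium and 2 on Itanium 2, which agrees with the clocktick counts on the slide.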


    Itanium Processor
    - 800 MHz core
    - System bus: 2.1 GB/s, 64 bits wide, 266 MHz
    - 4 MB L3 on board, 96k L2, 32k L1 on-die
    - 328 on-board registers
    - 6 instructions / cycle
    - 10 pipeline stages
    - 9 issue ports
    - 4 Integer, 3 Branch; 2 FP, 2 SIMD; 2 Load or 2 Store

    Itanium 2 Processor
    - 1 GHz core
    - System bus: 6.4 GB/s, 128 bits wide, 400 MHz
    - 3 MB L3, 256k L2, 32k L1 all on-die
    - 328 on-board registers
    - 6 instructions / cycle
    - 8 pipeline stages
    - 11 issue ports
    - 6 Integer, 3 Branch; 2 FP, 1 SIMD; 2 Load & 2 Store
    - 221 million transistors total, 25 million in CPU core

    Itanium 2 improvements called out on the slide
    - Large on-die cache, reduced latency
    - Increased core frequency
    - Additional execution units
    - Additional issue ports
    - 3X increase in system bus bandwidth

    McKinley delivers performance through:
    - Bandwidth and cache improvements
    - Micro-architecture enhancements
    - Increased frequency


    Architectural Changes

    Beneficial to compilers
    - Improved data/control speculation support
        - ALAT - fully associative = minimize thrashing
        - Processor directly vectors to recovery code for reduced speculation costs
    - 64-bit Long Branch Instruction

    Beneficial to OS and System designs
    - Full 64-bit virtual addressing
    - Full 2**24 virtual address spaces
    - 4GB virtual pages = reduced TLB pressure
    - 50-bit Physical addressing = very large memory/IO spaces

    Changes provide more flexibility to compiler, OS and system designs


    Itanium 2 Pipelines

    IPG       IP Generate, L1I Cache (6 inst) and TLB access
    ROT       Instruction Rotate and Buffer (6 inst)
    EXP       Expand, Port Assignment and Routing
    REN       Integer and FP Register Rename (6 inst)
    REG       Integer and FP Register File read (6)
    EXE       ALU Execute (6), L1D Cache and TLB access + L2 Cache Tag Access (4)
    DET       Exception Detect, Branch Correction
    WB        Writeback, Integer Register update
    FP1-WB    FP FMAC pipeline (2) + reg write
    L2N-L2I   L2 Queue Nominate/Issue (4)
    L2A-W     L2 Access, Rotate, Correct, Write (4)

    - Short 8-stage in-order main pipeline
        - In-order issue, out-of-order completion
        - Reduced branch misprediction penalties
        - Fully interlocked, no way-prediction or flush/replay mechanism
    - Pipelines are designed for very low latency

    Core: IPG ROT EXP REN REG EXE DET WB
    FPU:  FP1 FP2 FP3 FP4 WB
    L2:   L2N L2I L2A L2M L2D L2C L2W


    Itanium 2 Issue Ports

    Issue ports
    - 4 Mem/ALU/Multi-Media
    - 2 Integer/ALU/Multi-Media
    - 2 FMAC
    - 3 branch

    4 memory ports
    - Integer: allow 2 load AND 2 store per clk
    - FP: 2 FP load pairs AND 2 store per clk to feed 2 FMACs

    [Figure: Itanium 2. The L1 instruction cache delivers two instruction bundles to six arithmetic logic units (ALU/MEM 1-4, ALU/INT 1-2); the L1 data cache has two load ports and two store ports (1 cycle latency).]

    Substantial performance headroom for FP and integer kernels


    Itanium 2 Unit Latencies

                                   Consuming Class Instruction
    Producing Class Instruction    Integer   Multimedia   Load Address   Store Data
    Mem/integer ports ALU          1         2            1              1
    Integer only ports ALU         1         2            1              1
    Multimedia                     3         2            3              3
    Integer Loads (L1D hit)        1         2            2              1

    Short latencies and full bypasses improve performance for re-optimized code


    Floating Point Latencies

    Operation                  Itanium 2 Latency
    FP Load (L2 Cache hit)     6
    FMAC                       4
    FMISC                      4
    FP -> INT (getf)           5
    INT -> FP (setf)           6

    Short latencies = performance upside for re-optimized FP code


    Floating Point Architecture

    DIV and SQRT are done in software to enable
    - better ILP
    - full pipelining
    - higher throughput
    - more flexibility
    - support for full IEEE 754 compliance
    - versions optimized for latency and throughput
    - also available for SIMD F.P. operations

    Source: Intel Technology Journal Q4, 1999
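To illustrate how division can be built from multiply-add operations, here is a generic Newton-Raphson reciprocal refinement. It is a sketch only: the seed formula, the step count and the function names are assumptions made for this example and are not the sequences Intel ships. On Itanium the seed would come from a hardware reciprocal approximation and each update would map to one fused multiply-add.

```python
import math


def reciprocal_seed(b):
    """Crude 1/b estimate (relative error up to about 6%) from a linear fit on the mantissa."""
    m, e = math.frexp(b)                      # b = m * 2**e with 0.5 <= |m| < 1
    y = 48.0 / 17.0 - 32.0 / 17.0 * abs(m)    # approximates 1/|m|
    return math.copysign(math.ldexp(y, -e), b)


def software_divide(a, b, steps=4):
    """Compute a / b with multiplies and adds only (plus the seed)."""
    if b == 0.0:
        raise ZeroDivisionError("division by zero")
    y = reciprocal_seed(b)
    for _ in range(steps):
        err = 1.0 - b * y      # residual of the current reciprocal estimate
        y = y + y * err        # Newton-Raphson update: error is squared each step
    q = a * y                  # first quotient estimate
    r = a - b * q              # remainder
    return q + r * y           # final correction step


quotient = software_divide(1.0, 3.0)
```

Because every step is an independent multiply-add chain, several such divisions can be interleaved in the pipeline, which is the throughput argument made on the slide.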


    Floating-Point DIV Throughput Optimized


    Floating-Point SQRT Throughput Optimized


    Integer DIV


    Itanium 2 Branch Prediction

    Zero clock branch prediction
    - 2 level branch prediction hierarchy
        - L1IBR - Level 1 Branch Cache
            - Part of the L1 I-cache
            - 1K trigger predictions + 0.5K target addresses
        - L2B - Level 2 Branch Cache (12K histories)
        - PHT - Pattern History Table (16K counters)

    Reduced prediction penalties
    - IP-relative branch w/correct prediction - 0 cycle
    - IP-relative branch w/wrong target - 1 cycle
    - Return branch w/correct prediction - 1 cycle
    - Last branch in counted loop prediction - 0 cycle
    - Branch Misprediction - 6 cycle

    Reduced branch penalties speed up existing code
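A back-of-the-envelope way to see what the 6-cycle misprediction penalty means for existing code: stall cycles scale with the number of branches and the misprediction rate. The branch count and the 5% rate below are made-up inputs for illustration, not measurements.

```python
MISPREDICT_PENALTY_CYCLES = 6   # "Branch Misprediction - 6 cycle" from the slide


def branch_stall_cycles(branches, mispredict_rate, penalty=MISPREDICT_PENALTY_CYCLES):
    """Expected stall cycles spent recovering from mispredicted branches."""
    return branches * mispredict_rate * penalty


# Hypothetical workload: one million branches, 5% mispredicted.
stalls = branch_stall_cycles(branches=1_000_000, mispredict_rate=0.05)
```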


    Instruction Prefetching

    Streaming prefetching
    - Initiated by br.many (hint on branch inst)
    - CPU prefetches ahead of the sequential execution stream
    - Streaming prefetch is cancelled by:
        - a predicted-taken branch in the front-end
        - a branch misprediction on the back-end
        - software cancelling the prefetch with a brp instruction

    Branch Prefetching Hints
    - Initiated by brp.few, brp.many or mov_to_br
    - One time prefetch for the target
    - Two hint prefetches can be initiated per cycle

    Software initiated instruction prefetching improves performance by lowering instruction fetch penalties


    Itanium 2 Caches

                             L1I           L1D           L2               L3
    Size                     16K           16K           256K             3M on die
    Line Size                64B           64B           128B             128B
    Ways                     4             4             8                12
    Replacement              LRU           NRU           NRU              NRU
    Latency (load to use)    I-Fetch: 1    INT: 1        INT: 5, FP: 6    12
    Write Policy             -             WT (RA)       WB (WA + RA)     WB (WA)
    Bandwidth                R: 32 GB/s    R: 16 GB/s    R: 32 GB/s       R: 32 GB/s
                                           W: 16 GB/s    W: 32 GB/s       W: 32 GB/s

    All caches are physically indexed, pipelined, and non-blocking: score boarded registers allow continued execution until load use
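The load-to-use latencies in the cache table can be combined into an average load latency once hit rates are known. The sketch below uses the integer latencies from the slide (1, 5 and 12 clocks); the hit rates and the 200-clock memory latency are purely hypothetical placeholders.

```python
def average_load_latency(levels, memory_latency):
    """levels: list of (load_to_use_clks, hit_rate) ordered from L1D outwards.

    Each load is charged the latency of the first level where it hits;
    loads that miss everywhere are charged `memory_latency`.
    """
    expected = 0.0
    reach = 1.0                          # fraction of loads that get this far
    for latency, hit_rate in levels:
        expected += reach * hit_rate * latency
        reach *= (1.0 - hit_rate)
    return expected + reach * memory_latency


# Latencies from the slide; hit rates and memory latency are assumptions.
itanium2_levels = [(1, 0.95), (5, 0.80), (12, 0.90)]
avg_clks = average_load_latency(itanium2_levels, memory_latency=200)
```

With these made-up hit rates the average comes out near 1.5 clocks, which shows why a 1-clock L1D dominates the result when it hits most of the time.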


    L1D (1 clock Integer Data Cache)

    - High Performance - 32GB/s, 2 ld AND 2 st ports
    - Write Through - all stores are pushed to the L2
    - FP loads force miss, FP stores invalidate
    - True dual-ported read access - no load conflicts
    - Pseudo-dual store port write access
        - 2 store coalescing buffers/port hold data until L1D update
    - Store to load forwarding

    One clock data cache provides a significant performance benefit


    L2 and L3 Cache

    L2 - 256KB, 32GB/s, 5 clk
    - Data array is pseudo-4 ported - 16 banks of 16KB each

    Non-blocking/out-of-order
    - L2 queue (32 entries) - holds all in-flight load/stores
    - Out-of-order service - smoothes over load/store/bank conflicts, fills
    - Can issue/retire 4 stores/loads per clock
    - Can bypass L2 queue (5,7,9 clk bypass) if
        - no address or bank conflicts in same issue group
        - no prior ops in L2 queue want access to L2 data arrays

    Large iL3 - 3MB, 32GB/s, 12 clk cache on die !!
    - Single ported - full cache line transfers

    Large on die L2 and L3 cache provides significant performance potential


    TLBs

    2-level TLB hierarchy
    - DTC/ITC (32/32 entry, fully associative, .5 clk)
        - Small fast translation caches tied to L1D/L1I
        - Key to achieving very fast 1-clk L1D, L1I cache accesses
    - DTLB/ITLB (128/128 entry, fully associative, 1 clk)
        - All architected page sizes (4K to 4GB)
        - Supports up to 64/64 ITR/DTRs
    - TLB miss starts hardware page walker

    Small fast TLBs enable low latency caches
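Why large pages reduce TLB pressure can be seen from the TLB reach, that is, the amount of memory the 128-entry DTLB can map at once. The helper name is hypothetical and the calculation assumes every entry uses the same page size.

```python
KB = 1 << 10
GB = 1 << 30


def tlb_reach_bytes(entries, page_size_bytes):
    """Memory covered by a TLB when every entry maps one page of the given size."""
    return entries * page_size_bytes


small_page_reach = tlb_reach_bytes(128, 4 * KB)   # 512 KB with 4K pages
large_page_reach = tlb_reach_bytes(128, 4 * GB)   # 512 GB with 4GB pages
```

With 4K pages the DTLB covers less than the 3MB L3 cache, while 4GB pages let the same 128 entries cover far more than any working set of the time.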


    System Bus Enhancements

    - Extension of the Itanium processor bus
        - Same protocol with minor extensions
    - Increased to 6.4GB/s bandwidth
        - frequency 200MHz, 400MHz data, 128-bit data bus
    - Bus is non-blocking and out of order
        - Most transactions can be deferred for later service
    - Buffering
        - 18 bus requests/CPU are allowed to be outstanding
        - 16 Read Line + 6 Write Line + two 128 byte WC buffers

    Itanium 2 significantly extends the system bus performance level


    New Bus Transactions

    - L3 cast-outs (Normally silent L3 replacement (E->I, S->I))
        - Reduces snoop traffic in Directory based systems
        - Backward inquiry for L2, L1 coherency
    - Memory read current
        - non-destructive (non-coherent) snoop of CPU lines
        - Used in high bandwidth graphic based systems
    - Cache Cleanse
        - writes all modified lines to memory (M->E)
        - Used in fault tolerant systems, invoked via PAL

    Itanium 2 provides several new bus transactions to improve performance/reliability


    Error Features

    Error detection on all major arrays
    - Parity coverage on L1D, L1I, and TLBs
    - ECC on L2 and L3
        - double bit detection, single bit correction - out of path repair
        - all errors are fully contained

    Bus is covered with parity/ECC
    - double bit detection, single bit correction on transmission
    - Error Isolation (end-to-end error detection)
        - From memory: unique FSB 2xECC syndrome encoding can tolerate additional single bit errors in transmission
        - Error not reported until referenced by a consuming process

    Itanium 2 provides extensive error detection/correction/containment
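"Single bit correction, double bit detection" can be illustrated with a textbook extended Hamming code on 4 data bits. This is only a toy with hypothetical function names; the slides do not describe the actual code or word width used in the Itanium 2 caches or bus.

```python
def encode(data):
    """Encode 4 data bits into 8 bits: Hamming(7,4) plus one overall parity bit."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p3, d2, d3, d4]   # bit positions 1..7
    overall = 0
    for bit in word:
        overall ^= bit
    return word + [overall]


def decode(word):
    """Return (data_bits, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]        # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]        # checks positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]        # checks positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3       # position of a single flipped bit, or 0
    overall = 0
    for bit in c:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one error and repair it.
        index = syndrome - 1 if syndrome else 7   # syndrome 0 means the parity bit itself
        c[index] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        status = "uncorrectable"
    return [c[2], c[4], c[5], c[6]], status


data = [1, 0, 1, 1]
codeword = encode(data)
single = list(codeword)
single[4] ^= 1                            # one flipped bit
double = list(codeword)
double[1] ^= 1
double[5] ^= 1                            # two flipped bits
single_result = decode(single)
double_result = decode(double)
```

A single flipped bit is repaired transparently, while two flipped bits are flagged rather than silently miscorrected.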


    Thermal Management

    - Programmable fail-safe thermal trip
    - Itanium 2 will reduce power consumption
        - Reduce power consumption to ~60% of peak
        - Execution rate dropped to 1 inst per clock
        - Corrected Machine Check notification posted to OS
        - Full speed execution resumes when temperature drops
    - Never invoked in properly designed and operating cooling systems
        - even on worst case power code

    Itanium 2 provides a thermal fail-safe mechanism in the event of a cooling failure


    Itanium 2 Processor

    - Itanium 2 builds on and extends the Itanium processor family to meet the needs of the most demanding enterprise and technical computing environments
        - Enhanced Itanium 2 features are a result of extensible Itanium architecture
        - Itanium 2 is binary compatible with Itanium processor software
    - Major enhancements include:
        - Increased frequency
        - Enhanced micro-architecture - more execution units, issue ports
        - Efficient data handling; higher bandwidth and reduced latencies


    Intel Itanium 2 Processor

    Platforms


    Performance Scaling

    Scale-Out (Cluster)

    Scale-Up (SMP, ccNUMA)


    Scaling Out and Scaling Up

    Scaling Right

    [Figure: systems arranged by size: DP/1U, DP/2U, 4P/4U, 4P/8P/16P, 16P, 32P, 64-512P. Axis label: Increase Capacity and Capability.]

    Do more, better and faster at lower costs.


    Itanium Processor Family
    OEM Server Designs

                           4P                   8-16P              >16P
    Itanium Processor      17 OEMs Shipping     4 OEMs Shipping    1 OEM Shipping
    Itanium 2 (Madison)    >20 OEM Platforms    10 OEM Designs     6 OEM Designs

    Substantial investment by OEMs in custom high-end platforms and growing


    Itanium 2 Systems
    High-end Itanium 2-based systems

    >2X more than Itanium !

    Vendor       Configuration    Availability
    Racksaver    DP/1U            1H 2003
    Intel        4P/4U, 2P/2U     Q4 2002 / Q2 2003
    Unisys       16P              Q4 2002
    NEC          32P              Shipping
    SGI          64/512P          Early 2003
    IBM          4P/8P/16P        Early 2003
    HP           DP/2U            Shipping
    HP           2P WS            Shipping


    Itanium 2-based Servers
    Bringing High-End Capabilities to Intel Architecture

    - Large Memory Capacity
        - Ex. 4P node w/48GB
        - 512P+ system w/512GB
    - Scalable to High-End Multi-Processing
        - 32P+ SMP systems
        - 512P+ Clustered configurations
    - High-Bandwidth, Flexible I/O
        - Large Qty PCI-X slots
        - Dual GbE LAN
        - Ultra 320 SCSI
        - Remote I/O capabilities
    - Partitioning
        - Multiple System Images
        - Static/Dynamic Domains
    - High-End RAS
        - Intelligent Platform Management
        - Hardware redundancy for Fault-Tolerance
        - Modular and Hot-Plug Capabilities

    Selected examples of some high-end OEM platform capabilities. Not all capabilities found on all platforms.

    OEMs will offer datacenter computing capabilities with their Itanium 2-based servers


    The Chipset

    [Figure: the chipset sits between the processors, the memory modules and the I/O devices. Chipset blocks shown: Memory & I/O Controller, Memory Bridge, I/O Bridge.]

    The chipset is a key ingredient to platform design and performance


    Hewlett-Packard zx1 Chipset

    HP scalable processor chipset zx1 - 3 modular components

    Optimized for:
    - 1-2P workstations
    - 2-4P servers

    Designed for great cost & performance
    - Great developers desktops
    - High-performance clusters

    Features:
    - 6.4 GB/s processor bandwidth
    - 12.8 GB/s memory bandwidth
    - 4.0 GB/s I/O bandwidth
    - Extremely low latency

    [Figure: HP zx1 chipset block diagram. Labels: Intel Itanium 2 processors (1-4), HP zx1 memory & I/O controller, HP zx1 scalable memory expander, DIMMs, HP zx1 I/O adapter, PCI bus, PCI-X bus, AGP bus.]


    HP Itanium 2 Processor based Systems

    - 1 GHz Itanium 2, 4-way, HP zx1 chipset
    - 900MHz/1GHz Itanium 2, 1-2 way, HP zx1 chipset, AGP4X OEM graphics
    - 900MHz Itanium 2, 1-way, HP zx1 chipset, AGP4X OEM graphics


    Itanium 2 Workstations

    HP zx6000 HP zx2000


    NEC Itanium 2-based Server

    TX7 Series "TX7/i6010, i6510, i9010/i9510"

    LINPACK HPC of 101.77 GFLOPS on 32 CPUs

    http://www.nec.co.jp/press/en/0207/0901.html
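To put the LINPACK figure in context, it can be compared with the theoretical peak of 32 processors. The calculation below assumes 1 GHz Itanium 2 parts (the slide does not state the clock) and counts each of the two FMAC units as two floating-point operations per clock; the function name is hypothetical.

```python
def peak_gflops(cpus, clock_ghz, fmacs_per_cpu=2, flops_per_fma=2):
    """Theoretical peak: every FMAC retires one multiply-add (2 flops) per clock."""
    return cpus * clock_ghz * fmacs_per_cpu * flops_per_fma


peak = peak_gflops(cpus=32, clock_ghz=1.0)   # assumed 1 GHz clock
efficiency = 101.77 / peak                   # measured LINPACK vs. peak
```

Under that assumption the peak is 128 GFLOPS, so the quoted result would correspond to roughly 80% of peak.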


    Shared Memory via ccNUMA

    http://www.sgi.com/features/2003/jan/altix/


    Intel E8870 Chip-Set


    Intel E8870 Block Diagram

    System Bus:
    - 16 Bytes Wide
    - Double Pumped
    - 200MHz/400MT/s
    - 6.4 GB/sec

    Memory:
    - Quad Memory Channels
    - 6.4 GB/sec peak
    - 16 DDR DIMM Sites
    - 32 GB max

    I/O Busses:
    - Hot Plug PCI-X up to 133MHz
    - Direct Attached InfiniBand*

    Hub Interface 2.0:
    - 4 pt-to-pt Busses
    - 16 Bits Wide
    - A Total of 4 GB/sec

    Scalability Ports:
    - 2 pt-to-pt Connects
    - 16 Bits Wide
    - 6.4 GB/sec Full Duplex

    [Figure: block diagram. Labels: four Processors on the System Bus, 870SNC, four MRHD on 4 Memory Channels, Data Bus, scalability ports SP0 and SP1, 870SIOH, HL2 links, three 870P64H2 (PCI-X at 100/133, SCSI, LAN), 870VXB (InfiniBand*), HL1 @ 266MB/s, 870ICH, PCI 32/33, Video, LPC, FWH, BMC.]


    Intel E8870 Chipset Architecture

    Key Features
    - Open platform architecture
        - Efficient use of building blocks
        - End user ease of upgrade value
    - Versatile chipset spanning multiple segments
        - 4 and 8 way Servers
        - Scalability port building block enables up to 512 way configurations
    - Balanced system performance
        - Memory, scalability port, I/O bandwidth
        - Maximizes system throughput
    - Persistent/scalable interfaces
        - Reuse spans processor generations
        - Systems scalability headroom
    - Robust RAS features

    [Figure: two nodes, each with Processors, Memory, a Scalability Node Controller and an I/O Hub with PCI-(X) Bridges and Legacy I/O, joined through Scalability Port Switches.]


    4-Way, 4U, High Performance, Modular Platform

    - 4-way Itanium 2
    - Intel E8870 chipset
    - 16 DDR DIMMs (32GB)
    - PCI-X up to 133MHz
    - Lower MTBR
        - Tool Less Insertion Extraction
        - Blind Mate Modules
        - No Cable Assembly


    Intel Itanium 2 Processor

    Software Environment


    C/C++ Data Model

    The OS implements the data model:

    - ILP32: int, long and pointer are 32 bits. Used by 32-bit OSs.
    - LP64: int is 32 bits; long and pointer are 64 bits. Used by 64-bit UNIX/Linux OSs.
    - P64: int and long are 32 bits; pointer is 64 bits. Used by Win64* and Modesto*.

    Default settings, size in bits:

               ILP32   LP64   P64
    int        32      32     32
    long       32      64     32
    pointer    32      64     64
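A minimal sketch of how code can check which of these data models it was built for. The enum and the function names `classify_data_model` and `current_data_model` are hypothetical helpers introduced for illustration, not part of any Intel tool; the classification simply restates the table above.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical labels for the three data models named on the slide. */
enum data_model { MODEL_UNKNOWN, MODEL_ILP32, MODEL_LP64, MODEL_P64 };

/* Classify a data model from the bit widths of int, long, and pointer. */
enum data_model classify_data_model(size_t int_bits, size_t long_bits,
                                    size_t ptr_bits)
{
    if (int_bits == 32 && long_bits == 32 && ptr_bits == 32)
        return MODEL_ILP32;
    if (int_bits == 32 && long_bits == 64 && ptr_bits == 64)
        return MODEL_LP64;
    if (int_bits == 32 && long_bits == 32 && ptr_bits == 64)
        return MODEL_P64;
    return MODEL_UNKNOWN;
}

/* Report the model used by the compiler and OS this code is built for. */
enum data_model current_data_model(void)
{
    return classify_data_model(sizeof(int) * CHAR_BIT,
                               sizeof(long) * CHAR_BIT,
                               sizeof(void *) * CHAR_BIT);
}
```

Code that stores a pointer in a `long` works under ILP32 and LP64 but truncates the pointer under P64, which is why a check like `current_data_model()` can be useful when porting between 64-bit UNIX/Linux and Win64.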

    OSV Support for Itanium Processor Family

    HP-UX*: Fully supported 1.5 release now, Version 1.6 update planned for 2H '02

    Red Hat*, SuSE*, Caldera*, Turbolinux* Linux in production today

    - Windows* XP 64-bit for 1-2 way workstations in production today
    - 64-bit version of Windows* Advanced Server, Limited Edition available for early adopters now
    - Windows .NET Server scheduled for 1H03

    OpenVMS, NonStop Kernel, Converged Enterprise UNIX

    - Port to Itanium architecture underway
    - Developer versions target 03, production versions in 04


    Software Solution Support

    High-End Enterprise Applications (Databases, Business Intelligence, ERP / SCM)

    - Beta version since Q1 02, focusing on optimization
    - Developer version available since Q4 01
    - DB2 early adopter release available since 2H 01
    - Engaged with early adopter end-users, strong performance
    - Production version targets mid-02, performance for large data sets
    - Initial porting work complete, optimization on-going

    Future product plans from: Ariba, Autonomy, BEA, BMC Software, Check Point Software, Citrix, Commvault, Computer Associates, Covalent, Entrust, IBM WebSphere, Informix, Intershop, JD Edwards, Manugistics, MigraTEC, Network Associates, Nuance, Oasis, Oblix, Openshop, TimesTen, Tivoli Systems, Verisign, Veritas, Zeus


    Intel Software Tools

    Optimized for on


    Intel Software Development Tools

    - Compilers
    - Intel Threading Tools
    - VTune Performance Analyzer
    - Performance Libraries

    SW Products, Developer Services: www.intel.com/ids


    Intel Compilers

    - Targeted for Intel Architecture based Windows* and Linux* platforms
    - Optimized for the latest Intel microprocessors:
      - Intel Pentium 4 Processor
      - Intel Xeon Processor
      - Intel Itanium Processor
      - Intel Itanium 2 Processor
    - Auto-vectorization and OpenMP support
    - Integration of CVF technologies in 2003

    http://developer.intel.com/software/products/compilers


    Development Tools for Windows*

    Compilers
    - MSFT C/C++ Platform SDK
    - Intel C/C++
    - Intel Fortran95

    Performance Tools
    - Intel IPP Library
    - Intel MKL Library
    - Intel VTune Performance Analyzer
    - Intel KAI KAP/Pro* Toolset

    Java
    - IBM JDK
    - BEA JRockit* JDK
    - TowerJ*


    Development Tools for Linux*

    Compilers
    - GNU gcc
    - Intel C/C++
    - Intel Fortran95

    Performance Tools
    - Intel IPP Library
    - Intel MKL Library
    - Intel VTune Performance Analyzer Collector
    - Intel KAI KAP/Pro* Toolset
    - Linux glibc Library

    Java
    - IBM JDK


    Intel Software Toolset

    - Intel Compilers
    - Intel Performance Libraries
    - Intel VTune Performance Analyzer

    KAI technologies being integrated during 2003:
    - KAI OpenMP: Intel Compilers
    - KAI Assure: Intel Thread Checker
    - KAI GuideView: Intel Thread Profiler


    Intel Compiler Architecture

    [Diagram: compiler pipeline, top to bottom]

    - Front ends: C/C++; FORTRAN 77/95
    - Interprocedural analysis and optimizations: inlining, constant prop, whole program detect, mod/ref, points-to
    - Loop optimizations: data deps, prefetch, scalar repl, unroll/interchange/fusion/dist, auto-parallel/OpenMP
    - Global scalar optimizations: partial redundancy elim, dead store elim, strength reduction, dead code elim
    - Code generation: predication, software pipelining, global scheduling, register allocation, code generation

    Components shared across the phases:

    - Disambiguation: types, array, pointer, structure, directives, ld safety
    - Machine Model
    - Profiler


    Intel Compilers Version 7.0

    - Released November 2002
    - Improved stability and optimization
    - Supports Itanium 2 (-tpp2)
    - More OpenMP 2.0 support
    - Improved C99 standard support
    - Improved gcc compatibility
    - More and better reporting switches
    - New Fortran directives (e.g. PREFETCH)
    - Bridge to Version 8.0 (CVF to IVF), improved compatibility with CVF

    Performance Counter

    - Light-weight performance analysis tool to complement VTune
    - Leverages HP's excellent pfmon on Itanium Architecture Linux64

    Intel Performance Libraries

    - Intel MKL (Math Kernel Library)
      - Highly optimized library to provide high performance on critical kernel operations in science and engineering
      - Parallelism built into the library for automatic SMP support
      - Vector Math Library (VML)
    - Intel IPP (Integrated Performance Primitives)
      - Highly optimized functions to provide high performance on critical kernel operations for multi-media data types
      - Available on multiple platforms to increase the portability of performance-based applications

    http://developer.intel.com/software/products/perflib

    VML Performance

    [Chart: VML performance; series: x87, Pentium III Processor, Pentium 4 Processor, Itanium Processor]


    Worldwide Support & Solution Centers

    [Map: support and solution centers; legend: OEM, SI/SP]


    General Optimizations

    - -O0: disables optimization
    - -O1: optimizes for speed without increasing code size
    - -O2: optimizes for speed (default)
    - -O3: enables -O2 plus more aggressive optimizations, may not improve performance for all programs
    - -tpp2: Itanium 2 code generation (instruction mix)
    - -fno-alias: assumes no aliasing in program (may be unsafe)
    - -align: analyzes and reorders memory layout for variables and arrays (FTN only)
    - -pad: enables changing variable and array memory layout (FTN only)
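The aliasing assumption behind -fno-alias can be illustrated with a small sketch. The function names below are hypothetical, and the example uses the C99 `restrict` qualifier, which states the same no-overlap promise for individual pointers rather than for the whole program; it assumes a compiler with C99 `restrict` support.

```c
#include <stddef.h>

/* Without further information the compiler must assume dst and src may
   overlap, so a store to dst[i] could change a later src[i]. That limits
   how freely loads can be moved ahead of stores when scheduling the loop. */
void scale_may_alias(double *dst, const double *src, double k, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* restrict promises that dst and src never overlap for the duration of
   the call. The caller is responsible for keeping that promise; passing
   overlapping buffers here is undefined behavior. */
void scale_no_alias(double *restrict dst, const double *restrict src,
                    double k, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```

Both functions compute the same result for non-overlapping buffers. The difference is only in what the compiler is allowed to assume, which is also why a program-wide switch such as -fno-alias is flagged as potentially unsafe: it is correct only if no pointers in the program actually alias.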

    Interprocedural Optimization

    Extends optimizations across file boundaries.

    - Without IPO (or with -ip): file1.c, file2.c, file3.c and file4.c are each compiled and optimized separately.
    - With IPO: file1.c, file2.c, file3.c and file4.c are compiled and optimized together.


    How IPO Works

    1. Compile program: icc -c -ipo foo.c
       - Produces foo.o (fake object file) and foo.il (un-optimized intermediate language files)
    2. Link program: icc -o foo -ipo foo.o
       - 2a. Compiler performs whole-program optimizations
       - 2b. Compiler invokes linker to produce executable
       - Produces foo (optimized executable)

    How PGO Works

    1. Compile+link to add instrumentation: icc -o foo -prof_gen foo.c
       - Produces foo (instrumented executable)
    2. Execute instrumented program: ./foo
       - Produces 12345678.dyn (dynamic profile)
    3. Compile+link using feedback: icc -o foo -prof_use foo.c
       - Reads pgopti.dpi (merged .dyn files)
       - Produces foo (optimized executable)
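As a rough illustration of the kind of code that can benefit from this flow, the sketch below shows a loop with a strongly biased branch. The function `count_unusual` is a hypothetical example, not taken from the slides; it would be called from a main() that feeds it representative input during the instrumented run.

```c
#include <stddef.h>

/* Build steps as listed on the slide (foo.c is the slide's placeholder name):
 *   icc -o foo -prof_gen foo.c    instrumented build
 *   ./foo                         run on representative input
 *   icc -o foo -prof_use foo.c    rebuild using the recorded profile
 */

/* Counts bytes outside the printable ASCII range. On ordinary text the
   branch below is almost never taken; a profile run records that bias,
   and the compiler can then lay out and schedule the common path first. */
size_t count_unusual(const unsigned char *buf, size_t n)
{
    size_t i;
    size_t unusual = 0;

    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c < 0x20 || c > 0x7e)   /* rare on printable text */
            unusual++;
    }
    return unusual;
}
```

The profile is only as good as the training input: if the instrumented run sees mostly binary data while production sees text, the recorded branch bias points the optimizer the wrong way.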

    Itanium Tuning Tips

    - Enable the compiler
      - Software pipelining of key loops
      - Pointer disambiguation in C codes
      - Interprocedural optimization
      - Profile guided optimizations
    - Utilize cache hierarchy (spatial & temporal locality)
    - Use tuned libraries
    - Use tuning tools
      - Intel VTune Performance Analyzer
    - Use Web resources
      - http://developer.intel.com/itanium
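The cache-hierarchy tip can be sketched with two traversals of the same matrix. The function names are hypothetical and the matrix is assumed to be stored in row-major order in a flat array, as C does for two-dimensional arrays.

```c
#include <stddef.h>

/* Row-major traversal: the inner loop walks consecutive addresses, so each
   cache line that is fetched gets fully used (good spatial locality). */
double sum_row_major(const double *a, size_t rows, size_t cols)
{
    double s = 0.0;
    size_t i, j;
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            s += a[i * cols + j];
    return s;
}

/* Column-major traversal of the same data: the inner loop strides by a
   whole row, so for large matrices each access tends to touch a different
   cache line. */
double sum_col_major(const double *a, size_t rows, size_t cols)
{
    double s = 0.0;
    size_t i, j;
    for (j = 0; j < cols; j++)
        for (i = 0; i < rows; i++)
            s += a[i * cols + j];
    return s;
}
```

Both functions visit the same elements and differ only in loop order. Swapping the loops in the second version to match the first is the loop interchange transformation that compilers may also apply automatically when they can prove it is safe.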


    Thank You.

    www.intel.com