
  • © 2005 IBM

    HPS

    ScicomP 11
    Charles Grassl
    IBM
    May, 2005

  • 2 © 2005 IBM Corporation

    Agenda

    • Evolution
    • Technology
    • Performance characteristics
    • RDMA
    • Collectives

  • 3 © 2005 IBM Corporation

    Evolution

    • High Performance Switch (HPS)
      • Also known as “Federation”
    • Follow-on to SP Switch2
      • Also known as “Colony”

    Generation            Processors
    Switch                POWER2
    SP Switch             POWER2 → POWER3
    SP Switch2 (Colony)   POWER3 → POWER4
    HPS (Federation)      POWER4 → POWER5

  • 4 © 2005 IBM Corporation

    Technology

    • Internal network
      • In lieu of, e.g., Gigabit Ethernet
    • Multiple links per node
      • Match number of links to number of processors

                     SP Switch2                    HPS
    Latency:         15 microsec.                  5 microsec.
    Bandwidth:       500 Mbyte/s                   2 Gbyte/s
    Configuration:   1 adaptor per logical node    1 adaptor, 2 links, per MCM;
                                                   16 Gbyte/s per 32-processor node

  • 5 © 2005 IBM Corporation

    HPS Packaging

    • 4U, 24-inch drawer
    • 16 ports for server-to-switch connections
    • 16 ports for switch-to-switch connections
    • Host attachment directly to the server bus via a 2-link or 4-link
      Switch Network Interface (SNI)
      • Up to two links per pSeries 655
      • Up to eight links per pSeries 690

  • 6 © 2005 IBM Corporation

    HPS Switch Configuration

    [Diagram: a switch board of HPS switch chips connects, through link
    driver chips (LDC) with copper or fiber-optic link drivers, to HPS
    adapters built from Canopus chips with local RAM, which attach to
    POWER4 and POWER5 servers over the GX bus.]

  • 7 © 2005 IBM Corporation

    HPS Software

    • MPI-LAPI (PE V4.1)
      • Uses LAPI as the reliable transport
      • Library uses threads, not signals, for asynchronous activities
    • Existing applications binary compatible
    • New performance characteristics
      • Eager
      • Bulk transfer
      • Improved collective communication

  • 8 © 2005 IBM Corporation

    HPS Software Architecture

    [Diagram: software architecture, user space and kernel space. In user
    space, applications use MPI, ESSL/PESSL, and sockets (TCP/UDP over IP),
    with LoadLeveler and CSM alongside; MPI and GPFS/VSD run over LAPI and
    the IF_LS/HAL interfaces, down through the hypervisor (HYP) to the
    HPS SMA3+ adapter.]

  • 9 © 2005 IBM Corporation

    Supported Communication Modes

    • FIFO mode
      • Message chopped into 2 KB packet chunks on the host and copied by
        the CPU
      • Memory bus crossings depend on caching; at least one I/O bus crossing
    • Remote Direct Memory Access (RDMA)
      • No slave-side protocol
      • CPU offload
      • Enhanced programming model
      • One I/O bus crossing

    [Diagram: in FIFO mode the CPU moves the user buffer into the network
    FIFO with loads and stores before the adapter sends it; with RDMA the
    adapter DMAs directly to and from the user buffer.]

  • 10 © 2005 IBM Corporation

    Underlying Message Procedures

    • Protocols
      • Rendezvous: “large” messages
      • Eager: “small” messages
      • MP_EAGER_LIMIT (range: 0 - 65536)
    • Mechanisms
      • Packet
      • Bulk
      • MP_BULK_MIN_MSG_SIZE (range: any non-negative integer)
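
    A minimal sketch of how the protocol choice surfaces in application code,
    assuming MP_EAGER_LIMIT (e.g. 4096) and MP_BULK_MIN_MSG_SIZE are set in the
    job environment before launch; the library reads them at startup, and the
    message sizes below are illustrative only.

```c
/* Sketch: eager vs. rendezvous behavior for small and large messages.
 * Assumes MP_EAGER_LIMIT=4096 (illustrative) in the job environment.
 */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    char small_msg[1024];            /* below a 4096-byte eager limit */
    static char large_msg[1 << 20];  /* well above the eager limit    */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    memset(small_msg, 0, sizeof(small_msg));
    memset(large_msg, 0, sizeof(large_msg));

    if (rank == 0) {
        /* Eager: the data is pushed to the receiver immediately, so this
         * send can complete before rank 1 posts its receive. */
        MPI_Send(small_msg, (int)sizeof(small_msg), MPI_CHAR, 1, 0,
                 MPI_COMM_WORLD);

        /* Rendezvous: only a header goes first; the data moves (by packets
         * or bulk transfer, depending on MP_BULK_MIN_MSG_SIZE) after rank 1
         * posts a matching receive. */
        MPI_Send(large_msg, (int)sizeof(large_msg), MPI_CHAR, 1, 1,
                 MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(small_msg, (int)sizeof(small_msg), MPI_CHAR, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(large_msg, (int)sizeof(large_msg), MPI_CHAR, 0, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```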

  • 11 © 2005 IBM Corporation

    MPI Transfer Protocols

    [Diagram: small messages use the eager protocol, with the header and
    message data sent to the receiver immediately; large messages use the
    rendezvous protocol, with the header sent first and the message data
    following after the receiver responds.]

  • 12 © 2005 IBM Corporation

    MPI Transfer Mechanisms

    [Diagram: small messages cross the switch as individual packets; large
    messages cross the switch as a single bulk transfer.]

  • 13 © 2005 IBM Corporation

    Remote Direct Memory Access (RDMA)

    • Overlap of computation and communication (possible)
    • Fragmentation and reassembly offloaded to the adapter
    • Minimized packet-arrival interrupts
    • Asynchronous messaging applications
    • All tasks sharing an adapter need not communicate at the same time
    • One-sided programming model
    • Zero-copy transport
      • Reduced memory subsystem load
    • Striping of very large messages
    • Implications for interference with other tasks when copying is used
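
    A hedged sketch of the computation/communication overlap that RDMA makes
    possible, assuming bulk transfer is enabled (MP_USE_BULK_XFER=yes) and the
    message is large enough to use it; the buffer size and the do_local_work()
    routine are illustrative placeholders.

```c
/* Sketch: overlapping computation with a large nonblocking transfer that
 * the adapter can move by RDMA (bulk transfer) while the CPU keeps working.
 */
#include <mpi.h>
#include <stdlib.h>

#define NDOUBLES (1 << 20)   /* ~8 Mbyte, well above the bulk threshold */

static void do_local_work(double *x, int n)
{
    /* Computation that does not touch the message buffers in flight. */
    for (int i = 0; i < n; i++)
        x[i] = x[i] * 1.0001 + 1.0;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc(NDOUBLES * sizeof(double));
    double *recvbuf = malloc(NDOUBLES * sizeof(double));
    double *work    = malloc(NDOUBLES * sizeof(double));
    for (int i = 0; i < NDOUBLES; i++) {
        sendbuf[i] = (double)rank;
        work[i]    = 0.0;
    }

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* Post the large transfers; fragmentation and reassembly are offloaded
     * to the adapter, leaving the CPU free. */
    MPI_Irecv(recvbuf, NDOUBLES, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, NDOUBLES, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    do_local_work(work, NDOUBLES);    /* compute while the data moves */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}
```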

  • 14 © 2005 IBM Corporation

    New MPI Performance Models

    • Possible striping of a single message
      • Bandwidth ~ n × 2 Gbyte/s
    • Small dependence on user large pages
    • Collectives
    • Eager limit

  • 15 © 2005 IBM Corporation

    MPI Single Messages: Large Page vs. Small Page

    [Chart: rate (Mbyte/s) vs. message length (bytes) for large pages (LP)
    and small pages (SP). p655 1.5 GHz, HPS, RDMA.]

  • 16 © 2005 IBM Corporation

    MPI Single Messages: LP vs. SP

    [Chart: time (microseconds) vs. message length (bytes) for large pages
    (LP) and small pages (SP).]

  • 17 © 2005 IBM Corporation

    MPI Single Messages

    [Chart: rate (Mbyte/s) vs. message length (bytes).]

    • Message striping at ~500,000 bytes
    • Bandwidth:
      • 1.5 Gbyte/s at “modest” sizes
      • 3 Gbyte/s at large sizes
    • Small sensitivity to page size

  • 18 © 2005 IBM Corporation

    MPI Single Messages: Eager Limits

    [Chart: time (microseconds) vs. message length (bytes) for eager limit
    4096 and eager limit 0.]

  • 19 © 2005 IBM Corporation

    Eager Limit

    [Chart: time (microseconds) vs. message length (bytes) for eager limit
    4096 and eager limit 0.]

    • Reduce latency from 20 microseconds to 7 microseconds
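
    A minimal ping-pong sketch of the kind of latency measurement behind these
    curves, assuming the job runs with at least two tasks; the message size and
    iteration count are arbitrary. Running it with MP_EAGER_LIMIT=0 and again
    with a nonzero limit shows the small-message difference.

```c
/* Sketch: small-message ping-pong latency between ranks 0 and 1.
 * Run with at least two tasks; size and iteration count are arbitrary.
 */
#include <mpi.h>
#include <stdio.h>

#define NITER  1000
#define MSGLEN 8          /* small message, below typical eager limits */

int main(int argc, char **argv)
{
    int rank;
    char buf[MSGLEN] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < NITER; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSGLEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSGLEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSGLEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSGLEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("half round-trip latency: %.2f microseconds\n",
               (t1 - t0) / (2.0 * NITER) * 1.0e6);

    MPI_Finalize();
    return 0;
}
```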

  • 20 © 2005 IBM Corporation

    MPI Single Messages: RDMA vs. no RDMA

    [Chart: rate (Mbyte/s) vs. message length (bytes) with RDMA and without
    RDMA. p655 1.5 GHz, HPS.]

  • 21 © 2005 IBM Corporation

    RDMA

    [Chart: rate (Mbyte/s) vs. message length (bytes) with RDMA.]

    • Message striping starts at 500,000 bytes
    • Adjust with MP_BULK_MIN_MSG_SIZE

  • 22 © 2005 IBM Corporation

    MPI Collectives

    • Take special advantage of 64-bit addressing
      • More aggressive algorithms
    • Example:
      • MPI_Bcast, 32 tasks: 25% faster with 64-bit addressing
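
    A short sketch of a broadcast timing loop of the sort these comparisons
    suggest; the optimized collectives are applied by the library transparently,
    so the source is identical for 32-bit and 64-bit builds and only the compile
    mode changes. The message length and iteration count are arbitrary.

```c
/* Sketch: timing MPI_Bcast; the collective optimizations are transparent,
 * so the same source serves 32-bit and 64-bit builds.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int len   = 500000;   /* bytes, arbitrary */
    const int niter = 100;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = calloc(len, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < niter; i++)
        MPI_Bcast(buf, len, MPI_CHAR, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("MPI_Bcast of %d bytes: %.1f microseconds (average)\n",
               len, (t1 - t0) / niter * 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```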

  • 23 © 2005 IBM Corporation

    MPI_Bcast: 32-bit vs. 64-bit

    [Chart: 64- and 32-task MPI_Bcast time (microseconds) vs. message length
    for 64-bit and 32-bit addressing. p655 1.5 GHz, HPS, RDMA.]

  • 24 © 2005 IBM Corporation

    MPI_Bcast: 32-bit vs. 64-bit

    [Chart: 32-task MPI_Bcast time (microseconds) vs. message length for
    64-bit and 32-bit addressing. p655 1.5 GHz, HPS, RDMA.]

  • 25 © 2005 IBM Corporation

    MPI_Bcast: 32-bit vs. 64-bit

    [Chart: 64-task MPI_Bcast time (microseconds) vs. message length for
    64-bit and 32-bit addressing. p655 1.5 GHz, HPS, RDMA.]

  • 26 © 2005 IBM Corporation

    Application Considerations

    • MPI-LAPI has a different architecture than the prior version
    • Bulk transfer: larger messages (>500 Kbyte) are used
    • Set MP_SINGLE_THREAD=yes (if possible)
    • 32-bit applications:
      • Will not use LAPI shared memory for large messages
        • Convert to 64-bit
      • Will not use MPI collective communication optimizations
        • Convert to 64-bit
    • Applications that use signal handlers may require some changes
    • MPL is no longer supported

  • 27 © 2005 IBM Corporation

    MPI Environment Variables

    Environment Variable     Recommended Value
    MP_EUILIB                us
    MP_EUIDEVICE             sn_single, sn_all
    MP_SHARED_MEMORY         yes
    MP_USE_BULK_XFER         yes
    MP_SINGLE_THREAD         yes*
    MP_BULK_MIN_MSG_SIZE     128 kbyte (default)

    * If possible

  • 28 © 2005 IBM Corporation

    Bandwidth Structure: Performance Aspects

    • Shared memory
      • 3 Gbyte/s POWER4
      • 4 Gbyte/s POWER5
    • Large pages
      • Reduced effect
    • Bulk transfer
      • 3 Gbyte/s for two adaptors
    • Eager limit
      • 15-20 microsec. → 5-7 microsec.
    • Single threaded
      • 1-2 microsec. reduced latency

  • 29 © 2005 IBM Corporation

    HPS Performance Summary

    [Chart: rate (Mbyte/s) vs. message length (bytes) for large pages (LP)
    and small pages (SP).]

    • High asymptotic peak bandwidth
      • ~4x vs. Colony
    • Extra “kink” in the performance curve
      • Bulk transfer
    • Small-message performance improvement
    • Low latency
      • 5 microseconds

  • 30 © 2005 IBM Corporation

    Prescription For Use of RDMA

    • Add to the LoadLeveler configuration file
      (/usr/lpp/LoadL/full/samples/LoadL_config):
      • SCHEDULE_BY_RESOURCES = RDMA
    • Verification:
      • Run with MP_INFOLEVEL=2
      • Stderr must contain the text “Bulk Transfer is enabled”
    • Running:
      • LoadLeveler job keyword: # @ bulkxfer = yes

  • 31 © 2005 IBM Corporation

    Summary

    • HPS
      • Bandwidth
        • Bulk transfers
        • 4x higher bandwidth (large messages)
      • Latency
        • 5 microseconds
