3 Volt a Ire

  • 8/8/2019 3 Volt a Ire

    1/36

    2009 Voltaire Inc.

    Voltaire Unified Fabric Manager: A new dimension to performance analysis and tuning

    Ghislain de Jacquelot

    [email protected]

    Voltaire Inc. Confidential - Internal

    Introducing

    Voltaire's Grid Director 4000 Series

    4036 - 36 ports

    First generally available commercial-grade QDR switches in the market

    Lowest-latency switch at 100 ns/300 ns port-to-port

    Smart switch with advanced management capabilities on-board

    Most mature, 4th-generation switch family and switch silicon

    Most scalable with HyperScale technology

    4700 From 324 ports to

    Infiniband: a black box ?

    An InfiniBand Fabric is not a black box (1/2)

    Requires Hardware management

    Detect failures, communication problems

    Inside the InfiniBand Fabric

    - Port counters

    - Port status (QDR,DDR,SDR 4X,2X,1X)

    - Firmware upgrades (Switch and HCA ASICs)

    Outside the InfiniBand Fabric

    - Chassis

    - Power supplies

    - Fans

    - Temperature

    - Chassis software updates (Switch management)

    An Infiniband Fabric is not a black box (2/2)

    What about performance ?

    Blocking vs non-blocking fabrics ?

    Influence of routing algorithms ? Congestion ?

    Mixing different protocols on the same fabric ?

    Running multiple jobs on the same fabric ?

    Performance monitoring Tools ?

    Some Infiniband technology

    Fabric ?

    is made of switch ASICs interconnected together

    Mellanox InfiniScale III (aka Anafa): 24 ports

    Mellanox InfiniScale IV (aka Shaldag): 36 ports

    [Figure: Inside a 96-port switch. Eight 24-port ASICs each connect 12 nodes; four more 24-port ASICs interconnect them.]
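
The port arithmetic behind the 96-port example can be checked with a minimal Python sketch. It assumes a non-blocking two-tier layout in which every first-tier ASIC splits its ports evenly between nodes and uplinks; the function name `two_tier_fabric` and its parameters are illustrative and not part of any Voltaire tool.

```python
def two_tier_fabric(asic_ports: int, leaf_count: int) -> dict:
    """Size a non-blocking two-tier fabric built from identical switch ASICs.

    Assumption: each first-tier (leaf) ASIC uses half of its ports for nodes
    and the other half for uplinks to the second tier.
    """
    node_ports_per_leaf = asic_ports // 2
    uplinks_per_leaf = asic_ports - node_ports_per_leaf
    total_uplinks = leaf_count * uplinks_per_leaf
    # Second-tier ASICs needed to terminate every uplink (ceiling division).
    spine_count = -(-total_uplinks // asic_ports)
    return {
        "node_ports": leaf_count * node_ports_per_leaf,
        "leaf_asics": leaf_count,
        "spine_asics": spine_count,
        "total_asics": leaf_count + spine_count,
    }


fabric = two_tier_fabric(asic_ports=24, leaf_count=8)
print(fabric)
```

With 24-port ASICs and eight first-tier ASICs this yields 96 node ports and four second-tier ASICs, which matches the 8 + 4 arrangement in the figure.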

    Blocking ?

    Defines the bandwidth ratio between layers in the fabric

    [Figure: three 24-port switch configurations]

    12 nodes, 12 uplinks: fully non-blocking

    16 nodes, 8 uplinks: 50% blocking

    20 nodes, 4 uplinks: 20% blocking
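
The three labels on this slide appear to be the ratio of uplink ports to node ports on each 24-port switch. The sketch below reproduces them under that reading; the helper name `blocking_label` is an assumption made for illustration. The same configurations are often described elsewhere as 2:1 and 5:1 oversubscription.

```python
from fractions import Fraction


def blocking_label(node_ports: int, uplink_ports: int) -> str:
    """Label a switch by its uplink-to-node bandwidth ratio.

    Assumption: all ports run at the same speed, so the port ratio equals
    the bandwidth ratio between the two layers.
    """
    ratio = Fraction(uplink_ports, node_ports)
    if ratio >= 1:
        return "fully non-blocking"
    return f"{float(ratio):.0%} blocking"


for nodes, uplinks in [(12, 12), (16, 8), (20, 4)]:
    print(f"{nodes} nodes, {uplinks} uplinks: {blocking_label(nodes, uplinks)}")
```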

    Congestion ?

    Example: All orange nodes write simultaneously to the IO node (red)

    [Figure: two-tier fabric of 24-port switch ASICs (eight plus four). Compute nodes (CN) on several first-tier switches all send to a single IO node; other first-tier switches are shown with 12 nodes each.]

    Congestion Example

    Degradation due to node oversubscription

    Destination port in saturation (multiple sources)

    Congestion spreads across the fabric: ALL other flows drop to 20% of capacity

    Takes time to recover

    Common with storage traffic

    [Figure: plot annotated with "drop" and "recovery"]

    Routing ?

    InfiniBand packets are destination routed based on the Destination Local ID (DLID) field in the header

    In IB: DLID = route (not only remote address)

    DLIDs are 16 bits

    48K values are used for unicast

    16K values are used for multicast

    At each switch ASIC, the incoming unicast DLID is used as an index into a Linear Forwarding Table (LFT) that returns the outgoing switch port number

    E.g. the InfiniScale III ASIC supports all 48K possible LFT entries

    [Figure: LFT with columns DLID (0 to 11 shown) and Out Port #]
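
The LFT lookup described above can be sketched as a plain array indexed by DLID. This is an illustrative model only, not switch firmware or an OpenSM API: the class and method names are made up, and the LID ranges (unicast 0x0001 to 0xBFFF, multicast 0xC000 to 0xFFFE) come from the InfiniBand architecture rather than from the slide.

```python
UNICAST_MIN, UNICAST_MAX = 0x0001, 0xBFFF      # the "48K" unicast LIDs
MULTICAST_MIN, MULTICAST_MAX = 0xC000, 0xFFFE  # the "16K" multicast LIDs


class LinearForwardingTable:
    """Toy model of one switch ASIC's LFT: DLID index -> output port."""

    def __init__(self) -> None:
        # One slot per possible unicast DLID; None means "no route programmed".
        self._out_port = [None] * (UNICAST_MAX + 1)

    def set_route(self, dlid: int, out_port: int) -> None:
        if not UNICAST_MIN <= dlid <= UNICAST_MAX:
            raise ValueError(f"DLID {dlid:#06x} is not a unicast LID")
        self._out_port[dlid] = out_port

    def lookup(self, dlid: int) -> int:
        if not UNICAST_MIN <= dlid <= UNICAST_MAX:
            raise ValueError(f"DLID {dlid:#06x} is not a unicast LID")
        port = self._out_port[dlid]
        if port is None:
            raise LookupError(f"no route programmed for DLID {dlid:#06x}")
        return port


lft = LinearForwardingTable()
lft.set_route(dlid=5, out_port=3)   # hypothetical entry written by the subnet manager
print(lft.lookup(5))
```

Because the DLID alone selects the output port at every hop, assigning several DLIDs to one destination is what lets a subnet manager offer several routes to it.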

    Communication Patterns (un-balanced)

    Nodes: A B C D E F G H

    Communication pattern: A-C, B-E, D-G, F-H

    2:1 link contention:

    A->C and B->E share uplink to Switch 1 port 1

    G->D and H->F share uplink to Switch 2 port 4

    [Figure: first-tier switches with ports 1 to 4 connect nodes A to H through downlinks and uplinks to Switch 1 and Switch 2 (ports 1 to 4), with per-switch forwarding tables for destinations A to H. Legend: IB path; 2 symmetric IB paths.]
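
The 2:1 contention on this slide can be made explicit by counting how many flows land on each uplink. In the sketch below only the two shared uplinks are taken from the slide; the "balanced" assignment is a hypothetical alternative added for contrast, and the helper name `flows_per_link` is illustrative.

```python
from collections import Counter


def flows_per_link(flow_to_link: dict) -> Counter:
    """Count the flows routed over each (switch, port) uplink."""
    return Counter(flow_to_link.values())


# Uplink assignment as stated on the slide.
unbalanced = {
    ("A", "C"): ("Switch 1", 1),
    ("B", "E"): ("Switch 1", 1),
    ("G", "D"): ("Switch 2", 4),
    ("H", "F"): ("Switch 2", 4),
}

# Hypothetical assignment in which every flow gets its own uplink.
balanced = {
    ("A", "C"): ("Switch 1", 1),
    ("B", "E"): ("Switch 1", 2),
    ("G", "D"): ("Switch 2", 3),
    ("H", "F"): ("Switch 2", 4),
}

worst_unbalanced = max(flows_per_link(unbalanced).values())
worst_balanced = max(flows_per_link(balanced).values())
print(f"unbalanced: {worst_unbalanced}:1, balanced: {worst_balanced}:1")
```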

    Optimization of Parallel Applications ?

    Single-thread optimization

    Some examples:

    Instruction Pipelining

    Blocking

    Prefetch data

    Tools: processor counters, profiling tools, compiler reports, etc.

    Goal: Overcome processor, cache, memory architecture constraints

    Parallel optimization, scalability

    Some examples:

    Load Balancing

    Mix OpenMP and MPI

    Barrier optimization

    Tools: MPI Profilers (Intel Trace Analyzer, etc.)

    Goals: Overcome balancing issues, increase computation to communication ratio, use parallel I/O, etc.

    Fabric optimization ?

    Benchmarking and production environments are different

    Systems are used simultaneously by several applications and several kinds of traffic

    Efficiently handling multiple concurrent flows

    Observations

    Blocking in cut-through networks is a big issue

    Different traffic classes have different requirements

    Collectives and storage require congestion control

    IPC requires low-latency (high-priority)

    Storage may use more bandwidth and not be latency sensitive

    Hardware-based adaptive routing is not efficient with bursty or storage traffic

    Job layout can influence routing decisions

    IPC traffic typically stays within a job, or has unique patterns

    Storage traffic fans into storage nodes

    Management traffic spreads to all nodes

    Hardware capabilities can be destructive if used inappropriately

    E.g. mis-configured adaptive routing or congestion management

    Introducing

    Voltaire UFM: Unified Fabric Manager

    Monitor, Analyze and Optimize

    Ensure fabric health and performance visibility

    Unique visibility into fabric traffic and bottlenecks

    Optimize application performance

    Benchmark performance in real life (we've managed to see 10X improvements)

    Manage the scale-out

    Application-centric platform

    Efficient operations on thousands of fabric resources

    Automate configurations and manage changes on the fly

    Increase fabric up-time and resiliency, better utilization

    Advanced Monitoring and Analysis

    Monitor & analyze fabric performance

    B/W utilization

    Unique congestion monitoring

    Dashboard for aggregated fabric view

    Real-time fabric-wide health monitoring

    Monitor events and errors throughout the fabric

    Threshold-based alarms

    Granular monitoring of host and switch parameters

    Innovative congestion mapping

    One view for fabric-wide congestion and traffic patterns

    Enables root cause analysis for routing, job placement or resource allocation inefficiencies

    All is managed at the application/aggregation level

    Fabric Optimization with UFM

    [Figure: UFM optimization cycle]

    Application Modeling (CLI / GUI / API): characterize application traffic and priorities

    Optional: Schedulers

    Fabric Optimization: fabric virtualization and QoS; optimize routing and job placement

    Monitoring: show traffic and congestion information

    Feedback and Analysis

    UFM Application Centric Approach

    Physical Infrastructure

    Virtual Infrastructure

    Applications

    Map application requirements to fabric policies and

    Map element status to application status

    Fabric

    Policy

    Monitoring

    Combining UFM with Industry-Leading Schedulers

    Enabling Intelligent Performance Driven Job Scheduling

    UFM's traffic-aware routing

    Today's routing algorithms are static while clusters are dynamic

    Nodes are moving in and out of the cluster

    Traffic patterns change

    Static algorithms can't cope with changes, resulting in congestion and inefficiencies

    Voltaire routing performance optimization

    Optimizations for various topologies, enhanced over recent years in large clusters

    New major conceptual shift from static to traffic-pattern-based algorithms

    Traffic model can be derived automatically from topology

    Voltaire's enhancements are built on top of OpenSM in a modular plug-in architecture

    Voltaire's routing optimizations improve fabric performance without increasing cost

    Performance optimization: partitioning and QoS

    UFM makes it possible to run multiple clusters or separate application jobs on the same infrastructure

    Drag-and-drop configuration automatically creates dedicated IPC and virtual I/O for each cluster

    Quality of Service can be associated with fabric partitions so critical applications get priority in fabric routing queues

    Easy configuration of QoS via GUI or CLI assignment to pre-defined service levels

    Changes in application needs are easily handled by simple re-allocation of servers to apps or networks

    Drag-and-drop assignment to a network triggers all configurations behind the scenes

    Critical applications can be allocated the right resources and priority

    Benefits

    Boost Apps Performance with Voltaire UFM

    Optimize Real-Life Environments

    Test Environment

    12 nodes running a bandwidth-consuming job

    2 nodes running a latency-critical job

    Goal: achieve best performance with latency-critical tasks

    W/O Partitioning: Latency degradation of ~215%

    Latency job running alone (Latency = ~0.000210)

    Bandwidth job added on same partition (Latency = ~0.000450)

    Create Partitions and Set QoS in UFM

    Create 2 Logical Groups

    Latency job

    B/W oriented job

    Create 2 Networks

    One for each job

    Assign Service Level

    SL0 Low Latency Queue

    SL1 50% (high/bandwidth)

    (SL2 25%, SL3 25%)

    UFM automatically creates virtual NICs, partitions and Service Level definitions

    Run jobs with isolation and QoS: return almost to original performance (~5% impact only)

    Latency job running alone (Latency = ~0.000210)

    Bandwidth job added on same partition (Latency = ~0.000450)

    Separate partitions and QoS (Latency = ~0.000220) (!)
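
The percentages quoted on these two slides follow directly from the three latency readings. The short sketch below redoes the arithmetic; the slides do not state the unit of the latency values, so the numbers are treated as plain ratios, and the variable names are illustrative.

```python
def relative_latency(latency: float, baseline: float) -> float:
    """Return latency as a multiple of the baseline measurement."""
    return latency / baseline


# Readings copied from the slides (unit not stated there).
alone = 0.000210                # latency job running alone
shared_partition = 0.000450     # bandwidth job added on the same partition
separate_partitions = 0.000220  # separate partitions and QoS

shared_ratio = relative_latency(shared_partition, alone)
isolated_overhead = relative_latency(separate_partitions, alone) - 1

print(f"shared partition: {shared_ratio:.0%} of the baseline latency")
print(f"separate partitions and QoS: {isolated_overhead:.1%} above baseline")
```

The first ratio comes out slightly above twice the baseline and the second just under 5%, in line with the "~215%" and "~5% impact" figures on the slides.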

    Voltaire UFM

    Redefining Fabric Management

    OpenSM: Subnet Manager only, Technology Test Bed

    Voltaire engineer is the OpenSM Maintainer

    Voltaire UFM: Monitor, Analyze & Optimize application performance; Automate and ease fabric management; Uses OpenSM with advanced routing Plug-ins

    Other Fabric Mgmt. Solution: Limited Proprietary SM; Device/Port oriented limited viewer and some troubleshooting tools

    Voltaire GridVision: Basic monitoring & Troubleshooting; Rich GUI, CLI, SNMP functionality; Voltaire SM; Embedded in Switches

    Questions ?

    Open-MPI Accelerator (OMA)

    Voltaire OMA Benefits

    Accelerating standard, open source Open-MPI

    Significant performance improvement (shmem only)

    More effective when there is more intra-node communications (between cores)

    Depends on the HW (# of cores, # of sockets) and the traffic pattern

    Enhanced documentation

    Open-MPI expertise: RoadRunner and many others

    Works with InfiniBand and Ethernet (iWARP and TCP)

    How Shared Memory is Done Today?

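    [Figure: two CPU sockets (4 CPU cores), each with its own RAM (NUMA cc), a shared memory region in RAM, an HCA/iWARP adapter, and processes #1 and #2; arrows 1 and 2 mark the steps below]

    1. Process #1 writes the data into shmem RAM

    2. Process #2 reads the data from shmem RAM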

    The OMA Way

    [Figure: same two-socket layout (shared memory, RAM per CPU socket, NUMA cc, HCA/iWARP) with processes #1 and #2; arrow 1 marks the step below]

    1. For large messages, the kernel copies data from process #1 directly into process #2 (saving one copy); small messages are handled as today
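
A rough cost model shows what the saved copy means in bytes moved through memory. This is only a sketch of the idea on the two slides: the cut-over size `LARGE_MESSAGE_THRESHOLD` is a hypothetical value chosen for illustration (the slides do not give one), and `bytes_copied` is not an Open MPI or OMA function.

```python
# Hypothetical cut-over between "small" and "large" messages, in bytes.
LARGE_MESSAGE_THRESHOLD = 4096


def bytes_copied(message_bytes: int, use_oma: bool) -> int:
    """Bytes moved through memory to deliver one intra-node message.

    Conventional shared memory: the sender writes into the shared segment and
    the receiver reads out of it, i.e. two copies. With OMA, large messages
    are copied once, directly from the sender to the receiver.
    """
    direct = use_oma and message_bytes >= LARGE_MESSAGE_THRESHOLD
    copies = 1 if direct else 2
    return copies * message_bytes


for size in (256, 1_000_000):
    print(size, bytes_copied(size, use_oma=False), bytes_copied(size, use_oma=True))
```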

    OMA - Fluent Aircraft Benchmark

    [Figure: Fluent Aircraft. x-axis: # of processes (0 to 35); y-axis: Fluent Rating (0 to 800); series: Open MPI with OMA, Open MPI; improvement labels: 9%, 7%, 11%, 25%, 10%]

    * OMA improves Fluent Aircraft Benchmark by up to 25%

    [Figure: eff. bandwidth/proc for alltoall. x-axis: bytes (1 to 10,000,000, log scale); y-axis: MB/s (0 to 3000); series: HP-MPI, MVAPICH2, OPENMPI, OPENMPI+OMA]

    [Figure: pingpong bandwidth. x-axis: bytes (1E+00 to 1E+07, log scale); y-axis: MB/s (0 to 6000); series: HP-MPI, MVAPICH2, OPENMPI, OPENMPI+OMA]

    Questions ?