22
05-10-2017 © Atos Portals4 LND Overview Grégoire Pichon - Atos

Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

Embed Size (px)

Citation preview

Page 1: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

05-10-2017

© Atos

Portals4 LND Overview

Grégoire Pichon - Atos

Page 2: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

New Lustre Network Driver

Page 3: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

▶ communication infrastructure

– between Lustre clients and servers

▶ supports many commonly-used network types

– Infiniband, TCP/IP networks

– allows routing between networks

▶ key features

– high availability

– recovery

▶ Lustre Network Driver

– provides support for a particular network type

– implements LNet-LND api

LNet

LNet

socklnd o2iblnd

Portal RPC LNet

selftest

TCP/IP Infiniband

Lustre Network layer

3

Page 4: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

▶ ptl4lnd

– based on Sandia National Laboratories Portals 4 Network Programming Interface

LU-8932 lnet: define new network driver ptl4lnd

– kernel module: kptl4lnd.ko

– network name: ptlf

– network adapter identified by device number

▶ hardware

– Bull eXascale Interconnect

Portals 4 LND

LNet

socklnd o2iblnd

Portal RPC LNet

selftest

TCP/IP Infiniband

… ptl4lnd

BXI

networks=ptlf0(0),ptlf1(1)

New Lustre Network Driver

4

Page 5: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

Portals 4

▶ Common semantic for MPI and PGAS

– Target memory descriptors: “persistent” (one-sided ops.) or “use once” (two-sided ops.)

– Rich operations library: Put, Get, Swap, FetchAtomic

▶ Full hardware offloading from the host CPU

– Performance is not impacted by heavy load on the host CPU

– Triggered operations for protocol offloading (collectives, etc.)

▶ Non-connected reliable protocol

– No connection establishment time

– Constant memory footprint whatever the number of communicators

▶ Unexpected messages aggregation

– N to 1 communications optimization – single buffer for multiple messages

5

API for high-performance networking

Page 6: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

6

Portals 4 Communication Model

Portals Table

ME ME Portals Table Entry

Match Entry

Memory region

ME NI

Network Interface

MSG

request sent nid / pid portals idx matchbits

matchbits ma

matchbits mb

matchbits mz

Page 7: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

7

Ptl4lnd internals Small data transfer (from node A to node B)

Node A Node B

build ptl4lnd imm. msg with LNet hdr

copy LNet payload into msg IMMEDIATE message

with fixed matchbits

parse LNet hdr

copy payload from msg to LNet destination memory

match a persistent ME

lnd_send()

lnd_recv()

PtlPut() MSG

ME

MSG

persistent ME pre-posted with fixed matchbits

Page 8: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

8

Ptl4lnd internals RDMA data transfer (from node A to node B)

Node A Node B

build ptl4lnd rdma msg LNet hdr & matchbits

get a unique matchbits prepare a use-once ME

RDV message with fixed matchbits

parse LNet hdr

match a persistent ME

lnd_send()

lnd_recv()

PtlPut()

prepare a Memory Descriptor

BULK message with unique matchbits

PtlGet()

match use-once ME remove it from list

MSG unique

matchbits

MD

ME

persistent ME pre-posted with fixed matchbits

ME

MSG

Page 9: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

9

Ptl4lnd internals resource management

Portals Table overflow list

ME ME PTL4_PORTAL start length fixed matchbits persistent

Match Entry

Memory region RX Event Queue

MSG

IMM. message or RDV message

TX

tx descriptor

BULK message

MD Memory

Descriptor

priority list

ME ME start length unique matchbits use-once iovec

TX Event Queue

reception (RX) resources

transmission (TX) resources

ME

TX

tx descriptor

Page 10: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

▶ credits (32) and peercredits (8)

– number of concurrent sends to all and to one peer

– need uniform setting between cluster nodes

– higher value allows higher bandwidth but consumes more resources

▶ checksum (off)

– check integrity of non-bulk messages

▶ nscheds (2)

– number of scheduler threads that handle RX buffer, TX finalization, …

10

Ptl4lnd parameters and tuning

Page 11: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

▶ slots in event queues

– number of reserved slots

– total number of slots

▶ peer status

– nid-pid, state, credits, alive time

▶ statistics

– number of TX of different types

▶ dump TX and RX states

11

Ptl4lnd debug and statistics

# cat /sys/kernel/kptl4lnd/0/stats PUT GET IMM NOOP HELL NAK BULK tx 3 0 23 1 0 0 0

# cat /sys/kernel/kptl4lnd/0/peers <NID>:<PID> | <state> | <credits> | <alive> 24:9 | ACTIVE | 8 / 1 + 7 | 253s 25:9 | ACTIVE | 8 / 3 + 5 | 212s

# cat /sys/kernel/kptl4lnd/0/events tx events: 14 reserved / 65536

Page 12: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

12

Ptl4lnd performance achievements LNet performance

Lustre IEEL 2.7.21 BXI 1.2 NIC – SWITCH - NIC

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

0 256 512 768 1024

Thro

ugh

pu

t (M

iB/s

)

IO size

LNet Selftest – brw

write

read

portals4

Page 13: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

13

Ptl4lnd performance achievements Lustre performance

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

0 2 4 6 8 10 12

Thro

ugh

pu

t (M

iB/s

)

tasks

IOR bandwidth sequential - file-per-process - single client

write

read

Lustre IEEL 2.7.21 BXI 1.2 NIC – SWITCH - NIC

Page 14: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

▶ Finalize LND code

– perform some optimizations

– handle remaining corner cases

▶ Ensure compatibility with recent LNet changes

– currently runs on Lustre IEEL 2.7.21

– build and test on Lustre master

▶ Push code to the Lustre community

14

Ptl4lnd development Next steps

Page 15: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

Development Challenges

Page 16: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

▶ register to LNet

– lnet_register_lnd()

– lnet_unregister_lnd()

▶ provide a lnet_lnd structure

– lnd id (SOCKLND, O2IBLND, LOLND, GNILND, PTL4LND)

– startup/shutdown network communication on the network interface

– send/receive LNet message on the network interface

– notification /query on peer health / aliveness

– control commands

▶ use LNet callbacks

– lnet_parse(), lnet_finalize(), lnet_set-reply_msg_len(), …

16

LNet – LND interface

struct lnet_lnd ptlflnd = { .lnd_type = PTL4LND, .lnd_startup = ptlf_startup, .lnd_shutdown = ptlf_shutdown, .lnd_ctl = ptlf_ctl, .lnd_send = ptlf_send, .lnd_recv = ptlf_recv, .lnd_query = ptlf_query, };

looks simple …

Page 17: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

▶ lack of documentation

– routine semantic

• when should a LND call lnet_notify() ?

• what LNet callbacks should/shall a LND use ?

– parameters description

– call context

17

LNet – LND interface … but still some interrogations

/* Start receiving 'mlen' bytes of payload data, skipping the following * 'rlen' - 'mlen' bytes. 'private' is the 'private' passed to * lnet_parse(). Return non-zero for immedaite failure, otherwise * complete later with lnet_finalize(). This also gives back a receive * credit if the LND does flow control. */ int (*lnd_recv)(struct lnet_ni *ni, void *private, struct lnet_msg *msg, int delayed, unsigned int niov, struct kvec *iov, lnet_kiov_t *kiov, unsigned int offset, unsigned int mlen, unsigned int rlen);

Page 18: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

▶ LNet selftest is not convenient for LNet debugging or LND bringup

– test infrastructure is setup with … LNet messages

– not designed to perform 1 LNetPut() or LNetGet() operation

– cannot test

• specific (exotic) memory region transfers

• error cases (non matching ME, short MD)

▶ development of a LNet test kernel module

– define a pseudo device for ioctl command

• post LNet ME-MD for reception

• launch LNet data transfer

– also post permanent LNet ME-MD

18

Tests and bringup

Page 19: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

Opportunities offered by modern Interconnects

Page 20: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

▶ Atomic operations

– atomically read and update data located in remote memory regions

– host bypass, low latency

– operations

• min, max, sum, prod, or, and, swap, conditional swap, …

– usage

• distributed lock

• object sequence number

– LNetAtomic(), LNetFetchAtomic(), LNetSwap()

20

Network advanced features (1) … that Lustre could benefit

Host NIC

object sequence number

Atomic Increment

Hosts

Page 21: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management

▶ Triggered operations

– setup operations triggered by incoming messages

– host bypass, low latency

– usage:

• tree based reduction

• recovery synchronization

– LNetTriggeredPut(), LNetTriggeredAtomic()

21

Network advanced features (2) … that Lustre could benefit

Host NIC

Hosts N * Atomic Increments

Host NIC

Hosts N * Atomic Increments

Triggered Atomic

Increments

Page 22: Portals4 LND Overview - EOFS€“ network name: ptlf – network adapter identified by device number hardware – Bull eXascale Interconnect Portals 4 LND LNet socklnd o2iblnd Portal

Atos, the Atos logo, Atos Codex, Atos Consulting, Atos Worldgrid, Worldline, BlueKiwi, Bull, Canopy the Open Cloud Company, Unify, Yunano, Zero Email, Zero Email Certified and The Zero Email Company are registered trademarks of the Atos group. September 2017. © 2017 Atos. Confidential information owned by Atos, to be used by the recipient only. This document, or any part of it, may not be reproduced, copied, circulated and/or distributed nor quoted without prior written approval from Atos.

Thanks [email protected]