Lustre Interface Bonding

Olaf Weber, Sr. Software Engineer

Lustre Interface Bonding - EOFS…



©2015 SGI

Interface Bonding

• A long-standing wish list item known under a variety of names:
  – Interface bonding
  – Channel bonding
  – Multi-rail
• Fujitsu implemented o2iblnd-level code and made it available to the community
• This proposal is a collaboration between SGI and Intel
• The goal is to land multi-rail support in Lustre


Why Multi-Rail?

SGI® UV™ 300: 32-socket NUMA System

SGI® UV™ 3000: 256-socket NUMA System

Design Constraints

Based on feedback for the Fujitsu code

• Mixed-version clusters
• Simple configuration
• Adaptable
• LNet-level implementation

Example Lustre Cluster

[Diagram. Labels: MGS, MDS, OSS, Client, UV, MGT, MDT, OST]

Mono-rail Single Fabric

[Diagram. Labels: MGS, MDS, OSS, Client, UV, MGT, MDT, OST; network: o2ib0]

LNets in a Single Fabric

[Diagram. Labels: MGS, MDS, OSS, Client, UV, MGT, MDT, OST; networks: o2ib0, o2ib1, o2ib2, o2ib3]

Multi-rail Single Fabric

[Diagram. Labels: MGS, MDS, OSS, Client, UV, MGT, MDT, OST; network: o2ib0]

Multi-rail Dual Fabric

[Diagram. Labels: MGS, MDS, OSS, Client, UV, MGT, MDT, OST; networks: o2ib0, o2ib1]

Mixed-Version Clusters

• A Single Multi-Rail Node
• Peer Version Discovery

A Single Multi-Rail Node

[Diagram. Labels: MGS, MDS, OSS, Client, UV, MGT, MDT, OST; network: o2ib0]

Peer Version Discovery

• There is no LNet end-to-end versioning
• LND versioning does not work across LNet routers
• The LNet ping protocol can be used.

A set bit in lnet_ping_info_t::pi_features indicates multi-rail capability.
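
The feature-bit check can be sketched as follows, in Python rather than LNet's C. Only the field name pi_features comes from the slide; the bit value FEATURE_MULTI_RAIL and the PingInfo class are hypothetical stand-ins, not the real LNet definitions.

```python
from dataclasses import dataclass

# Hypothetical bit position; the real value is defined by LNet, not by this sketch.
FEATURE_MULTI_RAIL = 1 << 2


@dataclass
class PingInfo:
    """Stand-in for the part of lnet_ping_info_t used here."""
    pi_features: int = 0


def is_multi_rail(info: PingInfo) -> bool:
    """Return True when the peer's ping reply has the multi-rail bit set."""
    return bool(info.pi_features & FEATURE_MULTI_RAIL)


# A downrev peer never sets the bit; an uprev peer adds it to its other features.
old_peer = PingInfo(pi_features=0b0011)
new_peer = PingInfo(pi_features=0b0011 | FEATURE_MULTI_RAIL)
print(is_multi_rail(old_peer), is_multi_rail(new_peer))
```

Because a downrev peer leaves unknown bits clear, the absence of the bit is enough to fall back to mono-rail behaviour for that peer.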

Peer Version Discovery

A simple version discovery protocol:

1. LNet keeps track of all known peers
2. On first communication, do an LNet ping
3. The node now knows the peer version

The ping reply also contains a list of the interfaces of the peer. Can we use that?
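
The three steps can be sketched as a small peer table that pings each peer once, on first contact. This is an illustrative model only: PeerTable, PingReply, fake_ping, and the example NIDs are hypothetical names and values, and the ping itself is replaced by a callback.

```python
from typing import Callable, Dict, List, NamedTuple


class PingReply(NamedTuple):
    multi_rail: bool        # derived from the feature bit in the reply
    interfaces: List[str]   # interface NIDs listed in the ping reply


class PeerTable:
    """Tracks known peers (step 1) and pings each one on first contact only."""

    def __init__(self, ping: Callable[[str], PingReply]) -> None:
        self._ping = ping
        self._peers: Dict[str, PingReply] = {}

    def lookup(self, nid: str) -> PingReply:
        if nid not in self._peers:       # step 2: first communication
            self._peers[nid] = self._ping(nid)
        return self._peers[nid]          # step 3: the version is now known


pings_sent = []


def fake_ping(nid: str) -> PingReply:
    """Test double that records each ping instead of using the network."""
    pings_sent.append(nid)
    return PingReply(multi_rail=True, interfaces=[nid, "10.0.1.5@o2ib"])


table = PeerTable(fake_ping)
table.lookup("10.0.0.5@o2ib")
table.lookup("10.0.0.5@o2ib")   # served from the table, no second ping
```

The interface list arrives in the same reply as the version information, which is what makes it a candidate for interface discovery as well.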

Easy Configuration

• Peer Interface Discovery
• Configuring Interfaces on a Node
• Dynamic Configuration

Peer Interface Discovery

• Peer Version Discovery gives a node the list of the peer’s interfaces.
• For the simple cases, this is all a node needs to know about a peer.
• The peer also needs to know the node’s interfaces.

Push the node’s interface list to the peer.

Peer Interface Discovery

The push would be like an LNet ping

• Except LNet ping uses LNetGet()
• The push uses LNetPut()

The push should be safe:

• A downrev LNet router can forward a push
• A downrev peer returns “protocol error”
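
The difference between the pull-style ping and the put-style push, and the way a downrev peer rejects the push, can be modelled roughly as below. The Peer class, its method names, and ProtocolError are hypothetical; they only mimic the direction of data flow of LNetGet() and LNetPut() and do not call LNet.

```python
class ProtocolError(Exception):
    """Models the "protocol error" a downrev peer returns for a push."""


class Peer:
    def __init__(self, interfaces, multi_rail):
        self.interfaces = list(interfaces)
        self.multi_rail = multi_rail
        self.known_remote = {}   # sender name -> interface list pushed to us

    def handle_get(self):
        # Ping: the requester pulls our interface list (LNetGet()-style).
        return list(self.interfaces)

    def handle_put(self, sender, interfaces):
        # Push: the sender delivers its own interface list (LNetPut()-style).
        if not self.multi_rail:
            raise ProtocolError("push not supported")
        self.known_remote[sender] = list(interfaces)


def push_interfaces(sender, local_interfaces, peer):
    """Return True if the peer accepted the push, False if it is downrev."""
    try:
        peer.handle_put(sender, local_interfaces)
        return True
    except ProtocolError:
        return False


uprev = Peer(["ib0", "ib1"], multi_rail=True)
downrev = Peer(["ib0"], multi_rail=False)
accepted = push_interfaces("client", ["ib0", "ib1"], uprev)
rejected = push_interfaces("client", ["ib0", "ib1"], downrev)
```

A rejected push leaves the downrev peer's state untouched, so the sender can simply treat that peer as mono-rail.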

Configuring Interfaces on a Node

How does a node know its own interfaces? Similar to current methods.

• LNet module options line
  – networks=o2ib(ib0,ib1)
  – networks=o2ib(ib0[2],ib1[6])[2,6]
• DLC uses the same syntax
  – It uses the same in-kernel parser
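
To make the two example values concrete, here is a rough user-space parser for a single-network networks= value. It is not the in-kernel parser and handles only the two shapes shown on the slide. The slide does not explain the bracketed numbers, so the sketch keeps them as opaque integer lists; parse_network and its return shape are assumptions.

```python
import re

# One network: name(interfaces)[numbers]; the trailing bracket is optional.
_NET_RE = re.compile(r"^(\w+)\(([^)]*)\)(?:\[(\d+(?:,\d+)*)\])?$")
# One interface inside the parentheses: name[numbers]; the bracket is optional.
_IFACE_RE = re.compile(r"(\w+)(?:\[(\d+(?:,\d+)*)\])?")


def _numbers(text):
    """Turn "2,6" into [2, 6]; an absent bracket gives an empty list."""
    return [int(n) for n in text.split(",")] if text else []


def parse_network(spec):
    """Parse a value such as "o2ib(ib0[2],ib1[6])[2,6]".

    Returns (network name, {interface: numbers}, network-level numbers).
    """
    match = _NET_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"unrecognised networks value: {spec!r}")
    net, iface_text, net_numbers = match.groups()
    interfaces = {name: _numbers(nums)
                  for name, nums in _IFACE_RE.findall(iface_text)}
    return net, interfaces, _numbers(net_numbers)


print(parse_network("o2ib(ib0,ib1)"))
print(parse_network("o2ib(ib0[2],ib1[6])[2,6]"))
```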

Configuring Interfaces on a Node

What about credits?

• Credits are assigned per interface.
• This applies to both local and peer credits.
• More interfaces – more credits.
• The defaults of tunables are unchanged.
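
A small worked example of "more interfaces, more credits": with per-interface pools and unchanged tunables, the node's aggregate credits scale with the number of interfaces. The Interface class and the values 256 and 8 are illustrative assumptions, not values taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Interface:
    name: str
    credits: int        # local credits for this interface (illustrative value)
    peer_credits: int   # credits per peer on this interface (illustrative value)


def total_credits(interfaces):
    """Aggregate local credits across all configured interfaces."""
    return sum(ni.credits for ni in interfaces)


def total_peer_credits(interfaces):
    """Aggregate credits towards one peer reachable over every interface."""
    return sum(ni.peer_credits for ni in interfaces)


# The same tunables applied to one interface and then to two.
mono_rail = [Interface("ib0", credits=256, peer_credits=8)]
multi_rail = mono_rail + [Interface("ib1", credits=256, peer_credits=8)]
print(total_credits(mono_rail), total_credits(multi_rail))
```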

Dynamic Configuration

Adding an interface:

1. Enable new interface
2. Push updated interface list to peers

Removing an interface:

1. Push updated interface list to peers
2. Disable existing interface
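
The ordering in both procedures keeps peers from ever being told about an interface that is not enabled. A minimal model of that ordering follows; the Node class, its fields, and the event log are hypothetical and exist only to make the sequence visible.

```python
class Node:
    def __init__(self, interfaces):
        self.active = list(interfaces)      # interfaces currently enabled
        self.peer_view = list(interfaces)   # what peers were last told
        self.log = []                       # ordered record of actions

    def _push(self, interfaces):
        self.peer_view = list(interfaces)
        self.log.append(("push", tuple(interfaces)))

    def add_interface(self, name):
        self.active.append(name)            # 1. enable new interface
        self.log.append(("enable", name))
        self._push(self.active)             # 2. push updated list to peers

    def remove_interface(self, name):
        remaining = [ni for ni in self.active if ni != name]
        self._push(remaining)               # 1. push updated list to peers
        self.active = remaining             # 2. disable existing interface
        self.log.append(("disable", name))


node = Node(["ib0"])
node.add_interface("ib1")
node.remove_interface("ib0")
```

In both directions the list held by peers is a subset of the enabled interfaces at every step.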

Adaptable

• Interface Selection
• Extended Routing
• Additional Considerations

Interface Selection

Select a local-peer interface pair for sending.

• Direct connection preferred
• LNet network type (anything but TCP)
• NUMA criteria
  – Memory locality
  – Process locality
• Local credits
• Peer credits
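
One way to combine these criteria is a lexicographic ranking over candidate pairs, sketched below. The slide lists the criteria but not their precedence or weighting, so the ordering here, the Candidate fields, and the single numa_distance value that stands in for both memory and process locality are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    local: str            # local interface name
    remote: str           # peer interface name
    direct: bool          # True when no router is needed
    net_type: str         # e.g. "o2ib" or "tcp"
    numa_distance: int    # smaller means closer to the memory and process
    local_credits: int    # credits available on the local interface
    peer_credits: int     # credits available towards the peer interface


def rank(c):
    """Higher tuples win; criteria are compared left to right."""
    return (c.direct, c.net_type != "tcp", -c.numa_distance,
            c.local_credits, c.peer_credits)


def select_pair(candidates):
    """Return the best-ranked local-peer pair, or None if there is none."""
    return max(candidates, key=rank, default=None)


a = Candidate("eth0", "eth0", True, "tcp", 0, 8, 8)
b = Candidate("ib0", "ib0", True, "o2ib", 2, 8, 8)
c = Candidate("ib1", "ib1", True, "o2ib", 0, 4, 8)
best = select_pair([a, b, c])
```

With this ordering, the NUMA-local o2ib pair wins over a more distant o2ib pair that has more credits, and over the TCP pair.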

Routing Enhancements

Fabrics can have a complicated topology.

• Preferred point-to-point connections within an LNet
• Prefer one LNet over another (for a subset of its NIDs)

Extra Considerations

Try to be NUMA friendly.

• Nodes do not know each other’s topology
• An RPC is a request-response pair.
• Remember origin interface of request
• Prefer origin interface for response
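
The request-response rule can be sketched as follows: record the interface a request arrived on and prefer it for the reply. The Request and Responder classes are hypothetical, and the fallback to the first remaining interface is an assumption added so the example stays total.

```python
from dataclasses import dataclass


@dataclass
class Request:
    xid: int          # request identifier (illustrative)
    arrived_on: str   # local interface the request was received on


class Responder:
    def __init__(self, interfaces):
        self.interfaces = list(interfaces)

    def pick_for_response(self, request):
        """Prefer the interface the request arrived on, if still present."""
        if request.arrived_on in self.interfaces:
            return request.arrived_on
        # Assumed fallback: any remaining interface (here, the first one).
        return self.interfaces[0]


server = Responder(["ib0", "ib1"])
req = Request(xid=1, arrived_on="ib1")
preferred = server.pick_for_response(req)

# If the origin interface has since been removed, fall back.
server.interfaces.remove("ib1")
fallback = server.pick_for_response(req)
```

The requester chose its interface with knowledge of its own topology, so replying on the same path respects that choice without the responder needing to know the requester's NUMA layout.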

Extra Considerations

On a send failure, a message can be resent on another local-remote interface pair, until all possibilities have been exhausted.

This adds some extra resiliency for network failures to Lustre.
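
The resend behaviour can be sketched as a loop over local-remote pairs that stops at the first success. The order in which pairs are tried and the assumption that every local interface can reach every remote interface are simplifications; send_with_failover and try_send are hypothetical names.

```python
from itertools import product


def send_with_failover(local_ifaces, remote_ifaces, try_send):
    """Try each local-remote pair until one send succeeds.

    Returns (successful pair or None, list of pairs attempted).
    """
    attempts = []
    for pair in product(local_ifaces, remote_ifaces):
        attempts.append(pair)
        if try_send(*pair):
            return pair, attempts
    return None, attempts   # all possibilities exhausted


# Simulate a failed local interface ib0: every pair that uses it fails.
broken = {("ib0", "ib0"), ("ib0", "ib1")}


def flaky_send(local, remote):
    return (local, remote) not in broken


used, tried = send_with_failover(["ib0", "ib1"], ["ib0", "ib1"], flaky_send)
none_left, all_tried = send_with_failover(["ib0"], ["ib0", "ib1"], flaky_send)
```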

Extra Considerations

Node failure introduces some corner cases.

• Reboot with downrev software
  – Upper layers (ptlrpc) do detect node failure
  – They can inform LNet so it can reset its state
• NID reuse by a different node
  – Node identity is now separate from NID
  – Special NIDs on the loopback network?

LNet-level Implementation

• Implementation Notes

Implementation Notes

Data structure changes

• Split lnet_ni into lnet_net and lnet_ni
• Split lnet_peer into lnet_peer and lnet_peerni
• Track preferred routes in lnet_net
• Track preferred lnet_ni in lnet_peerni (derived from the routes info in lnet_net)

Implementation Notes

Use LNET_NID_ANY for the self parameter of LNetGet() and LNetPut() when sending an RPC request. This tells LNet to use whichever local-remote interface pair it deems most suitable.

Use the originator NID when sending an RPC response. This tells LNet that this particular local-remote pair is strongly preferred.

Implementation Notes

The Memory Descriptor can be extended with NUMA hints, to give LNet NUMA-specific information in selecting a suitable local interface.

Then LNet can select a remote interface for the peer that can be reached from the local interface.

Implementation Notes

1. Split lnet_ni
2. Local interface selection
3. Split lnet_peer
4. Ping on connect
5. Implement push
6. Peer interface selection
7. Resending on failure
8. Routing enhancements


Feedback & Discussion

Q&A

[email protected]
