Feroz Zahid, Ernst Gunnar Gran, Tor Skeie (Simula Research Laboratory, Norway)
Bartosz Bogdański, Bjørn Dag Johnsen (Oracle Corporation)

PDP 2015, Turku, Finland

March 5, 2015

A weighted fat-tree routing algorithm for efficient load-balancing in InfiniBand clusters

InfiniBand (IB) is a popular interconnect for HPC systems

Source: Top500 Supercomputers List, http://top500.org/

44.8% share in November 2014 top supercomputers list

Network performance in HPC systems depends on three important factors

Routing

Network Topology

Traffic Patterns

Many different topologies are found in real-world clusters: Ring, Kautz, Torus, Clos, Fat-trees

Fat-tree and its variants are very common in IB networks

• k-ary-n-tree
  • n levels, $k^n$ nodes, $n \cdot k^{n-1}$ switches
  • 2k ports on each switch
  • Each switch has an equal number of up and down connections
  • Only half of the ports of the root switches are used

• XGFTs
  • More generalized
  • Allow a different number of up and down connections on switches
  • Also allow a different number of connections at each level

• PGFTs
  • Allow multiple connecting links between switches

• RLFTs
  • Restrictions on PGFTs
  • Switches with the same number of ports at all levels
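As a quick check of the k-ary-n-tree figures listed above, the following sketch computes the node, switch and port counts directly from k and n. The function name `kary_ntree_size` is made up for illustration.

```python
def kary_ntree_size(k: int, n: int) -> dict:
    """Sizes of a k-ary-n-tree, using the formulas from the list above."""
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    return {
        "levels": n,
        "nodes": k ** n,               # k^n end nodes
        "switches": n * k ** (n - 1),  # n * k^(n-1) switches
        "ports_per_switch": 2 * k,     # k down-ports plus k up-ports
    }


if __name__ == "__main__":
    for k, n in [(4, 2), (8, 3), (16, 3)]:
        print(f"{k}-ary-{n}-tree:", kary_ntree_size(k, n))
```

For example, a 4-ary-2-tree has 16 nodes and 8 switches with 8 ports each.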

Fat-trees have nice properties that make them popular

• Maintenance of full-bisection bandwidth
• Easy deadlock-free routing
• Fault tolerance

[Figure: path between nodes A and B, going up and then down the tree]

Routing in IB networks is generally deterministic

Based on linear forwarding tables (LFTs) stored in the switches

Deterministic routing is traffic oblivious!

Routing in fat-tree networks can be source based or destination based, and can be closed form or iterative

• Source-based
  • Out-port for a packet at a switch is based on the source node identifier
• Destination-based
  • Out-port for a packet at a switch is based on the destination node identifier
• Closed form
  • D-mod-K, S-mod-K
• Iterative

    for each leaf switch lf
        for each node connected to lf
            id ← node identifier
            route_downgoing_go_up(id)
            ...
        end for
    end for
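To make the closed-form case concrete, here is a minimal sketch of destination-based D-mod-K up-port selection in a k-ary-n-tree. It assumes the commonly used formulation in which a switch at level l (leaf level = 0) forwards upward on port floor(d / k^l) mod k for destination identifier d. The function name, the 0-based node identifiers and the level numbering are illustrative choices, not taken from the slides.

```python
def dmodk_up_port(dest: int, level: int, k: int) -> int:
    """Up-port index at a switch on `level` (0 = leaf) for destination `dest`.

    Assumed formulation: floor(dest / k**level) mod k.
    """
    if dest < 0 or level < 0 or k < 1:
        raise ValueError("dest and level must be >= 0, k must be >= 1")
    return (dest // k ** level) % k


if __name__ == "__main__":
    k = 4  # 4-ary-2-tree: 16 nodes, 4 nodes per leaf switch
    for dest in range(8):
        print(f"dest {dest}: leaf up-port {dmodk_up_port(dest, 0, k)}")
```

Under this scheme destinations 0 and 4, which sit at the same index on two different leaf switches, select the same up-port. The choice depends only on the identifier, never on how much traffic a destination receives.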

OFED’s fat-tree routing algorithm tends to spread the routes across the tree using counters

Ref: Zahavi, Eitan, et al. "Optimized InfiniBand fat-tree routing for shift all-to-all communication patterns." Concurrency and Computation: Practice and Experience 22.2 (2010): 217-231.

OFED is the de-facto standard software stack for building and deploying IB based applications

• Deterministic
  • High performance, avoids out-of-order packet delivery
• Destination-based
  • Direct realization in IB networks
• Iterative
  • Better route balancing
• Maintains counters on ports
  • A counter is incremented (+1) when a new route is added
• Supports XGFTs, PGFTs, RLFTs
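The counter idea in the list above can be sketched as follows. This is a simplified illustration, not OFED's actual code: the class `PortCounters` and its method names are hypothetical, and breaking ties by lowest port number is an assumption.

```python
class PortCounters:
    """Per-switch route counters for a group of candidate ports (illustrative)."""

    def __init__(self, num_ports: int) -> None:
        if num_ports < 1:
            raise ValueError("need at least one port")
        self.routes = [0] * num_ports  # routes assigned so far, per port

    def pick_least_loaded(self) -> int:
        """Return the port with the fewest routes; lowest index wins ties."""
        return min(range(len(self.routes)), key=lambda p: (self.routes[p], p))

    def add_route(self) -> int:
        """Assign one new route: choose a port and increment its counter (+1)."""
        port = self.pick_least_loaded()
        self.routes[port] += 1
        return port


if __name__ == "__main__":
    up_ports = PortCounters(num_ports=3)
    chosen = [up_ports.add_route() for _ in range(7)]
    print(chosen)           # ports picked for 7 consecutive destinations
    print(up_ports.routes)  # resulting per-port route counts
```

Every route counts as exactly 1 here, regardless of which node it leads to, which is why the balancing is by route count rather than by expected traffic.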

“Multi-stage switches are not cross-bars!”

The effective bisection-bandwidth depends on the traffic pattern

Ref: Hoefler, Torsten, Timo Schneider, and Andrew Lumsdaine. "Multistage switches are not crossbars: Effects of static routing in high-performance networks." Cluster Computing, 2008

Nodes 1 and 4 share the same index position in their leaf switches

We identify two important issues with the fat-tree routing algorithm as implemented by OFED’s subnet manager

• Node traffic oblivious routing
  • All nodes treated equally
  • Node roles ignored
• Non-predictable performance
  • Nodes are routed in an order that depends on the port numbers
  • Port numbering is hard to set
    • Sysadmins do not care about it
    • Addition of new nodes
  • Which nodes share links?
    • Depends on the indexing sequence!

Some nodes tend to receive more traffic than others, so routes towards those nodes are more likely to be congested.

We call these nodes receiver nodes!

Nodes 4 and 5 are more likely to receive traffic, e.g. storage nodes.

648-port fat-tree is a common building block for HPC systems

Result: The probability of index collision for receiver nodes is very high for node-oblivious routing

With 2 receiver nodes per switch, the probability that two receiver nodes share the same index is about 90%!
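The size of this effect can be explored with a small calculation. The sketch below is an illustrative model, not the analysis behind the slide: it assumes 18 node ports per leaf switch (648 = 36 × 18), receiver nodes placed at uniformly random distinct indices on each switch, and independence between switches. It computes the exact probability that receivers on different leaf switches end up on the same index. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def index_collision_probability(ports: int, receivers: int, switches: int) -> Fraction:
    """P(at least two receivers on different leaf switches share an index).

    Assumed model: each switch places `receivers` receiver nodes on distinct
    indices chosen uniformly at random from `ports` node ports, independently
    of the other switches.
    """
    if ports < 1 or switches < 1 or not 0 <= receivers <= ports:
        raise ValueError("invalid parameters")
    no_collision = Fraction(1)
    for i in range(switches):
        free = ports - i * receivers  # indices not yet used by earlier switches
        if free < receivers:
            return Fraction(1)        # pigeonhole: a collision is unavoidable
        no_collision *= Fraction(comb(free, receivers), comb(ports, receivers))
    return 1 - no_collision


if __name__ == "__main__":
    for s in range(2, 11):
        p = index_collision_probability(ports=18, receivers=2, switches=s)
        print(f"{s:2d} switches with 2 receivers each: {float(p):.3f}")
```

Under these assumptions the probability is 11/51 (about 22%) for two switches and reaches 1 as soon as the receivers no longer fit into 18 distinct indices, which happens at 10 switches with 2 receivers each.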

The weighted fat-tree routing algorithm (wFatTree) assigns weights to the nodes

The algorithm is still deterministic!

• All compute nodes are assigned a new parameter
  • Receive weight
• Weights can be assigned based on
  • Known node roles, e.g. storage nodes
  • Known traffic priorities, e.g. following QoS levels
  • Traffic profiling
• Nodes are routed in decreasing order of their weights
  • Not based on port numbering
  • Predictable
• Port selection is based on both
  • Downward weight
  • Upward weight

Port selection in wFatTree uses both downward and upward weights
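A minimal sketch of the weighted idea follows, reduced to a single group of candidate ports. It is not the wFatTree implementation: the `Port` class, the function `assign_ports`, and the rule of comparing ports by downward weight first and upward weight second are assumptions made for illustration, and only the downward weight is updated here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Port:
    index: int
    down_weight: int = 0  # accumulated weight of routes using the port downward
    up_weight: int = 0    # accumulated weight of routes using the port upward


def assign_ports(node_weights: dict[str, int], ports: list[Port]) -> dict[str, int]:
    """Route nodes in decreasing weight order onto the least-weighted port.

    Assumption: ports are compared by (down_weight, up_weight, index).
    """
    if not ports:
        raise ValueError("need at least one port")
    assignment = {}
    # Heaviest receivers first; the node name breaks ties deterministically.
    for node in sorted(node_weights, key=lambda n: (-node_weights[n], n)):
        best = min(ports, key=lambda p: (p.down_weight, p.up_weight, p.index))
        best.down_weight += node_weights[node]
        assignment[node] = best.index
    return assignment


if __name__ == "__main__":
    weights = {"n1": 1, "n2": 1, "n3": 1, "n4": 10, "n5": 10, "n6": 1}
    print(assign_ports(weights, [Port(i) for i in range(3)]))
```

In this example the two heavy receivers n4 and n5 land on different ports, and the four light nodes share the remaining one. The result depends only on the weights, not on port numbering, so it stays deterministic.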

Result: Evaluation on 648-port fat-tree shows substantial improvements in total network bandwidth

[Figure panels: 18 switches with receiver nodes; 27 switches with receiver nodes; all 36 switches with receiver nodes]

Result: wFatTree minimizes the total contention on the links by balancing routes

Result: The wFatTree execution time is competitive with the original fat-tree routing

Topology        No. of End Nodes   Fat Tree Routing   wFatTree Routing
4-ary-2-tree    16                 0.167              0.255
8-ary-2-tree    64                 0.318              0.365
16-ary-2-tree   256                1.686              2.268
8-ary-3-tree    512                16.386             19.657
12-ary-3-tree   1728               188.856            230.639
16-ary-3-tree   4096               1029.369           1434.287
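To put "competitive" into numbers, the short script below computes the ratio of wFatTree routing time to original fat-tree routing time for each row. The values are copied from the table above; the time unit is not stated there, so only the ratios are meaningful. The names `TIMINGS` and `overhead_ratio` are illustrative.

```python
# (topology, end nodes, fat-tree routing time, wFatTree routing time),
# copied from the table above; the time unit is not given in the source.
TIMINGS = [
    ("4-ary-2-tree", 16, 0.167, 0.255),
    ("8-ary-2-tree", 64, 0.318, 0.365),
    ("16-ary-2-tree", 256, 1.686, 2.268),
    ("8-ary-3-tree", 512, 16.386, 19.657),
    ("12-ary-3-tree", 1728, 188.856, 230.639),
    ("16-ary-3-tree", 4096, 1029.369, 1434.287),
]


def overhead_ratio(fat_tree_time: float, wfattree_time: float) -> float:
    """How many times longer wFatTree takes than the original routing."""
    return wfattree_time / fat_tree_time


if __name__ == "__main__":
    for topology, nodes, base, weighted in TIMINGS:
        ratio = overhead_ratio(base, weighted)
        print(f"{topology:14s} {nodes:5d} nodes  ratio {ratio:.2f}")
```

On these figures wFatTree takes roughly 1.15 to 1.53 times as long as the original routing, with no clear growth of the ratio as the topology gets larger.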

Future Work: Enable smart network provisioning – Four important components

Nodes with weights

Balanced Traffic Better Routes

Optimized Algorithms

Smart Routing Reconfiguration Load Balancing Congestion Control

IB Congestion Control

Performance

Adjusting to Load

Optimization

Monitor->Optimize->Execute Loop

Questions?

In summary, weighted fat-tree routing improves actual load-balancing in IB based fat-tree networks

• State-of-the-art fat-tree routing with oblivious path assignment
• The weighted fat-tree routing with better load-balancing