
Scaling Collective Multicast Fat-tree Networks

Sameer Kumar
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
ICPADS'04

Collective Communication

A communication operation in which all processors, or a large subset of them, participate; broadcast is one example. Collective communication is often a performance impediment. All-to-all communication comes in two forms:
• All-to-all personalized communication (AAPC)
• All-to-all multicast (AAM)

Communication Model

The overhead of a point-to-point message is

T_p2p = α + mβ

where α is the total software overhead of sending the message, β is the per-byte network overhead, and m is the size of the message. The direct all-to-all multicast overhead is then

T_AAM = (P − 1) × (α + mβ)

The α term dominates when m is small; the β term dominates when m is large.
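As a quick illustration of the model, here is a minimal sketch in Python; the numeric values of α and β below are assumptions chosen for the example, not measurements from the talk.

```python
def t_p2p(alpha, beta, m):
    """Point-to-point cost from the slides: T_p2p = alpha + m*beta."""
    return alpha + m * beta

def t_aam_direct(alpha, beta, m, P):
    """Direct all-to-all multicast: each node sends to the other P-1 nodes."""
    return (P - 1) * t_p2p(alpha, beta, m)

alpha = 9e-6           # 9 us software overhead (assumed)
beta = 1 / 320e6       # per-byte cost at 320 MB/s (assumed)
for m in (64, 1 << 20):
    print(f"m = {m} bytes: T_AAM = {t_aam_direct(alpha, beta, m, P=128):.4f} s")
# At m = 64 bytes the alpha term dominates; at 1 MB the beta term does.
```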

Optimization Strategies

• Short messages: the parameter α dominates. Use message combining to reduce the total number of messages, e.g. a multistage algorithm that sends messages along a virtual topology.
• Large messages: the parameter β dominates and network contention becomes the issue. Use network-topology-specific optimizations that minimize contention.

Direct Strategies

Direct strategies optimize all-to-all multicast for large messages. They minimize network contention through topology-specific optimizations that take advantage of contention-free schedules.

Fat-tree Networks

• Popular network topology for clusters
• Bisection bandwidth O(P)
• Network scales to several thousands of nodes
• Topology: k-ary n-tree

k-ary n-trees

[Figure: (a) a 4-ary 1-tree, (b) a 4-ary 2-tree, (c) a 4-ary 3-tree]

Contention Free Permutations

Fat-trees have a nice property: some processor permutations are contention free.
• Prefix permutation by k: processor i sends data to processor (i XOR k).
• Cyclic shift by k: processor i sends a message to processor ((i + k) % P); contention free if k = a·4^j with a ∈ {1, 2, 3} on a 4-ary fat-tree.

Contention-free permutations were presented by Heller et al. for the CM-5.
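A small sketch of the two permutation families in Python; the prefix permutation assumes P is a power of two:

```python
def prefix_permutation(P, k):
    """Processor i sends to i XOR k (assumes P is a power of two)."""
    return [i ^ k for i in range(P)]

def cyclic_shift(P, k):
    """Processor i sends to (i + k) % P."""
    return [(i + k) % P for i in range(P)]

# Both destination maps are permutations: every node receives exactly once.
assert sorted(prefix_permutation(8, 3)) == list(range(8))
assert sorted(cyclic_shift(8, 2)) == list(range(8))
```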

Prefix Permutation 1

Prefix permutation by 1: processor p sends to p XOR 1.
[Figure: the permutation on nodes 0 through 7]

Prefix Permutation 2

Prefix permutation by 2: processor p sends to p XOR 2.
[Figure: the permutation on nodes 0 through 7]

Prefix Permutation 3

Prefix permutation by 3: processor p sends to p XOR 3.
[Figure: the permutation on nodes 0 through 7]

Prefix Permutation 4 …

Prefix permutation by 4: processor p sends to p XOR 4.
[Figure: the permutation on nodes 0 through 7]

Cyclic Shift by k

Cyclic shift by 2: processor p sends to (p + 2) % P.
[Figure: the shift on nodes 0 through 7]

Quadrics: HPC Interconnect

A popular interconnect; several machines in the Top500 use Quadrics, including Pittsburgh's Lemieux (6 TF) and ASCI Q (20 TF). Features:
• Low latency (5 μs for MPI)
• High bandwidth (320 MB/s/node)
• Fat-tree topology
• Scales to 2K nodes

Effect of Contention on Throughput

[Figure: node bandwidth of the k-th permutation (MB/s) vs. k, 0 to 256, for cyclic shift, prefix send, and cyclic shift from main memory]

The bandwidth drops at k = 4, 16, 64. Sending data from main memory is much slower.

Performance Bottlenecks

• 320-byte packet size; the packet protocol restricts bandwidth to faraway nodes
• PCI/DMA bandwidth is restrictive: the achievable bandwidth is only 128 MB/s

Quadrics Packet Protocol

Nearby nodes see full utilization. To send the first packet, the sender sends a header and then the payload; the receiver acks the header; the sender sends the next packet after the first has been acked.

[Figure: sender/receiver packet timeline for nearby nodes]

Faraway Messages

The same exchange with faraway nodes gives low utilization: the ack of the first packet takes longer to return, so the sender idles before it can send the next packet.

[Figure: sender/receiver packet timeline for faraway nodes]
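The utilization loss can be approximated with a simple stop-and-wait model: one 320-byte packet per serialization time plus round trip. A rough sketch; the RTT values are assumptions for illustration, not Quadrics measurements.

```python
PACKET = 320          # bytes per packet (from the slides)
LINK = 320e6          # link bandwidth in bytes/s (from the slides)

def effective_bw(rtt):
    """Stop-and-wait throughput: one packet per (serialization time + RTT)."""
    serialization = PACKET / LINK
    return PACKET / (serialization + rtt)

for rtt in (0.2e-6, 1e-6, 4e-6):      # nearby to faraway nodes (assumed RTTs)
    print(f"rtt = {rtt * 1e6:.1f} us -> {effective_bw(rtt) / 1e6:.0f} MB/s")
```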

AAM on Fat-tree Networks

To overcome these bottlenecks:
• Messages sent from NIC memory have 2.5 times better performance
• Avoid sending messages to faraway nodes
• Use contention-free permutations (a permutation: every processor sends a message to a different destination)

AAM Strategy: Ring

Performs all-to-all multicast by sending messages along a ring formed by the processors; equivalent to P−1 cyclic-shift-by-1 operations. Contention free, and has appeared in the literature before. Drawback: processors send different messages in each step.

[Figure: ring of processors 0, 1, 2, …, i, i+1, …, P−1]

Prefix Send Strategy

P−1 prefix permutations: in stage j, processor i sends a message to processor (i XOR (j+1)). Contention free, and can send messages from Elan memory. However, it performs badly on large fat-trees: it sends P/2 messages to faraway nodes at distance P/2 or more, and wire/switch delays restrict performance.
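A sketch of the stage schedule in Python, assuming P is a power of two:

```python
def prefix_send_schedule(P):
    """Stage j (0-based): processor i sends to i XOR (j + 1)."""
    return [[i ^ (j + 1) for i in range(P)] for j in range(P - 1)]

stages = prefix_send_schedule(8)
print(stages[3])        # XOR distance 4: [4, 5, 6, 7, 0, 1, 2, 3]
# For large P, about half the stages pair nodes at distance >= P/2.
```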

K-Prefix Strategy

A hybrid of the ring and prefix-send strategies: prefix send is used within partitions of size k, and the ring is used between the partitions. Our contribution!

[Figure: processors 0 … P−1 grouped into fat-tree partitions of size k; ring across the partitions, prefix send within each]
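A minimal sketch of one way to enumerate each node's destinations under this scheme, assuming P and k are powers of two and k divides P; the actual step ordering in the paper's implementation may differ.

```python
def k_prefix_destinations(i, P, k):
    """Yield the P-1 destinations of processor i: a ring walk over the
    P/k partitions, with prefix permutations inside each partition."""
    parts = P // k
    p, r = divmod(i, k)                 # i's partition and rank within it
    for s in range(parts):              # ring step s: visit partition (p+s) % parts
        q = (p + s) % parts
        for j in range(k):              # prefix permutation by j within partition q
            dest = q * k + (r ^ j)
            if dest != i:
                yield dest

dests = list(k_prefix_destinations(5, P=16, k=4))
assert sorted(dests) == [d for d in range(16) if d != 5]
```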

Performance

[Figure: collective multicast completion time (ms) vs. message size (bytes) on 128 nodes, log-log, for MPI, prefix-send, k-prefix, and ring]

Node bandwidth (MB/s) each way:

Nodes   MPI   Prefix   K-Prefix
 64     123    260       265
128      99    224       259
144      94     -        261
256      95    215       256

Our strategies send messages from Elan memory.

Cost Equation

• α: host and network software overhead
• α_b: cost of a barrier (barriers are needed to synchronize the nodes)
• β_em: per-byte network transmission cost
• δ: per-byte copying overhead to NIC memory
• P: number of processors
• k: size of the partition in k-Prefix

T_k-prefix = (P − 1)(α + mβ_em) + mδ + (P/k)α_b
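As a sketch, the predicted completion time can be computed directly from this equation. The constants below are the ones reported later for the predicted plot (α = 9 μs, α_b = 15 μs, β_em and δ equivalent to 294 MB/s); the reconstruction of the formula above from the garbled original is itself an assumption.

```python
ALPHA = 9e-6             # host/network software overhead (9 us)
ALPHA_B = 15e-6          # barrier cost (15 us)
BETA_EM = 1 / 294e6      # per-byte network cost at 294 MB/s
DELTA = 1 / 294e6        # per-byte NIC copy cost at 294 MB/s

def t_k_prefix(m, P, k):
    """T = (P-1)(alpha + m*beta_em) + m*delta + (P/k)*alpha_b."""
    return (P - 1) * (ALPHA + m * BETA_EM) + m * DELTA + (P / k) * ALPHA_B

print(t_k_prefix(m=65536, P=128, k=8))   # predicted seconds for one AAM
```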

K-Prefixlb Strategy

The k-Prefixlb strategy synchronizes nodes after a few steps.

[Figure: AAM completion time (ms) vs. message size (bytes) on 128 nodes for MPI, k-prefix, k-prefixlb, and k-prefixlb-cpu]

CPU Overhead

Strategies should also be evaluated on their compute overhead. Asynchronous, non-blocking primitives are needed; a data-driven system like Charm++ supports this automatically.

Predicted vs Actual Performance

[Figure: k-Prefix completion time (ms) vs. message size (bytes) on 128 nodes, measured k-prefix vs. predicted]

The predicted plot assumes α = 9 μs, α_b = 15 μs, and β = δ equivalent to 294 MB/s.

Missing Nodes

• Nodes may be missing in the fat-tree because some nodes are down
• Prefix-Send and k-Prefix do badly in this scenario

Node bandwidth (MB/s) with 1 missing node:

Nodes   MPI   Prefix-Send   K-Prefix
128      72       158          169
240      69        -           173

K-Shift Strategy

Processor i sends data to the consecutive nodes [i−k/2+1, …, i−1, i+1, …, i+k/2] and to i+k. Contention free, with good performance on non-contiguous nodes when k = 8. Our contribution.

[Figure: node i on the ring 0 … P−1 sending to its neighborhood and to i+k]

Node bandwidth (MB/s) with one missing node:

Nodes   K-Shift   K-Prefix
128       196       169
240       197       173

K-Shift gains because most of the destinations of each node do not change in the presence of missing nodes.
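A sketch of the k-shift destination set in Python; indices wrap modulo P and k is assumed even:

```python
def k_shift_destinations(i, P, k=8):
    """Destinations of processor i: the consecutive nodes
    [i-k/2+1, ..., i-1, i+1, ..., i+k/2] plus i+k, all mod P."""
    around = [(i + d) % P for d in range(-k // 2 + 1, k // 2 + 1) if d != 0]
    return around + [(i + k) % P]

print(k_shift_destinations(0, P=128))
# Because the destination set is local, removing one node changes few
# of the remaining nodes' destinations.
```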

Conclusion

• We optimize AAM for Quadrics QsNet
• Copying a message to the NIC and sending it from there gives more bandwidth
• K-Prefix avoids sending messages to faraway nodes
• Missing nodes are handled through the K-Shift strategy
• Cluster interconnects other than Quadrics also have such problems
• Impressive performance results
• CPU overhead should be a metric for evaluating AAM strategies