
Exploiting and modelling topology aware collectives

Brian Van Straalen and Edgar Solomonik

May 5, 2011





Outline

1 Introduction: Dense Linear Algebra

2 2.5D Performance

3 Multicasts
  - Topology-aware multicasts

4 Modelling collectives
  - Assumptions and terminology
  - Model derivations
  - Verification of multicast model

5 Performance scaling
  - Exascale architecture
  - Collectives at exascale
  - 2.5D algorithms at exascale

6 Conclusions and future work





"3D" Matrix Algorithms

Current standard parallel dense matrix algorithms are far from optimal for communication (Demmel)

PDGEMM and other BLAS 3 algorithms in ScaLAPACK use minimal memory (do not replicate the matrix elements).
3D algorithms replicate the elements to achieve the maximum amount of reuse and minimize communication.

While 3D algorithms are communication-optimal, they have an obvious limitation: available memory

2.5D algorithms achieve maximum reuse and minimize communication for any given amount of available memory.
It is possible to create optimal dense algorithms under machine models that have both data movement costs and local memory size limitations.





2.5D matrix multiplication

[Figure: processor-grid layouts for matrix multiplication: 2D (P=16, c=1), 2.5D (P=32, c=2), and 3D (P=64, c=4). Example block product shown: block row A₃₁ A₃₂ A₃₃ A₃₄ times block column B₁₂ B₂₂ B₃₂ B₄₂.]





Example: 2.5D LU factorization

1. Factorize A₀₀ and communicate L₀₀ and U₀₀ among layers.

2. Perform TRSMs to compute a panel of L and a panel of U.

3. Broadcast blocks so all layers own the panels of L and U.

4. Broadcast different subpanels within each layer.

5. Multiply subpanels on each layer.

6. Reduce (sum) the next panels.*

7. Broadcast the panels and continue factorizing the Schur complement...

* All layers always need to contribute to the reduction, even if the iteration is done with a subset of layers.

2.5D algorithms propagate data along dimensions...

[Figure: panels (A) to (D) illustrating the steps, with blocks L₀₀, U₀₀, L₁₀, L₂₀, L₃₀, U₀₁, and U₀₃ labelled.]





Collectives in dense linear algebra

BLAS 3 dense algorithms perform line multicasts and line reductions

These arise naturally due to dimensional dependencies
  - e.g. rank-k updates, Schur complement updates, QR updates

ScaLAPACK algorithms map well to a 2D network torus

We show that novel 2.5D algorithms can map to any torus
  - can exploit specialized line collectives





2.5D MM with different collectives

[Figure: stacked bar chart, "2.5D MM vs 2D MM on BG/P (log2(n^3/P)=37)". Y-axis: time normalized by 2D (0 to 2), split into compute time, idle time, and communication time. Bars: 2D, 2.5D bnm, and 2.5D rect, each for P=2048 and P=16384.]





2.5D LU with different collectives

[Figure: stacked bar chart, "2.5D LU vs 2D LU on BG/P (log2(n^3/P)=37)". Y-axis: time normalized by 2D (0 to 2), split into compute time, idle time, and communication time. Bars: 2D, 2.5D bnm, and 2.5D rect, each for P=2048 and P=16384.]





Topology-aware multicasts

Why the performance discrepancy in multicasts?

Cray machines use binomial multicasts
  - Form spanning tree from a list of nodes
  - Route copies of message down each branch
  - Network contention degrades utilization on a 3D torus

BG/P uses rectangular multicasts
  - Require network topology to be a k-ary n-cube
  - Form 2n edge-disjoint spanning trees
    - Route in different dimensional order
    - Use both directions of bidirectional network






2D rectangular multicasts

[Figure: rectangular multicast on a 2D 4x4 torus with the root marked. Panels: the torus, spanning trees 1 through 4, and all 4 trees combined.]






Another look at that first plot

Just how much better are rectangular algorithms on P = 4096 nodes?

Binomial collectives on XE6
  - 1/30th of link bandwidth

Rectangular collectives on BG/P
  - 4.3X the link bandwidth

Over 120X improvement in efficiency!

[Figure: "1 MB multicast on BG/P, Cray XT5, and Cray XE6". X-axis: #nodes (8 to 4096). Y-axis: bandwidth (MB/sec), ticks from 128 to 8192. Series: BG/P, XE6, XT5.]






A basic model for topology-aware collectives

Aim to model scaling behavior of rectangular collectives

Compare with existing binomial collectives models

Assumptions

  - k-ary n-cube torus network
  - broadcast root at corner of cube
  - message is large (focus on bandwidth)
  - LogP messaging model applies
  - DMA device and wormhole routing are used for messaging






Terminology

L              physical network latency
o              overhead of sending and receiving a message
g              reciprocal of single-link bandwidth
P              number of processors
d              dimension of network torus
B_n = 2d/g     total node link bandwidth
m              message size in bytes
γ              flop rate of a node
β              memory bandwidth of a node






Modelling the communication fabric

For large message sizes, rendezvous messaging is used
  - handshake is performed with an eager send (2o + L)
  - receiver gets the data with a one-sided get (o + 2L)
  - total cost $t_r = m_r \cdot g + 3o + 3L$

The DMA engine can use all torus links simultaneously
  - L costs in each dimension can be overlapped
  - Sender overhead, o, cannot be overlapped
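The rendezvous cost can be written out as a small calculation. The Python sketch below is illustrative only: the function name and the sample parameter values are assumptions, not taken from the slides.

```python
def rendezvous_time(m_r, g, o, L):
    """Cost of one large-message transfer under the rendezvous protocol.

    m_r: message size in bytes
    g:   reciprocal of single-link bandwidth (seconds per byte)
    o:   overhead of sending and receiving a message (seconds)
    L:   physical network latency (seconds)
    """
    handshake = 2 * o + L      # eager send that sets up the transfer
    one_sided_get = o + 2 * L  # receiver pulls the data
    return m_r * g + handshake + one_sided_get


# Hypothetical values, chosen only to exercise the formula.
t = rendezvous_time(m_r=8, g=0.5, o=2.0, L=3.0)
```

The two protocol phases together contribute the fixed 3o + 3L; only the m_r · g term grows with message size.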






A model for rectangular multicasts

$t_{mcast} = m/B_n + 2(d+1) \cdot o + 3L + d \cdot P^{1/d} \cdot (2o + L)$

Our multicast model consists of 3 terms

1 $m/B_n$, the bandwidth cost incurred at the root

2 $2(d+1) \cdot o + 3L$, the start-up overhead of setting up the multicasts in all dimensions

3 $d \cdot P^{1/d} \cdot (2o + L)$, the path overhead, which reflects the time for a packet to get from the root to the farthest destination node
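The three terms can be evaluated directly. The following Python sketch is a minimal transcription of the model, assuming consistent units (bytes and seconds); the function name and the sample values are hypothetical.

```python
def rect_multicast_time(m, P, d, g, o, L):
    """Rectangular multicast time on a d-dimensional torus of P nodes."""
    B_n = 2 * d / g                          # total node link bandwidth
    bandwidth_cost = m / B_n                 # term 1: cost at the root
    startup = 2 * (d + 1) * o + 3 * L        # term 2: multicast set-up
    path = d * P ** (1.0 / d) * (2 * o + L)  # term 3: root to farthest node
    return bandwidth_cost + startup + path


# Hypothetical values: an 8x8x8 torus with unit costs.
t = rect_multicast_time(m=6.0, P=512, d=3, g=1.0, o=1.0, L=1.0)
```

Only the first term depends on m, so for large messages the effective bandwidth m/t approaches B_n in this model.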






A model for rectangular reductions

$t_{red} = \max[m/(8\gamma),\ 3m/\beta,\ m/B_n] + 2(d+1) \cdot o + 3L + d \cdot P^{1/d} \cdot (2o + L)$

Any multicast tree can be inverted to produce a reduction tree

The reduction operator must be applied at each node
  - each node operates on 2m data
  - both the memory bandwidth and computation cost can be overlapped
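A short Python sketch of the reduction model follows. The function name and sample values are hypothetical. The comments on the constants are one plausible reading, not something the slides state: m/(8γ) would correspond to one operation per 8-byte word, and 3m/β to two reads and one write per word.

```python
def rect_reduce_time(m, P, d, g, o, L, gamma, beta):
    """Rectangular reduction time on a d-dimensional torus of P nodes."""
    B_n = 2 * d / g
    compute = m / (8 * gamma)  # assumed: one flop per 8-byte word
    memory = 3 * m / beta      # assumed: read two operands, write one result
    network = m / B_n          # same bandwidth term as the multicast
    # The three costs overlap, so only the slowest one is paid.
    overhead = 2 * (d + 1) * o + 3 * L + d * P ** (1.0 / d) * (2 * o + L)
    return max(compute, memory, network) + overhead


# Hypothetical values for which memory bandwidth is the bottleneck.
t = rect_reduce_time(m=8.0, P=8, d=1, g=1.0, o=1.0, L=1.0, gamma=1.0, beta=3.0)
```

With these inputs the memory term (8) exceeds the network term (4) and the compute term (1), so the reduction is slower than a multicast of the same size.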






A model for binomial multicasts

The root of the binomial tree sends the entire message log2(P) times

The setup overhead is overlapped with the path overhead

We assume no contention

$t_{bnm} = \log_2(P) \cdot (m/B_n + 2o + L)$
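The binomial model can be sketched the same way. The Python below is a direct transcription under the no-contention assumption; the function names and sample values are hypothetical.

```python
import math


def binomial_multicast_time(m, P, d, g, o, L):
    """Binomial multicast time: the root sends the message log2(P) times."""
    B_n = 2 * d / g
    return math.log2(P) * (m / B_n + 2 * o + L)


def effective_bandwidth(m, t):
    """Bytes delivered per unit time for a message of size m."""
    return m / t


# Hypothetical values: 8 nodes, d = 3, unit costs.
t = binomial_multicast_time(m=6.0, P=8, d=3, g=1.0, o=1.0, L=1.0)

# For very large m the overhead terms vanish and the bandwidth
# tends to B_n / log2(P), here 6 / 3 = 2.
bw_large = effective_bandwidth(6e9, binomial_multicast_time(6e9, 8, 3, 1.0, 1.0, 1.0))
```

Because the bandwidth term is multiplied by log2(P), the large-message bandwidth of the binomial model falls as P grows, even before any contention is considered.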






Model verification: one dimension

[Figure: "DCMF Broadcast on a ring of 8 nodes of BG/P". X-axis: msg size (KB), 1 to 262144. Y-axis: bandwidth (MB/sec), 200 to 1000. Series: trect model, DCMF rectangle dput, tbnm model, DCMF binomial.]






Model verification: two dimensions

[Figure: "DCMF Broadcast on 64 (8x8) nodes of BG/P". X-axis: msg size (KB), 1 to 262144. Y-axis: bandwidth (MB/sec), 0 to 2000. Series: trect model, DCMF rectangle dput, tbnm model, DCMF binomial.]






Model verification: three dimensions

[Figure: "DCMF Broadcast on 512 (8x8x8) nodes of BG/P". X-axis: msg size (KB), 1 to 262144. Y-axis: bandwidth (MB/sec), 0 to 3000. Series: trect model, Kumar et al data, DCMF rectangle dput, tbnm model, DCMF binomial.]






Rectangular reduction performance on BG/P

[Figure: "DCMF Reduce peak bandwidth (largest message size)". X-axis: #nodes (8, 64, 512). Y-axis: bandwidth (MB/sec), 50 to 400. Series: torus rectangle ring, torus binomial, torus rectangle, torus short binomial.]

BG/P rectangular reduction performs significantly worse than multicast






A custom 1D ring reduction implementation

Implemented with near-neighbor MPI Sends

  - high overhead due to re-establishment of handshakes
  - large packet size required

Reduction must be software pipelined to achieve overlap

First version achieves overlap via interrupts

Second version achieves overlap via explicitly managed OpenMP threads
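To illustrate why pipelining matters, the sketch below simulates the schedule of a chunked line reduction sequentially in Python. It is not the MPI/OpenMP implementation described on this slide; the function name, the chunking scheme, and the choice of node 0 as the root are assumptions made for illustration.

```python
def pipelined_line_reduce(data, num_chunks):
    """Sum the vectors in data toward node 0, one chunk per hop per step.

    data[i] is the vector held by node i (at least two nodes).
    Returns (result at node 0, number of pipelined time steps).
    """
    p, n = len(data), len(data[0])
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [(c * n) // num_chunks for c in range(num_chunks + 1)]
    partial = [list(v) for v in data]  # running partial sums per node
    steps = num_chunks + p - 2         # fill the pipeline, then drain it
    for t in range(steps):
        for src in range(p - 1, 0, -1):
            # Node src sends chunk c while receiving chunk c + 1,
            # which is the overlap that pipelining provides.
            c = t - (p - 1 - src)
            if 0 <= c < num_chunks:
                for k in range(bounds[c], bounds[c + 1]):
                    partial[src - 1][k] += partial[src][k]
    return partial[0], steps


# Hypothetical input: 3 nodes, 4 elements each, 2 chunks.
result, steps = pipelined_line_reduce(
    [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]], num_chunks=2
)
```

In this toy schedule the reduction finishes in num_chunks + p - 2 chunk-steps. Forwarding the whole message hop by hop would take (p - 1) · num_chunks, so the benefit grows with both the number of nodes and the number of chunks.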






Performance of custom line reduction

[Figure: "Performance of custom Reduce/Multicast on 8 nodes". X-axis: msg size (KB), 512 to 32768. Y-axis: bandwidth (MB/sec), 100 to 800. Series: MPI Broadcast, Custom Ring Multicast, Custom Ring Reduce 2, Custom Ring Reduce 1, MPI Reduce.]






A hypothetical exascale architecture

Total flop rate (γ)                 10^18 flop/s
Total memory                        32 PB
Node count (P)                      262,144
Node interconnect bandwidth (β)     100 GB/s
Latency overhead (o)                250 ns
Network latency (L)                 250 ns
Topology                            3D Torus






Performance of collectives at exascale

[Figure: "Exascale broadcast performance". X-axis: msg size (MB), 1 to 262144. Y-axis: bandwidth (GB/sec), 0 to 140. Series: trect 3D, trect 2D, trect 1D, tbnm 3D, tbnm 2D, tbnm 1D.]






Performance of matrix multiplication at exascale

[Figure: "MM strong scaling at exascale (xy plane to full xyz torus)". X-axis: z dimension of partition (1, 8, 64). Y-axis: parallel efficiency, 0.4 to 1. Series: 2.5D with rectangular (c=z), 2.5D with binomial (c=z), 2D with binomial.]






Performance of LU at exascale

[Figure: "LU strong scaling at exascale (xy plane to full xyz torus)". X-axis: z dimension of partition (1, 8, 64). Y-axis: parallel efficiency, 0 to 1. Series: 2.5D with rectangular (c=z), 2.5D with binomial (c=z), 2D LU with binomial.]






Conclusions and future work

2.5D algorithms perform well by using rectangular collectives
  - implement pivoting into 2.5D LU
  - extend 2.5D algorithms to SUMMA algorithm, QR, etc.
  - incorporate 2.5D algorithms into applications (e.g. quantum chemistry)

Collectives models show the scaling superiority of rectangular algorithms
  - model contention within a multicast tree
  - model contention among concurrent multicast trees

Topology-awareness and rectangular collectives are important at exascale
  - analyze different exascale network topologies


Backup slides


Insight from BLAS 3 communication lower bounds

Given local memory of size M, each processor communicates

$W = \Omega\left(\frac{n^3}{P\sqrt{M}}\right)$ words.

2D algorithms: $M = O(n^2/P)$
  - Standard algorithms (ScaLAPACK)
  - Use single copy of input matrices
  - Move $O(n^2/\sqrt{P})$ words

2.5D algorithms: $M = O(c \cdot n^2/P)$
  - Use c copies of input matrices
  - Move $O(n^2/\sqrt{c \cdot P})$ words

Less communication via redundant memory usage
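The relation between the lower bound and the two word counts can be checked numerically. The Python sketch below drops all constant factors hidden by the asymptotic notation; the function names and the sample sizes are hypothetical.

```python
import math


def lower_bound_words(n, P, M):
    """Per-processor communication lower bound n^3 / (P * sqrt(M))."""
    return n ** 3 / (P * math.sqrt(M))


def words_moved_2d(n, P):
    """Words moved by a 2D algorithm, n^2 / sqrt(P)."""
    return n ** 2 / math.sqrt(P)


def words_moved_25d(n, P, c):
    """Words moved by a 2.5D algorithm with c copies, n^2 / sqrt(c * P)."""
    return n ** 2 / math.sqrt(c * P)


# Hypothetical sizes: n = 1024, P = 64 processors, c = 4 copies.
n, P, c = 1024, 64, 4
w_2d = words_moved_2d(n, P)
w_25d = words_moved_25d(n, P, c)
bound = lower_bound_words(n, P, M=c * n ** 2 / P)
```

Substituting M = c · n²/P into the bound reproduces the 2.5D word count, and the saving over the 2D algorithm is a factor of √c (here 2).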


Virtual topology of 2.5D algorithms

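[Figure: 3D virtual processor grid with axes i and j running from 0 to (P/c)^{1/2} - 1 and axis k running from 0 to c - 1.]

2D algorithm mapping: $\sqrt{P} \times \sqrt{P}$ grid

2.5D algorithm mapping: $\sqrt{P/c} \times \sqrt{P/c} \times c$ grid for any c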
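As a small worked example, the grid shape can be computed from P and c. The Python sketch below uses a hypothetical helper name and assumes that P/c must be a perfect square for the mapping to be exact; it reproduces the three configurations labelled in the 2.5D matrix multiplication figure.

```python
import math


def grid_25d(P, c):
    """Return the (sqrt(P/c), sqrt(P/c), c) virtual grid for P processors."""
    if P % c != 0:
        raise ValueError("c must divide P")
    side = math.isqrt(P // c)  # integer square root (Python 3.8+)
    if side * side != P // c:
        raise ValueError("P/c must be a perfect square")
    return (side, side, c)


grid_2d = grid_25d(16, 1)   # 2D case: c = 1
grid_mid = grid_25d(32, 2)  # 2.5D case
grid_3d = grid_25d(64, 4)   # 3D case: c equals the grid side
```

Setting c = 1 recovers the 2D mapping, and c = P^{1/3} gives the cubic 3D mapping.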


2.5D matrix multiplication performance

[Figure: "2.5D MM on BG/P". X-axis: log(n^3/P) at 25, 31, 37. Y-axis: percentage of peak, 0 to 80. Series: P=2048 2D MM, P=2048 2.5D MM, P=16384 2D MM, P=16384 2.5D MM.]


2.5D LU performance

[Figure: "2.5D LU on BG/P". X-axis: log(n^3/P) at 25, 31, 37. Y-axis: percentage of peak, 0 to 35. Series: P=2048 2D LU, P=2048 2.5D LU, P=16384 2D LU, P=16384 2.5D LU.]


Reduction in total communication time

[Figure: "2.5D vs 2D communication reduction on BG/P". X-axis: log(n^3/P) at 25, 31, 37. Y-axis: reduction in communication time (%), 0 to 100. Series: P=2048 MM, P=2048 LU, P=16384 MM, P=16384 LU.]


Performance analysis

2.5D algorithms map precisely to BG/P partitions

  - allows usage of rectangular collectives
  - reduces network contention

Communication time reduced substantially (up to 92%)
  - mostly due to rectangular collectives

Idle time also reduced in LU
  - communication along critical path reduced

So rectangular collectives can indeed be used by higher-level algorithms

Now let's take a closer look at rectangular collectives on different/future architectures