Matrix multiplication on multidimensional torus networks · 2020. 12. 8. · One-sided MPI communication Charm++ virtualization MPI implementation SD-Cannon parallel implementation

Torus networksAlgorithms

ImplementationPerformanceConclusion

Matrix multiplication on multidimensional torusnetworks

Edgar Solomonik, and James Demmel

University of California, Berkeley

VECPAR, July 2012

Edgar Solomonik Split-Dimensional Cannon’s algorithm 1/ 23



Outline

Torus networksBlueGene architectureCollective communication

AlgorithmsSUMMACannon’s algorithmSD-Cannon’s algorithm

ImplementationOne-sided MPI communicationCharm++ virtualization

Performance

Conclusion




BlueGene architectureCollective communication

BlueGene/P and BlueGene/Q

Direct torus networks

I BG/P is 3D, BG/Q is 5D

I Both are bidirectional networks (6 and 10 links per node)

I Injection bandwidth sufficient to saturate all links

I Topology-aware partition allocation and collectives





Performance of multicast (BG/P vs Cray)

128

256

512

1024

2048

4096

8192

8 64 512 4096

Ban

dwid

th (M

B/s

ec)

#nodes

1 MB multicast on BG/P, Cray XT5, and Cray XE6

BG/PXE6XT5





Why the performance discrepancy in multicasts?

I Cray machines use binomial multicastsI Form spanning tree from a list of nodesI Route copies of message down each branchI Network contention degrades utilization on a 3D torus

I BG/P uses rectangular (pipelined) multicastsI Require network topology to be a k-ary n-cubeI Form 2n edge-disjoint spanning trees

I Route in different dimensional orderI Use both directions of bidirectional network





2D rectangular pipelined multicast trees

root2D 4X4 Torus Spanning tree 1 Spanning tree 2

Spanning tree 3 Spanning tree 4 All 4 trees combined

[Watts and Van De Geijn 95]




SUMMACannon’s algorithmSD-Cannon’s algorithm

Matrix multiplication

A

BA

B

A

B

AB

[Van De Geijn and Watts 97]





SUMMA and LU with rectangular vs binomial collectives

0

10

20

30

40

50

60

70

80

MM LU LU+PVT

Per

cent

age

of m

achi

ne p

eak

Different collectives on BG/P (n=131,072, p=16,384)

binomialrectangular





Cannon’s algorithmA B

Stagger left

A[i,j] := A[i,j+1]

Shift right

A[i,j] := A[i,j-1]

Starting position

Stagger up

B[i,j] := B[i+1,j]

Shift down

B[i,j] := B[i-1,j]

Starting position

...

[Cannon 69]





Cannon’s algorithm

Advantages over SUMMAI Uses only near-neighbor sends rather than multicasts

I lower latency cost

I Can be done in-place given near-neighbor data-swaps

Disadvantages with respect to SUMMA

I Does not generalize well to non-square processor grids

I Cannot exploit multiple links via rectangular multicasts





Split-dimensional Cannon’s algorithm

Improves over Cannon by saturating all network links

I Subdivide whole multiply into 2n block outer products

I Use bidirectional links by shifting half the outer products inopposite direction

I Perform each outer product in a different dimensional order

I Accumulation of outer-products into one buffer allows forsame-sized local multiplications as pure Cannon’s algorithm

I Does not require pipelined multicasts





Split-dimensional Cannon’s algorithm

SD-Cannon on a 3-ary 6-cube

dim 1

dim 2

dim 3

Each circle corresponds to a shift along a dimension

Each color corresponds to an outer product




One-sided MPI communicationCharm++ virtualization

MPI implementation

SD-Cannon parallel implementation using MPI

I All communication done with near-neighbor one-sided puts

I Code is simple (200 lines)

I Limited to square (k-ary n-cube) processor grids





Charm++

Charm++ is an asynchronous dynamic runtime system

I Provides object-based (chares) virtualization (decouples fromprocess grid)

I Message-directed task invocation

I Allows topology-aware task mapping

I Provides additional features such as dynamic load balancingand performance profiling





Virtual topology via Charm++

Implemented Cannon and SD-Cannon in Charm++

I Can map to any torus process topology

I Code more complex, but not significantly so

I Not using one-sided communication (though possible viaCkDirect)

I Virtualization lowers task granularity




2D BlueGene/P performance

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1024 2048 4096 8192 16384

Frac

tion

of th

eore

tical

pea

k

matrix dimension

Performance on 8x8 torus partition of BG/P

SUMMAMPI SD-Cannon

Charm++ SD-CannonMPI Cannon

Charm++ Cannon




3D BlueGene/P performance

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

4096 8192 16384 32768 65536

Frac

tion

of th

eore

tical

pea

k

matrix dimension

Performance on 8x8x8 torus partition of BG/P

SUMMACharm++ Cannon VN

Charm++ SD-Cannon VNCharm++ Cannon SMP

Charm++ SD-Cannon SMP




Preliminary 5D BlueGene/Q performance

12

14

16

18

20

22

24

26

28

30

8192 16384 32768 65536

Tera

flops

matrix dimension

Performance on 2x4x4x2x4 torus partition of BG/Q

SUMMAMPI SD-Cannon

MPI Cannon




K computer Tofu network

T - Torus M - Mesh

M

M

T

TTT

UnitNode

10 links per node / 4 torus dimensions / 2 mesh dimensions of length 2




Tofu network is 5D not 6D

=

A two-by-two mesh is a 1D ring of length 4




SD-Cannon K computer potential

5D K computer torus different from 5D BG/Q

I K computer injection bandwidth can saturate only 4/10 links

SD-Cannon could still be beneficial

I Can saturate 4 links rather than 2 (up to 2X speed-up)

I Does not require pipelined broadcast implementation




Conclusion

SD-Cannon

I Breaches performance gap between Cannon and SUMMA

I Is uniquely asymptotically communication-optimal on a k-aryn-cube

I Virtualization allows general mapping support but incursoverhead

Topology-aware mapping and algorithm design

I Allows zero network contention

I Permits saturation of much more bandwidth on torus networks

I Pervasive for parallel scalability on high-end supercomputers




Finish

Acknowledgements:

I Krell CSGF DOE fellowship (DE-AC02-06CH11357)

I Resources at Argonne National Lab

I Anonymous VECPAR reviewersI Discussions with collaborators

I James Demmel (UC Berkeley)I Jeff Hammond (Argonne National Lab)I Grey Ballard (UC Berkeley)

Also, see my talk on 2.5D algorithms at University of Tokyo

I Tuesday, July 24, 10:00-12:00


Backup slides


Documents

Matrix multiplication on multidimensional torus networks · 2020. 12. 8. · One-sided MPI communication Charm++ virtualization MPI implementation SD-Cannon parallel implementation