
Matrix multiplication on multidimensional torus networks

Edgar Solomonik and James Demmel

University of California, Berkeley

VECPAR, July 2012


Outline

- Torus networks
  - BlueGene architecture
  - Collective communication
- Algorithms
  - SUMMA
  - Cannon’s algorithm
  - SD-Cannon’s algorithm
- Implementation
  - One-sided MPI communication
  - Charm++ virtualization
- Performance
- Conclusion


BlueGene/P and BlueGene/Q

Direct torus networks

- BG/P is 3D, BG/Q is 5D
- Both are bidirectional networks (6 and 10 links per node)
- Injection bandwidth sufficient to saturate all links
- Topology-aware partition allocation and collectives (see the torus-communicator sketch below)
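Not from the slides: to make the topology-awareness point concrete, here is a minimal sketch of requesting a periodic (torus) process grid through the standard MPI Cartesian-topology interface. The 3D shape and the neighbor printout are illustrative assumptions, not the machine's actual allocator.

```cpp
// Sketch: build a periodic 3D process grid (a torus) with MPI_Cart_create and
// look up the two neighbors along each dimension. The 3D shape is an
// assumption for illustration; BG/Q would use 5 dimensions.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int p, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &p);

  int dims[3] = {0, 0, 0};
  MPI_Dims_create(p, 3, dims);          // factor p into a 3D grid
  int periods[3] = {1, 1, 1};           // periodic in every dimension -> torus

  MPI_Comm torus;
  MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /*reorder=*/1, &torus);
  MPI_Comm_rank(torus, &rank);

  for (int d = 0; d < 3; ++d) {
    int minus, plus;                    // neighbors in the -1 and +1 direction of dim d
    MPI_Cart_shift(torus, d, 1, &minus, &plus);
    printf("rank %d, dim %d: neighbors %d (-) and %d (+)\n", rank, d, minus, plus);
  }

  MPI_Comm_free(&torus);
  MPI_Finalize();
  return 0;
}
```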


Performance of multicast (BG/P vs Cray)

[Figure: bandwidth (MB/sec) of a 1 MB multicast versus number of nodes (8 to 4096) on BG/P, Cray XT5, and Cray XE6]


Why the performance discrepancy in multicasts?

- Cray machines use binomial multicasts
  - Form a spanning tree from a list of nodes
  - Route copies of the message down each branch
  - Network contention degrades utilization on a 3D torus
- BG/P uses rectangular (pipelined) multicasts (see the enumeration sketch below)
  - Require the network topology to be a k-ary n-cube
  - Form 2n edge-disjoint spanning trees
  - Route in different dimensional orders
  - Use both directions of the bidirectional network
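Not in the slides: one way to think about the 2n edge-disjoint trees is as 2n routing schedules, each visiting the dimensions in a rotated order, with half of them using the opposite link direction. The sketch below only enumerates such schedules for an assumed 3D torus; it is an illustration of the idea, not BG/P's actual tree construction.

```cpp
// Sketch: enumerate 2n routing schedules for a rectangular multicast on an
// n-dimensional torus. Tree t routes the dimensions in an order rotated by t,
// and the second n trees travel in the negative link direction.
#include <cstdio>

int main() {
  const int n = 3;                       // torus dimensionality (assumed 3D, like BG/P)
  for (int t = 0; t < 2 * n; ++t) {
    int dir = (t < n) ? +1 : -1;         // half of the trees use the opposite direction
    int first = t % n;                   // rotate which dimension is routed first
    printf("tree %d: direction %+d, dimension order:", t, dir);
    for (int i = 0; i < n; ++i)
      printf(" %d", (first + i) % n);
    printf("\n");
  }
  return 0;
}
```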


2D rectangular pipelined multicast trees

[Figure: a 2D 4x4 torus with a multicast root; spanning trees 1-4 shown separately and all 4 trees combined]

[Watts and Van De Geijn 95]


Matrix multiplication

[Figure: blocked matrix multiplication — panels of A and B are multiplied and accumulated into AB]

[Van De Geijn and Watts 97]
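To make the SUMMA comparison concrete, here is a minimal sketch of the algorithm of van de Geijn and Watts on a q x q process grid: at step k, the owner of the k-th block column of A broadcasts along its process row, the owner of the k-th block row of B broadcasts along its process column, and every rank accumulates a local product. The block size, the deterministic inputs, and the naive local GEMM are assumptions for a self-contained example; a real code would call DGEMM.

```cpp
// SUMMA sketch on a q x q grid of MPI ranks (p = q*q assumed to be a perfect
// square). Each rank owns one b x b block of A, B, and C.
#include <mpi.h>
#include <cmath>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int p, rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  int q = (int)std::lround(std::sqrt((double)p));
  int row = rank / q, col = rank % q;
  const int b = 64;                                // local block dimension (assumption)

  std::vector<double> A(b * b), B(b * b), C(b * b, 0.0), Abuf(b * b), Bbuf(b * b);
  for (int i = 0; i < b * b; ++i) { A[i] = 0.5 + i % 7; B[i] = 1.0 + i % 5; }

  MPI_Comm rowc, colc;                             // my process row and process column
  MPI_Comm_split(MPI_COMM_WORLD, row, col, &rowc);
  MPI_Comm_split(MPI_COMM_WORLD, col, row, &colc);

  for (int k = 0; k < q; ++k) {
    if (col == k) Abuf = A;                        // owner of the k-th block column of A
    if (row == k) Bbuf = B;                        // owner of the k-th block row of B
    MPI_Bcast(Abuf.data(), b * b, MPI_DOUBLE, k, rowc);   // multicast along my row
    MPI_Bcast(Bbuf.data(), b * b, MPI_DOUBLE, k, colc);   // multicast along my column
    for (int i = 0; i < b; ++i)                    // C += Abuf * Bbuf (naive local GEMM)
      for (int l = 0; l < b; ++l)
        for (int j = 0; j < b; ++j)
          C[i * b + j] += Abuf[i * b + l] * Bbuf[l * b + j];
  }

  MPI_Comm_free(&rowc);
  MPI_Comm_free(&colc);
  MPI_Finalize();
  return 0;
}
```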


SUMMA and LU with rectangular vs binomial collectives

[Figure: percentage of machine peak achieved by MM, LU, and LU+PVT on BG/P (n=131,072, p=16,384) with binomial versus rectangular collectives]


Cannon’s algorithm

[Figure: Cannon’s algorithm on a 2D processor grid — from the starting position, A is staggered left (A[i,j] := A[i,j+1]) and then shifted right one step at a time (A[i,j] := A[i,j-1]); B is staggered up (B[i,j] := B[i+1,j]) and then shifted down (B[i,j] := B[i-1,j])]

[Cannon 69]
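For reference, here is a textbook-style sketch of Cannon's algorithm on a 2D periodic MPI Cartesian grid: it skews A by the row index and B by the column index and then keeps shifting one hop per step, which is equivalent (up to the direction of the per-step shift) to the stagger-then-shift-back sequence in the figure. Block size and input values are illustrative assumptions.

```cpp
// Cannon's algorithm sketch on a q x q torus of MPI ranks (p = q*q assumed).
// Skew A left by the row index and B up by the column index, then do q rounds
// of local multiply followed by a single-hop shift of A and B.
#include <mpi.h>
#include <cmath>
#include <vector>

static void local_gemm(const std::vector<double> &A, const std::vector<double> &B,
                       std::vector<double> &C, int b) {
  for (int i = 0; i < b; ++i)
    for (int l = 0; l < b; ++l)
      for (int j = 0; j < b; ++j)
        C[i * b + j] += A[i * b + l] * B[l * b + j];
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int p, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  int q = (int)std::lround(std::sqrt((double)p));
  int dims[2] = {q, q}, periods[2] = {1, 1}, coords[2];
  MPI_Comm grid;
  MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
  MPI_Comm_rank(grid, &rank);
  MPI_Cart_coords(grid, rank, 2, coords);

  const int b = 64;                                // local block dimension (assumption)
  std::vector<double> A(b * b, 1.0), B(b * b, 1.0), C(b * b, 0.0);

  int src, dst;
  // Initial skew: A moves left (negative column direction) by my row index,
  // B moves up (negative row direction) by my column index.
  MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
  MPI_Sendrecv_replace(A.data(), b * b, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
  MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
  MPI_Sendrecv_replace(B.data(), b * b, MPI_DOUBLE, dst, 1, src, 1, grid, MPI_STATUS_IGNORE);

  for (int step = 0; step < q; ++step) {
    local_gemm(A, B, C, b);                        // near-neighbor shifts, no multicasts
    MPI_Cart_shift(grid, 1, -1, &src, &dst);       // shift A one hop along the columns
    MPI_Sendrecv_replace(A.data(), b * b, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -1, &src, &dst);       // shift B one hop along the rows
    MPI_Sendrecv_replace(B.data(), b * b, MPI_DOUBLE, dst, 1, src, 1, grid, MPI_STATUS_IGNORE);
  }

  MPI_Comm_free(&grid);
  MPI_Finalize();
  return 0;
}
```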


Cannon’s algorithm

Advantages over SUMMA

- Uses only near-neighbor sends rather than multicasts
  - lower latency cost
- Can be done in-place given near-neighbor data-swaps

Disadvantages with respect to SUMMA

- Does not generalize well to non-square processor grids
- Cannot exploit multiple links via rectangular multicasts


Split-dimensional Cannon’s algorithm

Improves over Cannon by saturating all network links

- Subdivide the whole multiply into 2n block outer products
- Use bidirectional links by shifting half the outer products in the opposite direction (see the 1D sketch below)
- Perform each outer product in a different dimensional order
- Accumulation of outer products into one buffer allows for same-sized local multiplications as pure Cannon’s algorithm
- Does not require pipelined multicasts
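The following is not the full SD-Cannon algorithm, only a minimal 1D sketch of its central trick under assumed sizes: the local panel is split into two halves that shift in opposite directions in the same round, so both directions of a bidirectional torus link carry data at once. A real implementation does this along every torus dimension and accumulates the resulting outer products into one buffer.

```cpp
// Sketch of the bidirectional-shift idea on one periodic dimension: two
// half-panels travel in opposite directions simultaneously, keeping both
// links of the bidirectional dimension busy. Sizes are illustrative.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int p, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  int dims[1] = {p}, periods[1] = {1};
  MPI_Comm ring;
  MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1, &ring);
  MPI_Comm_rank(ring, &rank);
  int left, right;
  MPI_Cart_shift(ring, 0, 1, &left, &right);       // -1 and +1 neighbors on the ring

  const int half = 32 * 32;                        // elements per half-panel (assumption)
  std::vector<double> plus_half(half, +1.0 * rank), minus_half(half, -1.0 * rank);
  std::vector<double> recv_plus(half), recv_minus(half);

  for (int step = 0; step < p; ++step) {
    MPI_Request req[4];
    // One half goes to the +1 neighbor while the other goes to the -1 neighbor,
    // so both directions of the link are in flight in the same round.
    MPI_Irecv(recv_plus.data(),  half, MPI_DOUBLE, left,  0, ring, &req[0]);
    MPI_Irecv(recv_minus.data(), half, MPI_DOUBLE, right, 1, ring, &req[1]);
    MPI_Isend(plus_half.data(),  half, MPI_DOUBLE, right, 0, ring, &req[2]);
    MPI_Isend(minus_half.data(), half, MPI_DOUBLE, left,  1, ring, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    plus_half.swap(recv_plus);
    minus_half.swap(recv_minus);
    // A real SD-Cannon step would multiply each half against the matching
    // panel of the other operand and accumulate into the output block here.
  }

  MPI_Comm_free(&ring);
  MPI_Finalize();
  return 0;
}
```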


Split-dimensional Cannon’s algorithm

SD-Cannon on a 3-ary 6-cube

[Figure: SD-Cannon shift schedule along dim 1, dim 2, and dim 3]

Each circle corresponds to a shift along a dimension

Each color corresponds to an outer product


MPI implementation

SD-Cannon parallel implementation using MPI

- All communication is done with near-neighbor one-sided puts (see the put/fence sketch below)
- Code is simple (200 lines)
- Limited to square (k-ary n-cube) processor grids
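Below is a minimal sketch of the one-sided pattern (window creation, MPI_Put to a torus neighbor, fence synchronization), not the talk's actual 200-line implementation; the ring shape and block size are assumptions.

```cpp
// Sketch: expose a receive buffer as an MPI window and push a block to the +1
// neighbor on a periodic ring with a one-sided MPI_Put, using fences for
// synchronization.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int p, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  int dims[1] = {p}, periods[1] = {1};
  MPI_Comm ring;
  MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1, &ring);
  MPI_Comm_rank(ring, &rank);
  int left, right;
  MPI_Cart_shift(ring, 0, 1, &left, &right);

  const int n = 1024;                              // block size in doubles (assumption)
  std::vector<double> sendbuf(n, (double)rank), recvbuf(n, 0.0);

  MPI_Win win;
  MPI_Win_create(recvbuf.data(), n * sizeof(double), sizeof(double),
                 MPI_INFO_NULL, ring, &win);

  MPI_Win_fence(0, win);                           // open an access/exposure epoch
  MPI_Put(sendbuf.data(), n, MPI_DOUBLE,           // put my block directly into the
          right, 0, n, MPI_DOUBLE, win);           // +1 neighbor's window at offset 0
  MPI_Win_fence(0, win);                           // close the epoch; recvbuf now holds
                                                   // the block from my -1 neighbor
  MPI_Win_free(&win);
  MPI_Comm_free(&ring);
  MPI_Finalize();
  return 0;
}
```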


Charm++

Charm++ is an asynchronous dynamic runtime system

- Provides object-based (chare) virtualization, decoupling the decomposition from the process grid (see the structural sketch below)
- Message-driven task invocation
- Allows topology-aware task mapping
- Provides additional features such as dynamic load balancing and performance profiling
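Not the talk's code: a minimal structural sketch of a 2D chare array in Charm++, showing object-based decomposition and message-driven invocation. The module, class, and entry-method names are hypothetical, and the interface (.ci) file is shown as a comment.

```cpp
// --- blocks.ci (Charm++ interface file, shown as a comment) ----------------
// mainmodule blocks {
//   mainchare Main {
//     entry Main(CkArgMsg *m);
//   };
//   array [2D] Block {
//     entry Block();
//     entry void recvPanel(int len, double data[len]);
//   };
// };
// --- blocks.cpp -------------------------------------------------------------
#include "blocks.decl.h"   // generated from blocks.ci by charmc

class Main : public CBase_Main {
public:
  Main(CkArgMsg *msg) {
    const int q = 4;                                 // virtual 2D grid size (assumption)
    CProxy_Block blocks = CProxy_Block::ckNew(q, q); // q*q chares, not physical processes
    double panel[4] = {1.0, 2.0, 3.0, 4.0};
    blocks(0, 1).recvPanel(4, panel);                // message-driven: delivering the
    delete msg;                                      // message invokes work on chare (0,1)
  }
};

class Block : public CBase_Block {
public:
  Block() {}
  Block(CkMigrateMessage *m) {}                      // constructor used for migration
  void recvPanel(int len, double *data) {
    CkPrintf("chare (%d,%d) received a panel of length %d\n",
             thisIndex.x, thisIndex.y, len);
    CkExit();                                        // toy termination for the sketch
  }
};

#include "blocks.def.h"
```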


Virtual topology via Charm++

Implemented Cannon and SD-Cannon in Charm++

- Can map to any torus process topology
- Code is more complex, but not significantly so
- Not using one-sided communication (though possible via CkDirect)
- Virtualization lowers task granularity


2D BlueGene/P performance

[Figure: fraction of theoretical peak versus matrix dimension (1024 to 16384) on an 8x8 torus partition of BG/P, comparing SUMMA, MPI SD-Cannon, Charm++ SD-Cannon, MPI Cannon, and Charm++ Cannon]


3D BlueGene/P performance

[Figure: fraction of theoretical peak versus matrix dimension (4096 to 65536) on an 8x8x8 torus partition of BG/P, comparing SUMMA, Charm++ Cannon VN, Charm++ SD-Cannon VN, Charm++ Cannon SMP, and Charm++ SD-Cannon SMP]


Preliminary 5D BlueGene/Q performance

[Figure: achieved teraflops versus matrix dimension (8192 to 65536) on a 2x4x4x2x4 torus partition of BG/Q, comparing SUMMA, MPI SD-Cannon, and MPI Cannon]


K computer Tofu network

[Figure: Tofu interconnect structure of nodes and units; T marks torus dimensions, M marks mesh dimensions]

10 links per node / 4 torus dimensions / 2 mesh dimensions of length 2


Tofu network is 5D not 6D

A two-by-two mesh is a 1D ring of length 4: its four nodes and their mesh links form a single cycle.


SD-Cannon K computer potential

The 5D K computer torus differs from the 5D BG/Q torus:

- K computer injection bandwidth can saturate only 4 of the 10 links

SD-Cannon could still be beneficial

- Can saturate 4 links rather than 2 (up to a 2X speed-up)
- Does not require a pipelined broadcast implementation


Conclusion

SD-Cannon

- Bridges the performance gap between Cannon and SUMMA
- Is uniquely asymptotically communication-optimal on a k-ary n-cube
- Virtualization allows general mapping support but incurs overhead

Topology-aware mapping and algorithm design

- Allows zero network contention
- Permits saturation of much more bandwidth on torus networks
- Pervasive for parallel scalability on high-end supercomputers


Finish

Acknowledgements:

- Krell CSGF DOE fellowship (DE-AC02-06CH11357)
- Resources at Argonne National Lab
- Anonymous VECPAR reviewers
- Discussions with collaborators:
  - James Demmel (UC Berkeley)
  - Jeff Hammond (Argonne National Lab)
  - Grey Ballard (UC Berkeley)

Also, see my talk on 2.5D algorithms at University of Tokyo

- Tuesday, July 24, 10:00-12:00


Backup slides
