Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Torus networksAlgorithms
ImplementationPerformanceConclusion
Matrix multiplication on multidimensional torusnetworks
Edgar Solomonik, and James Demmel
University of California, Berkeley
VECPAR, July 2012
Edgar Solomonik Split-Dimensional Cannon’s algorithm 1/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
Outline
Torus networksBlueGene architectureCollective communication
AlgorithmsSUMMACannon’s algorithmSD-Cannon’s algorithm
ImplementationOne-sided MPI communicationCharm++ virtualization
Performance
Conclusion
Edgar Solomonik Split-Dimensional Cannon’s algorithm 2/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
BlueGene architectureCollective communication
BlueGene/P and BlueGene/Q
Direct torus networks
I BG/P is 3D, BG/Q is 5D
I Both are bidirectional networks (6 and 10 links per node)
I Injection bandwidth sufficient to saturate all links
I Topology-aware partition allocation and collectives
Edgar Solomonik Split-Dimensional Cannon’s algorithm 3/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
BlueGene architectureCollective communication
Performance of multicast (BG/P vs Cray)
128
256
512
1024
2048
4096
8192
8 64 512 4096
Ban
dwid
th (M
B/s
ec)
#nodes
1 MB multicast on BG/P, Cray XT5, and Cray XE6
BG/PXE6XT5
Edgar Solomonik Split-Dimensional Cannon’s algorithm 4/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
BlueGene architectureCollective communication
Why the performance discrepancy in multicasts?
I Cray machines use binomial multicastsI Form spanning tree from a list of nodesI Route copies of message down each branchI Network contention degrades utilization on a 3D torus
I BG/P uses rectangular (pipelined) multicastsI Require network topology to be a k-ary n-cubeI Form 2n edge-disjoint spanning trees
I Route in different dimensional orderI Use both directions of bidirectional network
Edgar Solomonik Split-Dimensional Cannon’s algorithm 5/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
BlueGene architectureCollective communication
2D rectangular pipelined multicast trees
root2D 4X4 Torus Spanning tree 1 Spanning tree 2
Spanning tree 3 Spanning tree 4 All 4 trees combined
[Watts and Van De Geijn 95]
Edgar Solomonik Split-Dimensional Cannon’s algorithm 6/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
SUMMACannon’s algorithmSD-Cannon’s algorithm
Matrix multiplication
A
BA
B
A
B
AB
[Van De Geijn and Watts 97]
Edgar Solomonik Split-Dimensional Cannon’s algorithm 7/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
SUMMACannon’s algorithmSD-Cannon’s algorithm
SUMMA and LU with rectangular vs binomial collectives
0
10
20
30
40
50
60
70
80
MM LU LU+PVT
Per
cent
age
of m
achi
ne p
eak
Different collectives on BG/P (n=131,072, p=16,384)
binomialrectangular
Edgar Solomonik Split-Dimensional Cannon’s algorithm 8/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
SUMMACannon’s algorithmSD-Cannon’s algorithm
Cannon’s algorithmA B
Stagger left
A[i,j] := A[i,j+1]
Shift right
A[i,j] := A[i,j-1]
Starting position
Stagger up
B[i,j] := B[i+1,j]
Shift down
B[i,j] := B[i-1,j]
Starting position
...
[Cannon 69]
Edgar Solomonik Split-Dimensional Cannon’s algorithm 9/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
SUMMACannon’s algorithmSD-Cannon’s algorithm
Cannon’s algorithm
Advantages over SUMMAI Uses only near-neighbor sends rather than multicasts
I lower latency cost
I Can be done in-place given near-neighbor data-swaps
Disadvantages with respect to SUMMA
I Does not generalize well to non-square processor grids
I Cannot exploit multiple links via rectangular multicasts
Edgar Solomonik Split-Dimensional Cannon’s algorithm 10/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
SUMMACannon’s algorithmSD-Cannon’s algorithm
Split-dimensional Cannon’s algorithm
Improves over Cannon by saturating all network links
I Subdivide whole multiply into 2n block outer products
I Use bidirectional links by shifting half the outer products inopposite direction
I Perform each outer product in a different dimensional order
I Accumulation of outer-products into one buffer allows forsame-sized local multiplications as pure Cannon’s algorithm
I Does not require pipelined multicasts
Edgar Solomonik Split-Dimensional Cannon’s algorithm 11/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
SUMMACannon’s algorithmSD-Cannon’s algorithm
Split-dimensional Cannon’s algorithm
SD-Cannon on a 3-ary 6-cube
dim 1
dim 2
dim 3
Each circle corresponds to a shift along a dimension
Each color corresponds to an outer product
Edgar Solomonik Split-Dimensional Cannon’s algorithm 12/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
One-sided MPI communicationCharm++ virtualization
MPI implementation
SD-Cannon parallel implementation using MPI
I All communication done with near-neighbor one-sided puts
I Code is simple (200 lines)
I Limited to square (k-ary n-cube) processor grids
Edgar Solomonik Split-Dimensional Cannon’s algorithm 13/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
One-sided MPI communicationCharm++ virtualization
Charm++
Charm++ is an asynchronous dynamic runtime system
I Provides object-based (chares) virtualization (decouples fromprocess grid)
I Message-directed task invocation
I Allows topology-aware task mapping
I Provides additional features such as dynamic load balancingand performance profiling
Edgar Solomonik Split-Dimensional Cannon’s algorithm 14/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
One-sided MPI communicationCharm++ virtualization
Virtual topology via Charm++
Implemented Cannon and SD-Cannon in Charm++
I Can map to any torus process topology
I Code more complex, but not significantly so
I Not using one-sided communication (though possible viaCkDirect)
I Virtualization lowers task granularity
Edgar Solomonik Split-Dimensional Cannon’s algorithm 15/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
2D BlueGene/P performance
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1024 2048 4096 8192 16384
Frac
tion
of th
eore
tical
pea
k
matrix dimension
Performance on 8x8 torus partition of BG/P
SUMMAMPI SD-Cannon
Charm++ SD-CannonMPI Cannon
Charm++ Cannon
Edgar Solomonik Split-Dimensional Cannon’s algorithm 16/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
3D BlueGene/P performance
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
4096 8192 16384 32768 65536
Frac
tion
of th
eore
tical
pea
k
matrix dimension
Performance on 8x8x8 torus partition of BG/P
SUMMACharm++ Cannon VN
Charm++ SD-Cannon VNCharm++ Cannon SMP
Charm++ SD-Cannon SMP
Edgar Solomonik Split-Dimensional Cannon’s algorithm 17/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
Preliminary 5D BlueGene/Q performance
12
14
16
18
20
22
24
26
28
30
8192 16384 32768 65536
Tera
flops
matrix dimension
Performance on 2x4x4x2x4 torus partition of BG/Q
SUMMAMPI SD-Cannon
MPI Cannon
Edgar Solomonik Split-Dimensional Cannon’s algorithm 18/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
K computer Tofu network
T - Torus M - Mesh
M
M
T
TTT
UnitNode
10 links per node / 4 torus dimensions / 2 mesh dimensions of length 2
Edgar Solomonik Split-Dimensional Cannon’s algorithm 19/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
Tofu network is 5D not 6D
=
A two-by-two mesh is a 1D ring of length 4
Edgar Solomonik Split-Dimensional Cannon’s algorithm 20/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
SD-Cannon K computer potential
5D K computer torus different from 5D BG/Q
I K computer injection bandwidth can saturate only 4/10 links
SD-Cannon could still be beneficial
I Can saturate 4 links rather than 2 (up to 2X speed-up)
I Does not require pipelined broadcast implementation
Edgar Solomonik Split-Dimensional Cannon’s algorithm 21/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
Conclusion
SD-Cannon
I Breaches performance gap between Cannon and SUMMA
I Is uniquely asymptotically communication-optimal on a k-aryn-cube
I Virtualization allows general mapping support but incursoverhead
Topology-aware mapping and algorithm design
I Allows zero network contention
I Permits saturation of much more bandwidth on torus networks
I Pervasive for parallel scalability on high-end supercomputers
Edgar Solomonik Split-Dimensional Cannon’s algorithm 22/ 23
Torus networksAlgorithms
ImplementationPerformanceConclusion
Finish
Acknowledgements:
I Krell CSGF DOE fellowship (DE-AC02-06CH11357)
I Resources at Argonne National Lab
I Anonymous VECPAR reviewersI Discussions with collaborators
I James Demmel (UC Berkeley)I Jeff Hammond (Argonne National Lab)I Grey Ballard (UC Berkeley)
Also, see my talk on 2.5D algorithms at University of Tokyo
I Tuesday, July 24, 10:00-12:00
Edgar Solomonik Split-Dimensional Cannon’s algorithm 23/ 23
Backup slides
Edgar Solomonik Split-Dimensional Cannon’s algorithm 24/ 23