Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Exploiting and modelling topology awarecollectives
Brian Van Straalen and Edgar Solomonik
May 5, 2011
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 1/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Outline
1 Introduction: Dense Linear Algebra
2 2.5D Performance
3 MulticastsTopology-aware multicasts
4 Modelling collectivesAssumptions and terminologyModel derivationsVerification of multicast model
5 Performance scalingExascale architectureCollectives at exascale2.5D algoritms at exascale
6 Conclusions and future work
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 2/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
”3D” Matrix Algorithms
Current standard parallel dense matrix algorithms are far fromoptimal for communication (Demmel)
PDGEMM and other BLAS 3 algorithms in ScaLAPACK useminimal memory (do not replicate the matrix elements).3D algorithms replicate the elements to achieve the maximumamount of reuse and minimize communication.
While optimal, there is an obvious limitation here...availablememory
2.5D algorithms achieve maximum reuse and minimizecommunication for any given amount of available memory.Possible to create optimal dense algorithms under machinemodels that have both data movement costs and local memorysize limitations.
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 3/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
2.5D matrix multiplication
0
1
2
3
0 21 3
1
2
3
0
2 03 1
0
1
21 2 0
0
10 10
0
2
32 30
10 10
0
22
+
22
33
00
11
+3D (P=64, c=4)
2.5D (P=32, c=2)
2D (P=16, c=1)
B₁₂
B₂₂
B₃₂
B₄₂
A₃₂A₃₁ A₃₃ A₃₄=
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 4/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Example: 2.5D LU factorization
2. Perform TRSMs to compute a panel of L and a panel of U.
1. Factorize A₀₀ andcommunicate L₀₀ and U₀₀among layers.
L₀₀
U₀₀
U₀₃
U₀₃
U₀₁
L₂₀L₃₀
L₁₀
3. Broadcast blocks so alllayers own the panelsof L and U.
(A)
(B)
4.Broadcast differentsubpanels within eachlayer.
5.Multiply subpanelson each layer.
6.Reduce (sum) thenext panels.*
U
L
7. Broadcast the panels and continue factorizing the Schur's complement...
* All layers always need to contribute to reductioneven if iteration done with subset of layers.
(C)(D)
U₀₀
U₀₀
L₀₀
L₀₀
U₀₀
L₀₀
2.5D algorithmspropogate dataalongdimensions...
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 5/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Collectives in dense linear algebra
BLAS 3 dense algorithms perform line multicasts and linereductions
These arise naturally due to dimensional dependenciese.g. rank k updates, Schur complement updates, QR updates
ScaLAPACK algorithms map well to a 2D network torus
We show that novel 2.5D algorithms can map to a any torus
can exploit specialized line collectives
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 6/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
2.5D MM with different collectives
0
0.5
1
1.5
2
P=2048, 2D
P=2048, 2.5D bnm
P=2048, 2.5D rect
P=16384, 2D
P=16384, 2.5D bnm
P=16384, 2.5D rect
Tim
e no
rmal
ized
by
2D
2.5D MM vs 2D MM on BG/P (log2(n3/P)=37)
compute timeidle time
communication time
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 7/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
2.5D LU with different collectives
0
0.5
1
1.5
2
P=2048, 2D
P=2048, 2.5D bnm
P=2048, 2.5D rect
P=16384, 2D
P=16384, 2.5D bnm
P=16384, 2.5D rect
Tim
e no
rmal
ized
by
2D
2.5D LU vs 2D LU on BG/P (log2(n3/P)=37)
compute timeidle time
communication time
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 8/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Topology-aware multicasts
Why the performance discrepancy in multicasts?
Cray machines use binomial multicastsForm spanning tree from a list of nodesRoute copies of message down each branchNetwork contention degrades utilization on a 3D torus
BG/P uses rectangular multicastsRequire network topology to be a k-ary n-cubeForm 2n edge-disjoint spanning trees
Route in different dimensional orderUse both directions of bidirectional network
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 9/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Topology-aware multicasts
2D rectangular multicasts
root2D 4X4 Torus Spanning tree 1 Spanning tree 2
Spanning tree 3 Spanning tree 4 All 4 trees combined
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 10/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Topology-aware multicasts
Another look at that first plot
Just how much better arerectangular algorithms onP = 4096 nodes?
Binomial collectives on XE6
1/30th of linkbandwidth
Rectangular collectives onBG/P
4.3X the link bandwidth
Over 120X improvementin efficiency!
128
256
512
1024
2048
4096
8192
8 64 512 4096B
andw
idth
(MB
/sec
)
#nodes
1 MB multicast on BG/P, Cray XT5, and Cray XE6
BG/PXE6XT5
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 11/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Assumptions and terminologyModel derivationsVerification of multicast model
A basic model for topology-aware collectives
Aim to model scaling behavior of rectangular collectives
Compare with existing binomial collectives models
Assumptions
k-ary n-cube torus networkbroadcast root at corner of cubemessage is large (focus on bandwidth)LogP messaging model appliesDMA device and wormhole routing are used for messaging
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 12/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Assumptions and terminologyModel derivationsVerification of multicast model
Terminology
L physical network latencyo overhead of sending and receiving a messageg reciprocal of single-link bandwidthP number of processors.d dimension of network torusBn = 2d/g total node link bandwidthm message size in bytesγ flop rate of a nodeβ memory bandwidth of a node
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 13/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Assumptions and terminologyModel derivationsVerification of multicast model
Modelling the communication fabric
For large message size Rendezvous messaging is used
handshare is performed with an eager send (2o + L)receiver gets the data with a one-sided get (o + 2L)total cost tr = mr · g + 3o + 3L
The DMA engine can use all torus links simultaneously
L costs in each dimension can be overlappedSender overhead, o, cannot be overlapped
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 14/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Assumptions and terminologyModel derivationsVerification of multicast model
A model for rectangular multicasts
tmcast = m/Bn + 2(d + 1) · o + 3L + d · P1/d · (2o + L)
Our multicast model consists of 3 terms
1 m/Bn, the bandwidth cost incurred at the root
2 2(d + 1) · o + 3L, the start-up overhead of setting up themulticasts in all dimensions
3 d · P1/d · (2o + L), the path overhead reflects the time for apacket to get from the root to the farthest destination node
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 15/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Assumptions and terminologyModel derivationsVerification of multicast model
A model for rectangular reductions
tred = max[m/(8γ), 3m/β,m/Bn]+2(d+1)·o+3L+d ·P1/d ·(2o+L)
Any multicast tree can be inverted to produce a reduction tree
The reduction operator must be applied at each node
each node operates on 2m databoth the memory bandwidth and computation cost can beoverlapped
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 16/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Assumptions and terminologyModel derivationsVerification of multicast model
A model for binomial multicasts
The root of the binomial tree sends the entire messagelog2(P) times
The setup overhead is overlapped with the path overhead
We assume no contention
tbnm = log2(P) · (m/Bn + 2o + L)
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 17/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Assumptions and terminologyModel derivationsVerification of multicast model
Model verification: one dimension
200
400
600
800
1000
1 8 64 512 4096 32768 262144
Ban
dwid
th (M
B/s
ec)
msg size (KB)
DCMF Broadcast on a ring of 8 nodes of BG/P
trect modelDCMF rectangle dput
tbnm modelDCMF binomial
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 18/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Assumptions and terminologyModel derivationsVerification of multicast model
Model verification: two dimensions
0
500
1000
1500
2000
1 8 64 512 4096 32768 262144
Ban
dwid
th (M
B/s
ec)
msg size (KB)
DCMF Broadcast on 64 (8x8) nodes of BG/P
trect modelDCMF rectangle dput
tbnm modelDCMF binomial
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 19/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Assumptions and terminologyModel derivationsVerification of multicast model
Model verification: three dimensions
0
500
1000
1500
2000
2500
3000
1 8 64 512 4096 32768 262144
Ban
dwid
th (M
B/s
ec)
msg size (KB)
DCMF Broadcast on 512 (8x8x8) nodes of BG/P
trect modelKumar et al data
DCMF rectangle dputtbnm model
DCMF binomial
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 20/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Assumptions and terminologyModel derivationsVerification of multicast model
Rectangular reduction performance on BG/P
50
100
150
200
250
300
350
400
8 64 512
Ban
dwid
th (M
B/s
ec)
#nodes
DCMF Reduce peak bandwidth (largest message size)
torus rectangle ringtorus binomial
torus rectangletorus short binomial
BG/P rectangular reduction performs significantly worse thanmulticast
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 21/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Assumptions and terminologyModel derivationsVerification of multicast model
A custom 1D ring reduction implementation
Implemented with near-neighbor MPI Sends
high overhead due to re-establishment of handshakeslarge packet size required
Reduction must be software pipelined to achieve overlap
First version achieves overlap via interrupts
Second version achieves overlap via explictly managedOpenMP threads
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 22/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Assumptions and terminologyModel derivationsVerification of multicast model
Performance of custom line reduction
100
200
300
400
500
600
700
800
512 4096 32768
Ban
dwid
th (M
B/s
ec)
msg size (KB)
Performance of custom Reduce/Multicast on 8 nodes
MPI BroadcastCustom Ring Multicast
Custom Ring Reduce 2Custom Ring Reduce 1
MPI Reduce
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 23/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Exascale architectureCollectives at exascale2.5D algoritms at exascale
A hypothetical exascale architecture
Total flop rate (γ) 1018 flop/sTotal memory 32 PBNode count (P) 262,144Node interconnect bandwidth (β) 100 GB/sLatency overhead (o) 250 nsNetwork latency (L) 250 nsTopology 3D Torus
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 24/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Exascale architectureCollectives at exascale2.5D algoritms at exascale
Performance of collectives at exascale
0
20
40
60
80
100
120
140
1 8 64 512 4096 32768 262144
Ban
dwid
th (G
B/s
ec)
msg size (MB)
Exascale broadcast performance
trect 3Dtrect 2Dtrect 1Dtbnm 3Dtbnm 2Dtbnm 1D
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 25/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Exascale architectureCollectives at exascale2.5D algoritms at exascale
Performance of matrix multiplication at exascale
0.4
0.5
0.6
0.7
0.8
0.9
1
1 8 64
Par
alle
l effi
cien
cy
z dimension of partition
MM strong scaling at exascale (xy plane to full xyz torus)
2.5D with rectangular (c=z)2.5D with binomial (c=z)
2D with binomial
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 26/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Exascale architectureCollectives at exascale2.5D algoritms at exascale
Performance of LU at exascale
0
0.2
0.4
0.6
0.8
1
1 8 64
Par
alle
l effi
cien
cy
z dimension of partition
LU strong scaling at exascale (xy plane to full xyz torus)
2.5D with rectangular (c=z)2.5D with binomial (c=z)
2D LU with binomial
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 27/ 28
Introduction: Dense Linear Algebra2.5D Performance
MulticastsModelling collectivesPerformance scaling
Conclusions and future work
Conclusions and future work
2.5D algorithms perform well by using rectangular collectives
implement pivoting into 2.5D LUextend 2.5D algorithms to SUMMA algorithm, QR, etc.incorporate 2.5D algorithms into applications (e.g. quantumchemistry)
Collectives models show rectangular algorithms scalingsuperiority
model contention within a multicast treemodel contention among concurrent multicast trees
Topology-awareness and rectangular collectives are importantat exascale
analyze different exascale network topologies
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 28/ 28
Backup slides
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 29/ 28
Insight from BLAS 3 communication lower bounds
Given local memory of size M, each processor communicates
W = Ω
(n3/P√
M
)words.
2D algorithms: M = O(n2/P)
Standard algorithms (ScaLAPACK)Use single copy of input matrices
Move O(
n2√P
)words
2.5D algorithms: M = O(c · n2/P)
Use c copies of input matrices
Move O(
n2√c·P
)words
Less communication via redundant memory usage
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 30/ 28
Virtual topology of 2.5D algorithms
(p/c)-1
i
j
k
0
00
(p/c) -11/2
1/2
c-1
2D algorithm mapping:(√
P)×(√
P)
grid
2.5D algorithm mapping:(√
P/c)×(√
P/c)× c grid for any c
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 31/ 28
2.5D matrix multiplication performance
0
10
20
30
40
50
60
70
80
25 31 37
Per
cent
age
of p
eak
log(n3/P)
2.5D MM on BG/P
P=2048, 2D MMP=2048, 2.5D MM
P=16384, 2D MMP=16384, 2.5D MM
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 32/ 28
2.5D LU performance
0
5
10
15
20
25
30
35
25 31 37
Per
cent
age
of p
eak
log(n3/P)
2.5D LU on BG/P
P=2048, 2D LUP=2048, 2.5D LU
P=16384, 2D LUP=16384, 2.5D LU
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 33/ 28
Reduction in total communication time
0
20
40
60
80
100
25 31 37
Red
uctio
n in
com
mun
icat
ion
time
(%)
log(n3/P)
2.5D vs 2D communication reduction on BG/P
P=2048, MMP=2048, LU
P=16384, MMP=16384, LU
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 34/ 28
Performance analysis
2.5D algorithms map precisely to BG/P partitions
allows usage of rectangular collectivesreduces network contention
Communication time reduced substantially (up to 92%)
mostly due to rectangular collectives
Idle time also reduced in LU
communication along critical path reduced
So rectangular collectives can indeed be used by higher levelalgorithms
Now lets take a closer look at rectangular collectives ondifferent/future architectures
Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 35/ 28