Designing Multi-Leader based Allgather Algorithms for Multi-core Clusters
Krishna Kandalla, Hari Subramoni, Gopal Santhanaraman, Matthew Koop and Dhabaleswar K. Panda
Computer Science & Engineering Department, The Ohio State University
Outline
• Introduction and Background
• Motivation
• Related Work
• Multi-Leader based Algorithms
• Experimental evaluation
• Conclusions and Future Work
Introduction and Background
• MPI is the de-facto programming model for HPC
• Multi-core clusters are becoming increasingly common
• Modern interconnects like InfiniBand offer high bandwidth and low latency
• Collective communication primitives consume a significant amount of time
• It is necessary to have multi-core aware collective designs
Allgather Communication
• Each process broadcasts a vector of data to every other process in the group
• Commonly used algorithms:
  - Recursive Doubling (RD) algorithm for small messages (a minimal sketch of the RD exchange pattern follows below):
      tcomm = ts * log(p) + tw * (p - 1) * m
  - Ring algorithm for large messages:
      tcomm = (ts + tw * m) * (p - 1)
  where tcomm is the total communication cost, ts the communication start-up cost, tw the cost of sending one byte of data, p the number of processes, and m the message size.
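To make the RD cost model concrete, the following is a minimal C/MPI sketch of the recursive doubling exchange pattern, assuming a power-of-two number of processes and contiguous blocks of m bytes per process. The function name rd_allgather and the use of MPI_Sendrecv are illustrative; this is not the implementation evaluated in this work.

    /* Minimal recursive-doubling allgather sketch.
     * Assumes p is a power of two and each process contributes m contiguous bytes. */
    #include <mpi.h>
    #include <string.h>

    void rd_allgather(const char *sendbuf, char *recvbuf, int m, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        /* Place this process's own block at its final position. */
        memcpy(recvbuf + (size_t)rank * m, sendbuf, (size_t)m);

        int step = 0;
        for (int mask = 1; mask < p; mask <<= 1, step++) {
            int peer      = rank ^ mask;               /* exchange partner this step  */
            int my_root   = (rank >> step) << step;    /* first block this rank holds */
            int peer_root = (peer >> step) << step;    /* first block the peer holds  */
            int count     = mask * m;                  /* 2^step blocks of m bytes    */

            /* Each of the log(p) steps exchanges all data accumulated so far,
             * which is where the ts*log(p) + tw*(p-1)*m cost comes from. */
            MPI_Sendrecv(recvbuf + (size_t)my_root * m,   count, MPI_BYTE, peer, 0,
                         recvbuf + (size_t)peer_root * m, count, MPI_BYTE, peer, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }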
Outline
• Introduction and Background
• Motivation
• Related Work
• Multi-Leader based Algorithms
• Experimental evaluation
• Conclusions and Future Work
Scaling on Multi-cores: Recursive Doubling Algorithm
[Figure: Latency (usec) vs. message size (32 B to 4 KB) for 8 and 16 cores/node. Recursive Doubling (RD) scales poorly with increasing core counts.]
Scaling on Multi-cores: Ring Algorithm
[Figure: Latency (usec) vs. message size (8 KB to 256 KB) for 8 and 16 cores/node. The Ring algorithm scales as expected with increasing core counts.]
Scaling on Large Scale Multi-core clusters (Recursive Doubling)
[Figure: Latency (usec) vs. message size (64 B to 8 KB) for 128, 256, 512, and 1024 processes. Recursive Doubling (RD) scales poorly for large system sizes.]
Scaling on Large Scale Multi-core clusters (Ring Algorithm)
[Figure: Latency (usec) vs. message size (16 KB to 256 KB) for 128, 256, 512, and 1024 processes. The Ring algorithm scales as expected for large system sizes.]
Problem Statement
• Is it possible to design an algorithm that:
  - is multi-core and NUMA aware, to achieve better performance and scalability as core counts and system sizes increase?
  - fully exploits the differential memory access costs in NUMA-based multi-core systems?
Outline
• Introduction and Background
• Motivation
• Related Work
• Multi-Leader based Algorithms
• Experimental evaluation
• Conclusions and Future Work
Collective Design Framework
Collective Algorithms
  - Conventional Schemes: Pt2pt
  - Single Leader Schemes: Pt2pt, Shmem
Existing Multi-core aware Algorithms
• Single Leader approach: Aggregation-Distribution
  - Step 1: Data aggregation at the leader on each node
  - Step 2: Inter-leader exchange
  - Step 3: Data distribution within each node
• Steps 1 and 3 are intra-node operations, implemented either through point-to-point MPI calls or through a shared memory buffer visible to all the processes within a node (see the sketch below).
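The three steps can be expressed with standard MPI calls. The sketch below is a simplified point-to-point variant: it assumes ranks are numbered blockwise by node (so concatenating node blocks yields the correct global order), uses MPI_Comm_split_type to build the intra-node communicator, and omits the shared-memory-buffer variant. It illustrates the single-leader idea only and is not the actual implementation.

    /* Single-leader allgather sketch: gather on the node leader, allgather
     * across leaders, broadcast within the node.  Illustrative only. */
    #include <mpi.h>
    #include <stdlib.h>

    void single_leader_allgather(const void *sbuf, int m, void *rbuf, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Intra-node communicator, plus a communicator holding one leader per node. */
        MPI_Comm node_comm, leader_comm;
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &node_comm);
        int node_rank, node_size;
        MPI_Comm_rank(node_comm, &node_rank);
        MPI_Comm_size(node_comm, &node_size);
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

        /* Step 1: aggregate all per-process blocks at the node leader. */
        char *node_buf = NULL;
        if (node_rank == 0)
            node_buf = malloc((size_t)node_size * m);
        MPI_Gather(sbuf, m, MPI_BYTE, node_buf, m, MPI_BYTE, 0, node_comm);

        /* Step 2: leaders exchange their aggregated node blocks. */
        if (node_rank == 0)
            MPI_Allgather(node_buf, node_size * m, MPI_BYTE,
                          rbuf, node_size * m, MPI_BYTE, leader_comm);

        /* Step 3: distribute the complete result to every process on the node. */
        MPI_Bcast(rbuf, size * m, MPI_BYTE, 0, node_comm);

        free(node_buf);
        if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
        MPI_Comm_free(&node_comm);
    }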
Performance of Single Leader Schemes
[Figure: Latency (usec) vs. message size (32 B to 2 KB) for Single Leader pt2pt, Single Leader shmem, and Conventional (RD). Single Leader schemes show potential for improvement.]
Performance of Single Leader Schemes
[Figure: Latency (usec) vs. message size (1 KB to 16 KB) for Single Leader pt2pt, Single Leader shmem, and Conventional (Ring). The conventional Ring algorithm performs better for larger messages.]
Performance of Single Leader Schemes
[Figure: Latency (usec) vs. message size (16 KB to 256 KB) for Single Leader pt2pt and Conventional (Ring). The conventional Ring algorithm performs better for larger messages.]
Outline
• Introduction and Background
• Motivation
• Related Work
• Multi-Leader based Algorithms
• Experimental evaluation
• Conclusions and Future Work
AMD Barcelona Architecture
[Diagram: One node with four quad-core sockets (cores C1 to C4 on each), each socket attached to its own local memory and connected to the other sockets through HyperTransport (HT) links.]
Single Leader algorithms on the AMD Barcelona Architecture
[Diagram: The same four-socket node under a single-leader scheme; the leader process and the shared memory buffer reside on one socket, and processes on the remaining sockets reach the buffer across HT links.]
Proposed Collective Design Framework
Collective Algorithms
  - Conventional Schemes: Pt2pt
  - Single Leader Schemes: Pt2pt, Shmem
  - Multi Leader Schemes: Pt2pt, Shmem
Multi-Leader based Algorithms
• Number of leader processes per node
• Intra-socket and inter-leader exchange algorithms
(a communicator-setup sketch follows below)
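One way to set up such a scheme is to build one communicator per socket plus an inter-leader communicator, as sketched here. Deriving the socket from sched_getcpu() and the cores_per_socket parameter are assumptions for illustration; the actual leader-selection mechanism may differ.

    /* Multi-leader communicator setup sketch: one leader per socket.
     * Socket detection via sched_getcpu() is an illustrative assumption. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <mpi.h>

    void make_multi_leader_comms(MPI_Comm comm, int cores_per_socket,
                                 MPI_Comm *socket_comm, MPI_Comm *leader_comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        /* Group processes on the same node, then sub-group them by socket. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &node_comm);
        int socket_id = sched_getcpu() / cores_per_socket;   /* assumes block-wise core binding */
        MPI_Comm_split(node_comm, socket_id, rank, socket_comm);

        /* The lowest rank in each socket group acts as that socket's leader;
         * all leaders form the inter-leader communicator used for RD or ring. */
        int socket_rank;
        MPI_Comm_rank(*socket_comm, &socket_rank);
        MPI_Comm_split(comm, socket_rank == 0 ? 0 : MPI_UNDEFINED, rank, leader_comm);

        MPI_Comm_free(&node_comm);
    }

With these communicators, the allgather proceeds as in the single-leader case, but the aggregation and distribution steps stay within a socket, which is intended to reduce contention on any single leader and keep intra-node traffic on socket-local memory.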
Outline
• Introduction and Background
• Motivation
• Related Work
• Multi-Leader based Algorithms
• Experimental evaluation
• Conclusions and Future Work
Experimental Test-bed
• Each node of our testbed has 16 AMD Opteron 1.95 GHz processors with 512 KB L2 cache. We used 8 nodes.
• Each node has 16 GB memory, a PCI-Express bus, and 2 MT25418 DDR HCAs with PCI-Express interfaces.
• A 24-port Mellanox switch is used to connect all the nodes.
• RedHat Enterprise Linux Server 5.
Performance of Multi-Leader pt2pt
[Figure: Latency (usec) vs. message size (32 B to 2 KB) for 1-, 2-, 4-, and 8-Leader pt2pt schemes and Conventional (RD). The 4-Leader scheme does about 20% better than the Single Leader scheme and 50% better than RD.]
Performance of Multi-leader: Shared Memory
[Figure: Latency (usec) vs. message size (32 B to 2 KB) for 1-, 2-, 4-, and 8-Leader shmem schemes and Conventional (RD). The 4-Leader scheme performs about 25% better than the 1-Leader scheme and 70% better than RD.]
Performance of Multi-Leader Schemes (pt2pt Vs Shared Memory)
[Figure: Latency (usec) vs. message size (32 B to 2 KB) for 4-Leader pt2pt, 4-Leader shmem, and Conventional (RD). The 4-Leader shared memory approach performs about 40% better than the 4-Leader point-to-point scheme.]
Performance of Multi-Leader schemes on large scale Multi-cores
[Figure: Latency (usec) vs. message size (bytes) for 1-, 2-, 4-, and 8-Leader pt2pt schemes and Conventional (RD) on 1024 processes of the TACC Ranger. The 4-Leader point-to-point scheme outperforms recursive doubling.]
Performance of Multi-Leader schemes on large scale Multi-cores
[Figure: Latency (usec) vs. message size (1 KB to 16 KB) for 1-, 2-, 4-, and 8-Leader pt2pt schemes and Conventional (Ring) at large scale. The conventional Ring algorithm performs better for larger messages.]
Proposed Unified Scheme
Message Size       Intra-Node Mechanism   Inter-Leader Algorithm       Design
Small Messages     Point-to-Point         Recursive Doubling           Hierarchical
Medium Messages    Shared Memory          Recursive Doubling / Ring    Hierarchical
Large Messages     Point-to-Point         Ring                         Conventional
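A possible size-based selection routine implementing the table above is sketched below; the cutoff values and the enum names are placeholders, not the tuned crossover points from the evaluation.

    /* Size-based algorithm selection sketch for the unified scheme.
     * Threshold values are illustrative, not the tuned constants. */
    #include <stddef.h>

    typedef enum {
        ALG_HIER_PT2PT_RD,   /* small:  pt2pt intra-node + RD across leaders       */
        ALG_HIER_SHMEM,      /* medium: shared memory intra-node + RD/ring leaders */
        ALG_CONV_RING        /* large:  conventional ring over all processes       */
    } allgather_alg_t;

    allgather_alg_t select_allgather_alg(size_t msg_size)
    {
        const size_t small_max  = 2048;    /* bytes, assumed cutoff */
        const size_t medium_max = 16384;   /* bytes, assumed cutoff */

        if (msg_size <= small_max)  return ALG_HIER_PT2PT_RD;
        if (msg_size <= medium_max) return ALG_HIER_SHMEM;
        return ALG_CONV_RING;
    }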
Outline
• Introduction and Background
• Motivation
• Related Work
• Multi-Leader based Algorithms
• Experimental evaluation
• Conclusions and Future Work
Conclusions & Future Work
• Single Leader schemes are limited by scalability and memory contention. The proposed Multi-Leader schemes show significant performance benefits.
• Future work:
  - Examine the benefits of using kernel-based zero-copy intra-node exchanges for large messages.
  - Develop a framework that can choose leaders in an optimal manner for emerging multi-core systems.
  - Evaluate the impact of such designs on real-world applications.