
Designing Multi-Leader based Allgather Algorithms for Multi-core Clusters

Krishna Kandalla, Hari Subramoni, Gopal Santhanaraman, Matthew Koop and Dhabaleswar K. Panda

Computer Science & Engineering Department, The Ohio State University

Outline

•  Introduction and Background
•  Motivation
•  Related Work
•  Multi-Leader based Algorithms
•  Experimental Evaluation
•  Conclusions and Future Work

Introduction and Background

•  MPI is the de-facto programming model for HPC
•  Multi-core clusters are becoming increasingly common
•  Modern interconnects like InfiniBand offer high bandwidth and low latency
•  Collective communication primitives consume a significant amount of communication time
•  It is necessary to have multi-core aware collective designs

Allgather Communication

•  Each process broadcasts a vector of data to every other process in the group
•  Commonly used algorithms:
   •  Recursive Doubling (RD) algorithm for small messages:
        tcomm = ts * log(p) + tw * (p - 1) * m
   •  Ring algorithm for large messages:
        tcomm = (ts + tw * m) * (p - 1)

   tcomm - total communication cost
   ts    - communication start-up cost
   tw    - cost of sending one byte of data
   p     - number of processes
   m     - message size
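
For reference, a minimal sketch of the Ring Allgather written with MPI point-to-point calls is shown below; the function name, the MPI_BYTE buffers, and the calling convention are illustrative assumptions, not code from the talk. Each of the p - 1 steps forwards one m-byte block to the right neighbor, which matches the cost model above.

#include <mpi.h>
#include <string.h>

/* Sketch of the ring Allgather: after p-1 steps every process holds all p blocks.
 * Each step costs roughly ts + tw*m, giving tcomm = (ts + tw*m) * (p - 1). */
void ring_allgather(const void *sendbuf, void *recvbuf, int m, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int left  = (rank - 1 + p) % p;
    int right = (rank + 1) % p;
    char *recv = (char *)recvbuf;

    /* Place this process's own block at its final position. */
    memcpy(recv + (size_t)rank * m, sendbuf, m);

    /* In step i, forward the block received in step i-1 and receive a new one. */
    for (int i = 0; i < p - 1; i++) {
        int send_block = (rank - i + p) % p;
        int recv_block = (rank - i - 1 + p) % p;
        MPI_Sendrecv(recv + (size_t)send_block * m, m, MPI_BYTE, right, 0,
                     recv + (size_t)recv_block * m, m, MPI_BYTE, left,  0,
                     comm, MPI_STATUS_IGNORE);
    }
}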

Outline

•  Introduction and Background
•  Motivation
•  Related Work
•  Multi-Leader based Algorithms
•  Experimental Evaluation
•  Conclusions and Future Work

Recursive Doubling (RD) Algorithm on Multi-cores

[Diagram: recursive doubling exchange pattern across 16 processes (ranks 0-15) placed on multi-core nodes.]

Ring Algorithm on Multi-cores

[Diagram: ring exchange pattern across 16 processes (ranks 0-15) placed on multi-core nodes.]

Scaling on Multi-cores: Recursive Doubling Algorithm

[Figure: latency (usec) vs. message size (32 B - 4 KB) for 8 and 16 cores per node. Recursive Doubling (RD) scales poorly with increasing core counts.]

Scaling on Multi-cores: Ring Algorithm

[Figure: latency (usec) vs. message size (8 KB - 256 KB) for 8 and 16 cores per node. The Ring algorithm scales as expected with increasing core counts.]

Scaling on Large Scale Multi-core Clusters (Recursive Doubling)

[Figure: latency (usec) vs. message size (64 B - 8 KB) for 128, 256, 512 and 1024 processes. Recursive Doubling (RD) scales poorly as the system size grows.]

Scaling on Large Scale Multi-core Clusters (Ring Algorithm)

[Figure: latency (usec) vs. message size (16 KB - 256 KB) for 128, 256, 512 and 1024 processes. The Ring algorithm scales as expected for large system sizes.]

Problem Statement

•  Is it possible to design an algorithm that:
   - is multi-core and NUMA aware, for better performance and scalability as core counts and system sizes increase?
   - fully exploits the differential memory access costs of NUMA-based multi-core systems?

Outline

•  Introduction and Background
•  Motivation
•  Related Work
•  Multi-Leader based Algorithms
•  Experimental Evaluation
•  Conclusions and Future Work

Collective Design Framework

Collective Algorithms
•  Conventional Schemes: Pt2pt
•  Single Leader Schemes: Pt2pt, Shmem

Existing Multi-core aware Algorithms

•  Single Leader approaches: Aggregation - Distribution
   Step 1: Data aggregation at the leader on each node
   Step 2: Inter-leader exchanges
   Step 3: Data distribution within each node

•  Steps 1 and 3 are intra-node operations, implemented either with
   → point-to-point MPI calls, or
   → a shared memory buffer visible to all the processes within a node
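
The aggregation-distribution idea maps naturally onto standard MPI calls. The sketch below is a simplified illustration under stated assumptions, not the paper's implementation: node_comm is assumed to group the processes of one node (e.g. via MPI_Comm_split_type with MPI_COMM_TYPE_SHARED), leader_comm is assumed to group the node leaders, and the result is assumed to be laid out node by node in leader-rank order. The shared memory variant would replace Steps 1 and 3 with copies through a node-wide shared buffer instead of MPI calls.

#include <mpi.h>

/* Illustrative single-leader Allgather: gather on the node leader,
 * Allgather among the leaders, then broadcast the full result in the node. */
void single_leader_allgather(const void *sendbuf, int m, void *recvbuf,
                             int total_procs, MPI_Comm node_comm,
                             MPI_Comm leader_comm)
{
    int node_rank, node_size, leader_rank = 0;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    if (node_rank == 0)
        MPI_Comm_rank(leader_comm, &leader_rank);

    /* Step 1: aggregate this node's blocks at the leader (node rank 0);
     * the leader places them at the node's slot in the final layout. */
    char *dst = (char *)recvbuf + (size_t)leader_rank * node_size * m;
    MPI_Gather(sendbuf, m, MPI_BYTE, dst, m, MPI_BYTE, 0, node_comm);

    /* Step 2: the node leaders exchange the aggregated blocks in place. */
    if (node_rank == 0)
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      recvbuf, node_size * m, MPI_BYTE, leader_comm);

    /* Step 3: distribute the complete result within the node. */
    MPI_Bcast(recvbuf, total_procs * m, MPI_BYTE, 0, node_comm);
}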

Single Leader Algorithms: Steps 1-3

[Diagrams: the three steps shown for both intra-node variants. Point-to-point variant: Step 1 intra-node aggregation (pt2pt), Step 2 inter-node exchange (pt2pt), Step 3 intra-node distribution (pt2pt). Shared memory variant: Steps 1 and 3 go through the node's shared memory buffer, while Step 2 remains pt2pt.]

Performance of Single Leader Schemes (small messages)

[Figure: latency (usec) vs. message size (32 B - 2 KB) for Single Leader pt2pt, Single Leader shmem and Conventional (RD). The Single Leader schemes show potential for improvement.]

Performance of Single Leader Schemes (medium messages)

[Figure: latency (usec) vs. message size (1 KB - 16 KB) for Single Leader pt2pt, Single Leader shmem and Conventional (Ring). The conventional Ring algorithm performs better for larger messages.]

Performance of Single Leader Schemes (large messages)

[Figure: latency (usec) vs. message size (16 KB - 256 KB) for Single Leader pt2pt and Conventional (Ring). The conventional Ring algorithm performs better for larger messages.]

Outline

•  Introduction and Background
•  Motivation
•  Related Work
•  Multi-Leader based Algorithms
•  Experimental Evaluation
•  Conclusions and Future Work

AMD Barcelona Architecture

[Diagram: one node with four sockets, each containing four cores (C1-C4) and a local memory bank, connected by HyperTransport (HT) links.]

Single Leader algorithms on the AMD Barcelona Architecture

[Diagram: the four-socket node with the single leader process and the node-wide shared memory buffer placed on one socket; processes on the other sockets reach them over the HT links.]

Proposed Collective Design Framework

Collective Algorithms
•  Conventional Schemes: Pt2pt
•  Single Leader Schemes: Pt2pt, Shmem
•  Multi Leader Schemes: Pt2pt, Shmem

Multi-Leader based Algorithms

•  Number of leader processes per node
•  Intra-socket and inter-leader exchange algorithms
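
One way to realize per-socket leaders is through communicator splits. The sketch below is hypothetical: it assumes that consecutive node-local ranks are bound to the cores of one socket and that CORES_PER_SOCKET is a compile-time placeholder, whereas a real implementation would query the hardware topology (e.g. with hwloc).

#include <mpi.h>

#define CORES_PER_SOCKET 4   /* placeholder; four cores per socket on Barcelona */

/* Illustrative setup for a multi-leader scheme: one leader per socket.
 * socket_comm groups the processes of one socket; leader_comm groups the
 * socket leaders (all other processes receive MPI_COMM_NULL). */
void build_multileader_comms(MPI_Comm node_comm, MPI_Comm world_comm,
                             MPI_Comm *socket_comm, MPI_Comm *leader_comm)
{
    int node_rank, world_rank;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_rank(world_comm, &world_rank);

    /* Group the processes that share a socket. */
    int socket_id = node_rank / CORES_PER_SOCKET;
    MPI_Comm_split(node_comm, socket_id, node_rank, socket_comm);

    /* The lowest rank on each socket becomes that socket's leader. */
    int is_leader = (node_rank % CORES_PER_SOCKET == 0);
    MPI_Comm_split(world_comm, is_leader ? 0 : MPI_UNDEFINED,
                   world_rank, leader_comm);
}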

Multi-Leader based Algorithms (Steps 1-3)

[Diagrams: four nodes (N1-N4), each with four sockets (S1-S4) and one leader process per socket. Step 1: intra-socket aggregation at each socket leader. Step 2: exchange among the socket leaders across nodes. Step 3: intra-socket distribution from each leader.]

Outline

•  Introduction and Background
•  Motivation
•  Related Work
•  Multi-Leader based Algorithms
•  Experimental Evaluation
•  Conclusions and Future Work

Experimental Test-bed

•  Each node of our testbed has 16 AMD Opteron processors running at 1.95 GHz with 512 KB L2 cache. We used 8 nodes.

•  Each node has 16 GB of memory and a PCI-Express bus with two MT25418 DDR HCAs.

•  A 24-port Mellanox switch connects all the nodes.

•  RedHat Enterprise Linux Server 5.

Performance of Multi-Leader pt2pt

[Figure: latency (usec) vs. message size (32 B - 2 KB) for 1-, 2-, 4- and 8-Leader pt2pt and Conventional (RD). The 4-Leader scheme is about 20% better than the Single Leader scheme and about 50% better than RD.]

Performance of Multi-Leader: Shared Memory

[Figure: latency (usec) vs. message size (32 B - 2 KB) for 1-, 2-, 4- and 8-Leader shmem and Conventional (RD). The 4-Leader scheme is about 25% better than the 1-Leader scheme and about 70% better than RD.]

Performance of Multi-Leader Schemes (pt2pt vs. Shared Memory)

[Figure: latency (usec) vs. message size (32 B - 2 KB) for 4-Leader pt2pt, 4-Leader shmem and Conventional (RD). The 4-Leader shared memory approach is about 40% better than the 4-Leader point-to-point scheme.]

Performance of Multi-Leader schemes on large scale Multi-cores

[Figure: latency (usec) vs. message size for 1-, 2-, 4- and 8-Leader pt2pt and Conventional (RD) on 1024 processes of the TACC Ranger system. The 4-Leader point-to-point scheme outperforms the recursive doubling method.]

Performance of Multi-Leader schemes on large scale Multi-cores

[Figure: latency (usec) vs. message size (1 KB - 16 KB) for 1-, 2-, 4- and 8-Leader pt2pt and Conventional (Ring). The conventional Ring algorithm performs better for larger messages.]

Proposed Unified Scheme

Message Size       Intra-Node Mechanism   Inter-Leader Algorithm       Design
Small Messages     Point-to-Point         Recursive Doubling           Hierarchical
Medium Messages    Shared Memory          Recursive Doubling / Ring    Hierarchical
Large Messages     Point-to-Point         Ring                         Conventional
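
A unified scheme along these lines can dispatch on message size at run time. In the sketch below the cut-off values and the two hierarchical helper functions are hypothetical placeholders standing in for tuned multi-leader implementations; only ring_allgather refers to the earlier ring sketch.

#include <mpi.h>

/* Hypothetical hierarchical variants (pt2pt and shared-memory intra-node). */
void hierarchical_allgather_pt2pt(const void *sendbuf, int m, void *recvbuf,
                                  int total_procs, MPI_Comm node_comm,
                                  MPI_Comm leader_comm);
void hierarchical_allgather_shmem(const void *sendbuf, int m, void *recvbuf,
                                  int total_procs, MPI_Comm node_comm,
                                  MPI_Comm leader_comm);
void ring_allgather(const void *sendbuf, void *recvbuf, int m, MPI_Comm comm);

/* Placeholder thresholds; in practice these would be tuned per platform. */
enum { SMALL_MSG_MAX = 2048, MEDIUM_MSG_MAX = 16384 };

void unified_allgather(const void *sendbuf, int m, void *recvbuf,
                       int total_procs, MPI_Comm node_comm,
                       MPI_Comm leader_comm, MPI_Comm world_comm)
{
    if (m <= SMALL_MSG_MAX) {
        /* Small: hierarchical design, point-to-point intra-node,
         * recursive-doubling style exchange among the leaders. */
        hierarchical_allgather_pt2pt(sendbuf, m, recvbuf, total_procs,
                                     node_comm, leader_comm);
    } else if (m <= MEDIUM_MSG_MAX) {
        /* Medium: hierarchical design with shared-memory intra-node staging. */
        hierarchical_allgather_shmem(sendbuf, m, recvbuf, total_procs,
                                     node_comm, leader_comm);
    } else {
        /* Large: fall back to the conventional ring algorithm. */
        ring_allgather(sendbuf, recvbuf, m, world_comm);
    }
}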

Outline

•  Introduction and Background
•  Motivation
•  Related Work
•  Multi-Leader based Algorithms
•  Experimental Evaluation
•  Conclusions and Future Work

Conclusions & Future Work

•  Single Leader schemes are limited by scalability and memory contention. The proposed Multi-Leader schemes show significant performance benefits.

•  Future work:
   - Examine the benefits of kernel-based zero-copy intra-node exchanges for large messages.
   - Develop a framework that chooses leaders in an optimal manner for emerging multi-core systems.
   - Evaluate the impact of such designs on real-world applications.


http://mvapich.cse.ohio-state.edu

Thank you !

{kandalla, subramon, santhana, koop, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory