A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers

Maxime Martinasso∗, Grzegorz Kwasniewski†, Sadaf R. Alam∗, Thomas C. Schulthess∗‡§, Torsten Hoefler†

∗ Swiss National Supercomputing Centre, ETH Zurich, 6900 Lugano, Switzerland

† Department of Computer Science, ETH Zurich, Universitätstr. 6, 8092 Zurich, Switzerland

‡ Institute for Theoretical Physics, ETH Zurich, 8093 Zurich, Switzerland

§ Computer Science and Mathematics Division, Oak Ridge National Laboratory, USA


Why more densely populated accelerator servers?

accelerators are faster and more energy-efficient than CPUs

densely populated accelerator servers are high performance nodes

reduce space occupancy of the data center

SC16 – A PCIe performance model | 2


Cray CS Storm – new MeteoSwiss supercomputer

2 cabinets

12 hybrid computing nodes per cabinet

2 Intel Haswell 12-core CPUs per node

8 NVIDIA Tesla K80 GPU accelerators per node

2 GPU processors per accelerator

192 GPU processors in total

360 GPU teraflops in total

Production system

GPUs connected by PCI-Express


Building a densely populated accelerator server with PCIe:

generation 3, 16 GB/s using an x16 link

dual simplex (a pair of unidirectional links)

the pair of ports on a link exchange buffer availability

tree-based topology

[Figure: PCIe tree showing the CPUs, the Root Complex (RC), PCIe switches, and K80 accelerators holding GPUs 0 to 7. Legend: RC = PCIe Root Complex, PCIe switch, port, GPU, x16 link.]


Communication conflicts

Example communications: 0→7, 1→4, 6→7

[Figure: PCIe tree with GPUs 0 to 7, numbered nodes 8 to 19, and the Root Complex (RC), annotated with the conflict types listed here.]

Upstream port conflict

Downstream port conflict

Crossing the Root Complex conflict

Head-of-Line blocking


Motivation – COSMO halo exchange

[Figure: 2D domain decomposition and 3D domain decomposition of GPU0 to GPU7.]

Which order of communications is the fastest?

Example order 1:

0→4 1→0 2→1 3→7 4→0 5→4 6→2 7→6

0→1 1→2 2→6 3→2 4→5 5→6 6→7 7→3

1→5 2→3 5→1 6→5

Example order 2:

0→1 1→0 2→1 3→2 4→0 5→1 6→2 7→3

0→4 1→5 2→6 3→7 4→5 5→6 6→5 7→6

1→2 2→3 5→4 6→7

...

2D domain decomposition example: 20,736 possibilities

3D domain decomposition has more than 1.6 million possibilities
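These counts can be reproduced with a short sketch. It assumes (this is not stated on the slide) that the 2D decomposition places the eight GPUs on a 2×4 grid, the 3D decomposition on a 2×2×2 grid, and that every GPU may order its own sends to its face neighbours independently. The function names are hypothetical.

```python
from itertools import product
from math import factorial, prod

def neighbor_counts(shape):
    """Count the face neighbours of every cell of a grid without wrap-around."""
    counts = {}
    for cell in product(*(range(n) for n in shape)):
        count = 0
        for axis, size in enumerate(shape):
            for step in (-1, 1):
                if 0 <= cell[axis] + step < size:
                    count += 1
        counts[cell] = count
    return counts

def count_orderings(shape):
    """Each GPU with k neighbours can order its k sends in k! ways."""
    return prod(factorial(k) for k in neighbor_counts(shape).values())

orderings_2d = count_orderings((2, 4))     # assumed 2x4 layout
orderings_3d = count_orderings((2, 2, 2))  # assumed 2x2x2 layout
print(orderings_2d, orderings_3d)
```

Under these assumptions the 2×4 grid has four GPUs with two neighbours and four with three, and every GPU of the 2×2×2 grid has three neighbours.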


PCIe performance model

We want to identify the congestion factors ρ ∈ [0, 1] which limit the available bandwidth per communication at each communication phase.

[Flowchart]

Communication graph

model: compute

Step (A) Source arbitration

Step (B) Upstream port conflicts

Step (C) Downstream port conflicts + RC

Step (D) Head-of-Line blocking

Update messages

Remaining messages?

Yes: advance to next time step, update communication graph, and run the model again

No: End


Communication phase – update messages

Elapsed time $L_c$, message size $M_c$, set of communication phases $S_c$:

[Figure: communication phases laid out along a time axis.]

One phase:

$$L_{C_0} = t_0 = \frac{1}{B} \cdot \frac{m_{0,0}}{\rho_{0,0}} \quad\text{and}\quad M_{C_0} = m_{0,0}$$

Two phases:

$$L_{C_0} = t_0 + t_1 = \frac{1}{B} \cdot \left( \frac{m_{0,0}}{\rho_{0,0}} + \frac{m_{0,1}}{\rho_{0,1}} \right) \quad\text{and}\quad M_{C_0} = m_{0,0} + m_{0,1}$$

General case:

$$L_c = \sum_{i \in S_c} t_i = \frac{1}{B} \sum_{i \in S_c} \frac{m_{c,i}}{\rho_{c,i}} \quad\text{and}\quad M_c = \sum_{i \in S_c} m_{c,i}$$

(1) start time and $M_c$ are known

(2) $\rho_{c,i}$ are given by the model

with (1) and (2), $L_c$ is computable
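The two sums translate directly into code. The following is a minimal sketch: the function name `elapsed_and_size` and the numbers in the usage lines are illustrative assumptions, not values from the slides. Congestion factors are restricted to (0, 1] here, since a factor of 0 would mean no progress at all.

```python
def elapsed_and_size(bandwidth, phases):
    """Return (L_c, M_c) for one communication.

    bandwidth: uncongested bandwidth B, in data units per second
    phases: list of (m, rho) pairs, one per communication phase, where m is
            the data moved in that phase and rho its congestion factor
    """
    if any(not 0 < rho <= 1 for _, rho in phases):
        raise ValueError("each congestion factor must lie in (0, 1]")
    elapsed = sum(m / rho for m, rho in phases) / bandwidth  # L_c
    size = sum(m for m, _ in phases)                         # M_c
    return elapsed, size

# Hypothetical numbers: B = 10 units/s and two phases.
L_c, M_c = elapsed_and_size(10.0, [(100.0, 0.5), (50.0, 1.0)])
print(L_c, M_c)
```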


Model conflicts on switch

We want to identify the congestion factors ρ ∈ [0, 1] which limit the available bandwidth per communication at each communication phase.

Each communication enters a switch with a congestion factor ρ and leaves with a congestion factor ρ′

If $\sum_i \rho_i > 1$ then an arbitration policy is required, ρ′ = ρ otherwise


Upstream port conflict

Proportional sharing of available bandwidth
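One plausible reading of proportional sharing is sketched here: when the congestion factors entering an upstream port sum to more than 1, each factor is divided by that sum so the outputs add up to 1, and otherwise the factors pass through unchanged. The slide only names the policy, so both the scaling rule and the function name `share_upstream` are assumptions.

```python
def share_upstream(rhos):
    """Scale incoming congestion factors so that they never sum to more than 1."""
    total = sum(rhos)
    if total <= 1.0:
        return list(rhos)          # no conflict: rho' = rho
    return [rho / total for rho in rhos]

print(share_upstream([1.0, 1.0]))  # two full-rate communications share one port
print(share_upstream([0.3, 0.4]))  # sum below 1, nothing to arbitrate
```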


Downstream port conflict

Round-robin policy

Performance reduction for crossing the root complex

$C_R$: set of communications crossing the root complex

$n$: number of grouped communication sets

$R$: congestion factor of a grouped communication set

$\tau$: congestion factor for crossing the root complex

if $C_R = \emptyset$ then

$$R' = \frac{1}{n}$$

if $C_R \neq \emptyset$ then

$$R' = \begin{cases} \min\left(\max\left(\frac{1}{n} - \tau,\ 0\right),\ R\right) & \text{if } R \text{ contains comm.} \in C_R \\ \min\left(\frac{1}{n} + \tau,\ R\right) & \text{otherwise} \end{cases}$$
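The case distinction can be written as a small function. This is a sketch of the formula as printed, with hypothetical parameter names; how communications are grouped into sets is not covered here.

```python
def downstream_factor(n, R, tau, crosses_rc, any_crossing):
    """Congestion factor R' of one grouped communication set at a downstream port.

    n: number of grouped communication sets competing for the port
    R: congestion factor of this set when it reaches the port
    tau: congestion factor for crossing the root complex
    crosses_rc: True if this set contains a communication in C_R
    any_crossing: True if C_R is non-empty
    """
    fair_share = 1.0 / n
    if not any_crossing:
        return fair_share
    if crosses_rc:
        return min(max(fair_share - tau, 0.0), R)
    return min(fair_share + tau, R)

# Two sets competing with tau = 0.2, one of them crossing the root complex.
print(downstream_factor(2, 0.5, 0.2, crosses_rc=True, any_crossing=True))
print(downstream_factor(2, 1.0, 0.2, crosses_rc=False, any_crossing=True))
```

With τ = 0.2 and n = 2 this gives 0.3 and 0.7, the same values that appear as 3/10 and 7/10 in the complete example.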


Complete example

[Figure: PCIe tree with GPUs 0 to 7, numbered nodes 8 to 19, and the Root Complex (RC).]

τ = 0.2

Message size: 300 MB

Bandwidth: 11.6 GB/s

Model steps: (A) Source arbitration, (B) Upstream port conflicts, (C) Downstream port conflicts + RC, (D) Head-of-Line blocking, then update messages. While messages remain, advance to the next time step and update the communication graph.

Congestion factors through Steps (A) to (D)

comm.       congestion factors, in order
(a) 0→2     1     1/2    1/2     3/10    3/10
(b) 1→4     1     1/2    3/10    3/10    3/10
(c) 3→2     1     1      1/2     1/2     7/10
(d) 6→4     1     1      7/10    7/10    7/10

Congestion graph step 1

comm.       cong. factor    data remaining    elapsed time
(a) 0→2     3/10            128 MB            36 ms
(b) 1→4     3/10            128 MB            36 ms
(c) 3→2     7/10            0 MB              36 ms
(d) 6→4     7/10            0 MB              36 ms

Congestion graph step 2

comm.       cong. factor    data remaining    elapsed time
(a) 0→2     1/2             0 MB              65 ms
(b) 1→4     1/2             0 MB              65 ms
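The two congestion graph steps can be replayed with a short script. This is a sketch under assumptions: the congestion factors of each step are copied from the tables rather than derived, and MB and GB/s are read as binary units (MiB and GiB/s), which is the reading that reproduces Tref = 25.3 ms for a 300 MB message at 11.6 GB/s. The function name `replay` is hypothetical.

```python
B = 11.6 * 1024   # bandwidth in MiB/s (assumes 11.6 GB/s means GiB/s)
MESSAGE = 300.0   # message size in MiB (assumes MB means MiB)

def replay(steps, message=MESSAGE, bandwidth=B):
    """Advance step by step; each step ends when its first communication finishes.

    steps: list of {communication: congestion factor} dicts, one per step.
    Returns the finish time of each communication and the data left over.
    """
    remaining = {comm: message for comm in steps[0]}
    finished = {}
    clock = 0.0
    for rhos in steps:
        # Time until the first active communication runs out of data.
        dt = min(remaining[c] / (rho * bandwidth) for c, rho in rhos.items())
        clock += dt
        for c, rho in rhos.items():
            remaining[c] -= rho * bandwidth * dt
            if remaining[c] < 1e-9:   # absorb floating-point residue
                remaining[c] = 0.0
                finished[c] = clock
    return finished, remaining

steps = [
    {"0->2": 0.3, "1->4": 0.3, "3->2": 0.7, "6->4": 0.7},  # congestion graph step 1
    {"0->2": 0.5, "1->4": 0.5},                            # congestion graph step 2
]
finished, remaining = replay(steps)
for comm, t in finished.items():
    print(f"{comm}: done after {t * 1e3:.1f} ms")
```

Under these assumptions the elapsed times come out close to the 36 ms and 65 ms in the tables. The sketch also suggests that (a) and (b) have moved about 128.6 MB when step 1 ends and still have about 171 MB to send, so the 128 MB entry in the step 1 table appears to match the data already transferred.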


Model Validation

Architecture parameters:

B = 11.6 GB/s

τ = 0.1735

22,259 graphs:

non-isomorphic

cudaMemcpyAsync

Communication pattern: scatter, gather, all-to-all

Entire set of graphs for subsets of GPUs

Randomly generated

≈100K communications

Message size: 300 MB

Time with no contention: Tref = 25.3 ms

95% of communications are in the ±15% range


Back to the motivation – COSMO halo exchange

Upper limit on time to solution, throughput approach

Running mode: one instance per socket (8 GPUs)

Large domain size: 256x256x80 per GPU

One step triggers 312 halo exchanges

Message size: 40 KB to 254 KB

Uses MPI

$(3!)^4 \times (2!)^4 = 20{,}736$ communication graphs for 2D domain

[Figure: congestion graphs sorted by elapsed time; y-axis: Time [s], 0.14 to 0.28. Annotations: "current implemented graph in COSMO", 1.6x, 1.9x. Insets: 2D and 3D domain decompositions of GPU0 to GPU7.]


Fastest schedule for 2D decomposition

[Figure: three panels, each pairing the PCIe tree (RC, GPUs 0 to 7) with the GPU layout (rows 4 5 6 7 and 0 1 2 3).]


Fastest schedule for 3D decomposition

[Figure: three panels, each pairing the PCIe tree (RC, GPUs 0 to 7) with the GPU layout (4 5 6 7 and 0 1 2 3).]


COSMO improvement – fastest schedule

[Figure: three panels, each pairing the PCIe tree (RC, GPUs 0 to 7) with the GPU layout (4 5 6 7 and 0 1 2 3).]

COSMO gain: 5.6% per halo exchange step; the gain is limited by MPI 2-sided overhead.


Conclusion

- Latency not modeled

- MPI 2-sided overhead not modeled (use one-sided?)

+ Captures all PCIe features including congestion

+ Simple model: only 2 parameters (B and τ)

+ Precise for large messages

+ Design of topology-aware algorithms

+ COSMO halo exchange performance gain


Thank you for your attention.


[Figure: measured bandwidth [GB/s] (0 to 12) versus message size [Byte] (2 to 32M). Series: 0→1 or 0→3 alone; 4→1 alone; 0→3 in conflict 1; 0→1 in conflict 2; 0→1 in conflict 3. Annotations: 1.88x slower for conflict 1; 1.80x slower for conflict 2; 1.44x slower for conflict 3; 1.21x.]

$\tau = 1 - 1/1.21$
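Evaluating this expression gives approximately 0.1736, which agrees with the τ = 0.1735 listed under the architecture parameters up to rounding. A minimal sketch; the variable names are illustrative.

```python
slowdown = 1.21              # slowdown factor annotated on the plot
tau = 1.0 - 1.0 / slowdown   # tau = 1 - 1/1.21
print(f"tau = {tau:.4f}")
```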


[Figure: topologies T1 and T2, each showing an RC socket, K80 accelerators, and GPUs 0 to 7. Legend: RC = PCIe Root Complex, PCIe switch, port, GPU, x16 link.]
