1
U.C. Berkeley and LBNL U.C. Berkeley and LBNL • Global address space languages support • Global pointers and distributed arrays • User controls layout of data across nodes • Direct read and write to remote memory • Single Program Multiple Data (SPMD) control • Similar to using threads, but with remote accesses • Global synchronization, barriers • Languages: UPC, Co-Array Fortran, Titanium • GASNet - A common communication system tailored for global address space languages Distributed Data Structures Latency Performance GASNet Goals • Language-independence: Compatibility with several global-address space languages and compilers • UPC, Titanium, Co-array Fortran, possibly others.. • Hide language- or compiler-specific details, such as shared-pointer representation • Hardware-independence: variety of parallel architectures & OS's • SMP: Linux/UNIX SMP's, Origin 2000, etc. • Clusters of uniprocessors or SMP's: IBM SP, Compaq AlphaServer, Linux/UNIX clusters, etc. • Support many high-performance networks: Infiniband, Myrinet/GM, Quadrics/elan, IBM/LAPI, Dolphin, MPI • Ease of implementation on new hardware • Allow quick prototype implementations • Implementations can leverage performance features of hardware • Provide both portability & performance GASNet Core API Global Address Space Languages GASNet Extended API Bandwidth Performance Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Kathy Yelick Parry Husbands, Costin Iancu, Mike Welcome, Kathy Yelick Compiler-generated code Compiler-specific runtime system GASNet Extended API GASNet Core API Network Hardware • Wider interface that includes more complicated operations • We provide a reference implementation of the extended API in terms of the core API • Implementors can choose to directly implement any subset for performance - leverage hardware support for higher-level operations • Most basic required network primitives • Implemented directly on each platform • Minimal set of network functions needed to support a working implementation • General enough to implement everything else • Based heavily on active messages paradigm • Provides powerful extensibility mechanism http:// http:// upc.nersc.gov upc.nersc.gov [email protected] [email protected] Bandw idth ofG ASNetvapi-conduitand M V A P IC H 0.9.1 M PI 0 100 200 300 400 500 600 700 800 1 2 4 8 16 32 64 128 256 512 1K B 2K B 4KB 8KB 16KB 32 KB 64 KB 12 8 K B 256K B 51 2 K B 1M B M essage S ize (bytes) Bandw idth (M B /sec) M VA PIC H M PI gasnet_put_nb_bulk (source pre-pinned) gasnet_put_nb_bulk (source not pinned) • Using Mellanox VAPI interface • Vendor implementation of the InfiniBand Verbs • Useful minor extensions beyond Verbs • GASNet Core API: Active Messages • Based on Send/Recv operations • Simple flow control • Uses an additional thread for improved responsiveness • GASNet Extended API: RDMA • Thin layer over InfiniBand RDMA (puts and gets) • Simple record attached to each CQE for completion • No dynamic memory registration yet (bounce-buffers used for out-of-segment references) Future Work Implementing GASNet on InfiniBand R oundtrip Latency ofG ASNetvapi-conduitand M V A P IC H 0.9.1 M PI 0 5 10 15 20 25 30 1 2 4 8 16 32 64 128 256 512 1KB 2KB M essage S ize (bytes) R oundtrip Latency (us) M P I_S end/M P I_R ecv ping-pong gasnet_put_nb + sync System Description •4-node cluster •Dual-CPU 2.8Ghz Pentium 4 Xeon, 4GB mem •Red Hat 8.0 w/ Linux 2.4.18-14 •Mellanox InfiniHost (Cougar) IB-4X HCAs –Firmware 3.0, SDK 3.0.1 •DivergeNet 8-port IB-4X switch Firehose Algorithm for distributed management of DMA registration VAPI conduit: • Dynamic memory registration (work in progress) • Extension of “firehose” algorithm for region- oriented registration • C. Bell and D. Bonachea. “A New DMA Registration Strategy for Pinning-Based High Performance Networks CAC 2003 • RDMA-based barrier • General performance tuning General GASNet: • Collective Communication • Point-to-point Scatter/Gather and Strided put/get operations • New GASNet conduit implementations: • Florida State University: SCI/Dolphin • UC Berkeley: shmem (Cray, SGI and Quadrics), and Cray X-1 G A SN etPut/G etR oundtrip Latency (m in overm sg sz) 0 5 10 15 20 25 30 35 40 45 50 55 60 65 mpi elan mpi elan mpi elan mpi elan m pi gm m pi gm mpi lapi lapi- poll m pi gm m pi vapi R oundtrip Latency (m icroseconds) put_nb get_nb quadrics falcon (Alpha) quadrics colt (Alpha) quadrics opus (IA 64) quadrics lemieux (Alpha) myrinet alvarez (x86) Colony/GX seaborg (PowerPC) infiniband pcp (x86 P C I-X) myrinet citris (IA 64) myrinet pcp (x86 P C I-X) G A SN etPut/G etB ulk Flood B andw idth (m ax over m sg sz) 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 mpi elan mpi elan mpi elan mpi elan m pi gm m pi gm mpilapilapi- poll m pi gm m pi vapi Bandw idth (M B /sec) put_nb_bulk get_nb_bulk quadrics falcon (Alpha) quadrics colt (Alpha) quadrics opus (IA64) quadrics lemieux (Alpha) myrinet alvarez (x86) Colony/GX seaborg (P ow erP C) infiniband pcp (x86 P CI-X) myrinet citris (IA64) myrinet pcp (x86 P CI-X) M yrinet-IA 64 R oundtrip ping-pong latency 0 5 10 15 20 25 30 35 1 10 100 1000 10000 M essage S ize (bytes) Roundtrip latency (us) put get put_nbi get_nbi put_nb get_nb Q uadrics-A lphaS erver Flood B andw idth (bulk) 0 50 100 150 200 250 300 0 10000 20000 30000 40000 50000 60000 70000 M essage S ize (bytes) B andw idth (M B /sec) put_bulk get_bulk put_nbi_bulk get_nbi_bulk put_nb_bulk get_nb_bulk Q uadrics-A lphaS erver R oundtrip ping-pong latency 0 2 4 6 8 10 12 14 16 18 20 1 10 100 1000 10000 M essage S ize (bytes) Roundtrip latency (us) put get put_nbi get_nbi put_nb get_nb M yrinet-IA 64 Flood B andw idth (bulk) 0 50 100 150 200 250 0 10000 20000 30000 40000 50000 60000 70000 M essage S ize (bytes) Bandw idth (M B /sec) put_bulk get_bulk put_nbi_bulk get_nbi_bulk put_nb_bulk get_nb_bulk

U.C. Berkeley and LBNL

Embed Size (px)

DESCRIPTION

Compiler-generated code. Compiler-specific runtime system. GASNet Extended API. GASNet Core API. Network Hardware. Distributed Data Structures. Firehose Algorithm for distributed management of DMA registration. System Description 4-node cluster Dual-CPU 2.8Ghz Pentium 4 Xeon, 4GB mem - PowerPoint PPT Presentation

Citation preview

Page 1: U.C. Berkeley and LBNL

U.C. Berkeley and LBNLU.C. Berkeley and LBNL

• Global address space languages support

• Global pointers and distributed arrays

• User controls layout of data across nodes

• Direct read and write to remote memory

• Single Program Multiple Data (SPMD) control

• Similar to using threads, but with remote accesses

• Global synchronization, barriers

• Languages: UPC, Co-Array Fortran, Titanium

• GASNet - A common communication system tailored for global address space languages

• Global address space languages support

• Global pointers and distributed arrays

• User controls layout of data across nodes

• Direct read and write to remote memory

• Single Program Multiple Data (SPMD) control

• Similar to using threads, but with remote accesses

• Global synchronization, barriers

• Languages: UPC, Co-Array Fortran, Titanium

• GASNet - A common communication system tailored for global address space languages

Distributed Data Structures

Distributed Data Structures

Latency PerformanceLatency Performance

GASNet GoalsGASNet Goals• Language-independence: Compatibility with several global-address

space languages and compilers

• UPC, Titanium, Co-array Fortran, possibly others..

• Hide language- or compiler-specific details, such as shared-pointer representation

• Hardware-independence: variety of parallel architectures & OS's

• SMP: Linux/UNIX SMP's, Origin 2000, etc.

• Clusters of uniprocessors or SMP's: IBM SP, Compaq AlphaServer, Linux/UNIX clusters, etc.

• Support many high-performance networks: Infiniband, Myrinet/GM, Quadrics/elan, IBM/LAPI, Dolphin, MPI

• Ease of implementation on new hardware

• Allow quick prototype implementations

• Implementations can leverage performance features of hardware

• Provide both portability & performance

• Language-independence: Compatibility with several global-address space languages and compilers

• UPC, Titanium, Co-array Fortran, possibly others..

• Hide language- or compiler-specific details, such as shared-pointer representation

• Hardware-independence: variety of parallel architectures & OS's

• SMP: Linux/UNIX SMP's, Origin 2000, etc.

• Clusters of uniprocessors or SMP's: IBM SP, Compaq AlphaServer, Linux/UNIX clusters, etc.

• Support many high-performance networks: Infiniband, Myrinet/GM, Quadrics/elan, IBM/LAPI, Dolphin, MPI

• Ease of implementation on new hardware

• Allow quick prototype implementations

• Implementations can leverage performance features of hardware

• Provide both portability & performance

GASNet Core APIGASNet Core API

Global Address Space LanguagesGlobal Address Space Languages

GASNet Extended APIGASNet Extended API

Bandwidth PerformanceBandwidth Performance

Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Kathy YelickParry Husbands, Costin Iancu, Mike Welcome, Kathy Yelick

Compiler-generated code

Compiler-specific runtime system

GASNet Extended API

GASNet Core API

Network Hardware

• Wider interface that includes more complicated operations

• We provide a reference implementation of the extended API in terms of the core API

• Implementors can choose to directly implement any subset for performance - leverage hardware support for higher-level operations

• Wider interface that includes more complicated operations

• We provide a reference implementation of the extended API in terms of the core API

• Implementors can choose to directly implement any subset for performance - leverage hardware support for higher-level operations

• Most basic required network primitives

• Implemented directly on each platform

• Minimal set of network functions needed to support a working implementation

• General enough to implement everything else

• Based heavily on active messages paradigm

• Provides powerful extensibility mechanism

• Most basic required network primitives

• Implemented directly on each platform

• Minimal set of network functions needed to support a working implementation

• General enough to implement everything else

• Based heavily on active messages paradigm

• Provides powerful extensibility mechanism

http://upc.nersc.govhttp://upc.nersc.gov [email protected]@lbl.gov

Bandwidth of GASNet vapi-conduit and MVAPICH 0.9.1 MPI

0

100

200

300

400

500

600

700

800

1 2 4 8 16 32 64 128

256

512

1KB

2KB

4KB

8KB

16KB

32KB

64KB

128K

B

256K

B

512K

B1M

B

Message Size (bytes)

Ba

nd

wid

th (

MB

/se

c)

MVAPICH MPI

gasnet_put_nb_bulk (source pre-pinned)

gasnet_put_nb_bulk (source not pinned)

• Using Mellanox VAPI interface

• Vendor implementation of the InfiniBand Verbs

• Useful minor extensions beyond Verbs

• GASNet Core API: Active Messages

• Based on Send/Recv operations

• Simple flow control

• Uses an additional thread for improved responsiveness

• GASNet Extended API: RDMA

• Thin layer over InfiniBand RDMA (puts and gets)

• Simple record attached to each CQE for completion

• No dynamic memory registration yet (bounce-buffers used for out-of-segment references)

• Using Mellanox VAPI interface

• Vendor implementation of the InfiniBand Verbs

• Useful minor extensions beyond Verbs

• GASNet Core API: Active Messages

• Based on Send/Recv operations

• Simple flow control

• Uses an additional thread for improved responsiveness

• GASNet Extended API: RDMA

• Thin layer over InfiniBand RDMA (puts and gets)

• Simple record attached to each CQE for completion

• No dynamic memory registration yet (bounce-buffers used for out-of-segment references)

Future WorkFuture WorkImplementing GASNet on InfiniBandImplementing GASNet on InfiniBand

Roundtrip Latency of GASNet vapi-conduit and MVAPICH 0.9.1 MPI

0

5

10

15

20

25

30

1 2 4 8 16 32 64 128 256 512 1KB 2KB

Message Size (bytes)

Ro

un

dtr

ip L

aten

cy (u

s)

MPI_Send/MPI_Recv ping-pong

gasnet_put_nb + sync

System Description•4-node cluster•Dual-CPU 2.8Ghz Pentium 4 Xeon, 4GB mem•Red Hat 8.0 w/ Linux 2.4.18-14•Mellanox InfiniHost (Cougar) IB-4X HCAs

–Firmware 3.0, SDK 3.0.1•DivergeNet 8-port IB-4X switch

System Description•4-node cluster•Dual-CPU 2.8Ghz Pentium 4 Xeon, 4GB mem•Red Hat 8.0 w/ Linux 2.4.18-14•Mellanox InfiniHost (Cougar) IB-4X HCAs

–Firmware 3.0, SDK 3.0.1•DivergeNet 8-port IB-4X switch

Firehose Algorithm for distributed management

of DMA registration

Firehose Algorithm for distributed management

of DMA registration

VAPI conduit:

• Dynamic memory registration (work in progress)

• Extension of “firehose” algorithm for region-oriented registration

• C. Bell and D. Bonachea. “A New DMA Registration Strategy for Pinning-Based High Performance Networks ” CAC 2003

• RDMA-based barrier

• General performance tuning

General GASNet:

• Collective Communication

• Point-to-point Scatter/Gather and Strided put/get operations

• New GASNet conduit implementations:

• Florida State University: SCI/Dolphin

• UC Berkeley: shmem (Cray, SGI and Quadrics), and Cray X-1

VAPI conduit:

• Dynamic memory registration (work in progress)

• Extension of “firehose” algorithm for region-oriented registration

• C. Bell and D. Bonachea. “A New DMA Registration Strategy for Pinning-Based High Performance Networks ” CAC 2003

• RDMA-based barrier

• General performance tuning

General GASNet:

• Collective Communication

• Point-to-point Scatter/Gather and Strided put/get operations

• New GASNet conduit implementations:

• Florida State University: SCI/Dolphin

• UC Berkeley: shmem (Cray, SGI and Quadrics), and Cray X-1

GASNet Put/Get Roundtrip Latency (min over msg sz)

0

5

10

15

20

25

30

35

40

45

50

55

60

65

mpi elan mpi elan mpi elan mpi elan mpi gm mpi gm mpi lapi lapi-poll

mpi gm mpi vapi

Ro

un

dtr

ip L

ate

nc

y (

mic

ros

ec

on

ds

)

put_nb

get_nb

quadrics falcon(Alpha)

quadrics colt

(Alpha)

quadrics opus(IA64)

quadrics lemieux(Alpha)

myrinetalvarez(x86)

Colony/GXseaborg

(PowerPC)

infinibandpcp

(x86 PCI-X)

myrinetcitris(IA64)

myrinetpcp

(x86 PCI-X)

GASNet Put/Get Bulk Flood Bandwidth (max over msg sz)

0

50

100

150

200

250

300

350

400

450

500

550

600

650

700

750

mpi elan mpi elan mpi elan mpi elan mpi gm mpi gm mpi lapi lapi-poll

mpi gm mpi vapi

Ba

nd

wid

th (

MB

/se

c)

put_nb_bulk

get_nb_bulk

quadrics falcon(Alpha)

quadrics colt

(Alpha)

quadrics opus(IA64)

quadrics lemieux(Alpha)

myrinetalvarez(x86)

Colony/GXseaborg

(PowerPC)

infinibandpcp

(x86 PCI-X)

myrinetcitris(IA64)

myrinetpcp

(x86 PCI-X)

Myrinet-IA64 Roundtrip ping-pong latency

0

5

10

15

20

25

30

35

1 10 100 1000 10000

Message Size (bytes)

Rou

ndtr

ip la

tenc

y (u

s)

putgetput_nbiget_nbiput_nbget_nb

Quadrics-AlphaServer Flood Bandwidth (bulk)

0

50

100

150

200

250

300

0 10000 20000 30000 40000 50000 60000 70000

Message Size (bytes)

Ban

dw

idth

(MB

/sec

)

put_bulkget_bulkput_nbi_bulkget_nbi_bulkput_nb_bulkget_nb_bulk

Quadrics-AlphaServer Roundtrip ping-pong latency

0

2

4

6

8

10

12

14

16

18

20

1 10 100 1000 10000

Message Size (bytes)

Ro

un

dtr

ip la

tenc

y (u

s)

putgetput_nbiget_nbiput_nbget_nb

Myrinet-IA64 Flood Bandwidth (bulk)

0

50

100

150

200

250

0 10000 20000 30000 40000 50000 60000 70000

Message Size (bytes)

Ban

dw

idth

(MB

/sec

)

put_bulkget_bulkput_nbi_bulkget_nbi_bulkput_nb_bulkget_nb_bulk