Category: Supercomputing Poster
SC13 Contact Name: Toshihiro Hanawa (hanawa@ccs.tsukuba.ac.jp)


HA-PACS/TCA: Tightly Coupled Accelerators for Low-Latency Communication among GPUs

Center for Computational Sciences

[Figure 4 content: each node's PCIe address space (CPU, GPU0, GPU1, PEACH2 internal memory) is mapped into a common PEACH2 address space spanning Nodes 0-3 and their N/E/W ports. Figure 5 content: front and top views of the PEACH2 board.]

Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato
Center for Computational Sciences, University of Tsukuba, 1-1-1 Tennodai, Tsukuba 305-8577, Japan

Table 1. Evaluation Environment
  CPU            Intel Xeon E5-2670, 2.6 GHz, 8 cores x 2 sockets
  Memory         DDR3-1600, 4 channels, 128 Gbytes
  Motherboard    SuperMicro X9DRG-QF / Intel S2600IP
  PEACH2 (FPGA)  Altera Stratix IV 530GX (EP4SGX530NF45C2)
  OS             CentOS 6.3, kernel 2.6.32-279.19.1.el6.x86_64
  GPU            NVIDIA K20c, driver NVIDIA-Linux-x86_64-310.32, CUDA 5.0
  MPI            MVAPICH2 1.8.1 (with the “enable-cuda” option)

Concept
GPGPU is now widely used to accelerate scientific and engineering computing, improving performance significantly with less power consumption. However, the I/O bandwidth bottleneck causes serious performance degradation in GPGPU computing.

To solve this problem, TCA (Tightly Coupled Accelerators) enables direct communication among multiple GPUs across computation nodes.

The HA-PACS project is supported by the MEXT special fund program “Research and Education on Interdisciplinary Computational Science Based on Exascale Computing Technology Development” and by the JST/CREST program “Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era,” in the research area “Development of System Software Technologies for post-Peta Scale High Performance Computing.”

Contact
E-mail: [email protected]
URL: http://www.ccs.tsukuba.ac.jp/CCS/eng/research-activities/projects/ha-pacs

Figure 5. TCA Communication (PEACH2) Board

Preliminary Evaluation

Figure 2. Block Diagram of TCA Nodes of HA-PACS/TCA

[Figure 3 content: the PEACH2 chip comprises a NIOS CPU, internal memory, a DMAC, and a routing function connected to four PCIe ports (N, E, W, and S); one port faces the CPU & GPU side as an endpoint, and the other ports connect to neighboring PEACH2 chips as root complex and/or endpoint.]

PCI Express Direct Communication Link
PEARL (PCI Express Adaptive and Reliable Link): a concept for direct communication using PCI Express technology.
PEACH2 (PCI Express Adaptive Communication Hub ver. 2): an implementation of the communication hardware for PEARL.
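
To make “direct communication using PCI Express technology” concrete, here is a minimal, hypothetical sketch of a user-space store into a remote node's memory through a mapped PCIe window. The device file name /dev/peach2 and the window layout are assumptions made for illustration; they are not taken from the poster or from any released driver.

    /* Hedged sketch only: user-space direct PCIe communication.
     * "/dev/peach2", the window layout, and the offset are hypothetical
     * names invented for illustration, not an actual driver interface. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define WINDOW_SIZE (1UL << 20)          /* assumed 1 MiB window per remote node */

    int main(void)
    {
        int fd = open("/dev/peach2", O_RDWR);             /* hypothetical device file */
        if (fd < 0) { perror("open"); return 1; }

        /* Map the PCIe window the driver would expose for a remote node.
         * Because remote memory appears in the local PCIe address space,
         * plain stores to this mapping travel over the PEACH2 link.      */
        volatile uint64_t *remote = (volatile uint64_t *)
            mmap(NULL, WINDOW_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (remote == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        remote[0] = 42;                      /* PIO write lands in the remote node */

        munmap((void *)remote, WINDOW_SIZE);
        close(fd);
        return 0;
    }

Because the remote memory is reached by ordinary loads and stores over the PCIe link, no copy through the host network stack is needed, which is how a PCIe-level link can keep latency low.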

HA-PACS System
HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences) is the 8th generation of the PACS/PAX series of supercomputers at the Center for Computational Sciences, University of Tsukuba. The HA-PACS system consists of the following two parts.

The HA-PACS base cluster has been operated for development and production runs of advanced scientific computations since February 2012, with 802 Tflops of peak performance. Each node is equipped with two Intel Xeon E5 CPUs, four NVIDIA M2090 GPUs, and dual-port InfiniBand QDR.

HA-PACS/TCA is an extension of the base cluster; each node has not only an InfiniBand HCA but also a TCA communication board (PEACH2 board) as a proprietary network for GPUs.

Figure 1. Photograph of HA-PACS base cluster system (26 racks)

Figure 4. Address Mapping and Routing Mechanism on TCA

Overview of PEACH2 Chip
• PCI Express Gen2 x8, 4 ports: 1 port for host connection and 3 ports for node-to-node connection over PCIe external cables
• Implemented on an FPGA (Altera Stratix IV GX)
• Clock frequency: 250 MHz
• 4-channel DMA controller with a chaining DMA mechanism and block-stride access support
• A unique PCIe address is assigned to each node and device within a TCA sub-cluster, so routing can be decided simply by comparing the destination PCIe address (Fig. 4); a hypothetical decoding sketch follows this list.
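
As a rough illustration of address-based routing, the sketch below decodes a node-ID field from the destination PCIe address and picks an output port. The bit positions, the port names, and the shortest-direction ring heuristic are assumptions made for illustration; the actual layout is the one defined in Fig. 4.

    /* Hedged sketch of PCIe-address-based routing. NODE_ID_SHIFT/MASK and the
     * ring heuristic are invented for illustration; only the principle --
     * "route by comparing the destination PCIe address" -- comes from the poster. */
    #include <stdint.h>
    #include <stdio.h>

    enum port { PORT_HOST, PORT_N, PORT_E, PORT_W };

    #define NODE_ID_SHIFT 36
    #define NODE_ID_MASK  0xFULL             /* up to 16 nodes per TCA sub-cluster */

    static inline unsigned dest_node(uint64_t pcie_addr)
    {
        return (unsigned)((pcie_addr >> NODE_ID_SHIFT) & NODE_ID_MASK);
    }

    /* Deliver locally when the node ID matches; otherwise forward on the port
     * that moves the packet the shorter way around the ring of nodes. */
    static enum port route(uint64_t dest_addr, unsigned my_node, unsigned nodes)
    {
        unsigned dst = dest_node(dest_addr);
        if (dst == my_node)
            return PORT_HOST;                /* hand over to the CPU/GPU side */
        unsigned hops_east = (dst - my_node + nodes) % nodes;
        return (hops_east <= nodes / 2) ? PORT_E : PORT_W;
    }

    int main(void)
    {
        /* Example: node 2 routing a packet addressed to node 7 in a 16-node ring. */
        uint64_t addr = (uint64_t)7 << NODE_ID_SHIFT;
        printf("forward via port %d\n", (int)route(addr, 2, 16));
        return 0;
    }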

Figure 3. Block Diagram of PEACH2 chip

• GPU-to-GPU communication across nodes under TCA is realized by GPUDirect RDMA support with Kepler GPUs and CUDA; a hypothetical usage sketch follows this list.
• A TCA sub-cluster of 8 to 16 nodes is built up as a basic unit, and TCA sub-clusters are connected with InfiniBand.
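
The sketch below shows how a GPU buffer might be handed to the PEACH2 DMA engine once GPUDirect RDMA exposes it on the PCIe bus. cudaMalloc is the real CUDA runtime call; struct tca_dma_desc and tca_dma_kick() are hypothetical stand-ins, since the poster does not spell out the driver API.

    /* Hedged sketch: feeding a GPU buffer to a (hypothetical) chaining DMA. */
    #include <cuda_runtime.h>
    #include <stdint.h>
    #include <stdio.h>

    struct tca_dma_desc {                    /* hypothetical chaining-DMA descriptor */
        uint64_t src;                        /* local GPU buffer (device address)    */
        uint64_t dst;                        /* remote GPU buffer in PCIe addr space */
        size_t   len;
        struct tca_dma_desc *next;           /* chaining: next transfer, or NULL     */
    };

    extern int tca_dma_kick(struct tca_dma_desc *chain);     /* hypothetical */

    int main(void)
    {
        float *gpu_buf = NULL;
        size_t len = 1 << 20;

        /* With GPUDirect RDMA (Kepler + CUDA 5.0), this device allocation can be
         * exposed on the PCIe bus so PEACH2 reads it without staging through
         * host memory. */
        if (cudaMalloc((void **)&gpu_buf, len) != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed\n");
            return 1;
        }

        struct tca_dma_desc d = {
            .src  = (uint64_t)(uintptr_t)gpu_buf,
            .dst  = 0,                       /* remote GPU address would go here */
            .len  = len,
            .next = NULL,                    /* single transfer; chain more for strides */
        };
        return tca_dma_kick(&d);             /* hypothetical: start the DMA */
    }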

Hybrid Communication between TCA and InfiniBand

[Diagram: two TCA nodes, each with two CPUs (CPU0/CPU1 over QPI), four GPUs (GPU0-GPU3) on PCIe Gen3 x16, an InfiniBand HCA on PCIe Gen3 x8, and a PEACH2 board on PCIe Gen2 x8; the PEACH2 boards of neighboring nodes are linked by PCIe Gen2 x8 cables, and every node is also connected to the InfiniBand network.]
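
One plausible way to use the hybrid design is to choose the path per destination: the PEACH2 link inside an 8-to-16-node sub-cluster, and MPI over InfiniBand between sub-clusters. In the sketch below the fixed SUBCLUSTER_SIZE and the tca_put() call are hypothetical; MPI_Send is standard MPI (MVAPICH2 in Table 1).

    /* Hedged sketch of hybrid path selection between TCA and InfiniBand. */
    #include <mpi.h>
    #include <stddef.h>

    #define SUBCLUSTER_SIZE 16               /* assumed TCA sub-cluster size (8-16 nodes) */

    extern int tca_put(int dest_rank, void *buf, size_t len);     /* hypothetical */

    static int same_subcluster(int a, int b)
    {
        return a / SUBCLUSTER_SIZE == b / SUBCLUSTER_SIZE;
    }

    /* Intra-sub-cluster traffic takes the low-latency PEACH2 path; traffic that
     * crosses sub-cluster boundaries falls back to InfiniBand via MPI. */
    int hybrid_send(int my_rank, int dest, void *buf, size_t len)
    {
        if (same_subcluster(my_rank, dest))
            return tca_put(dest, buf, len);
        return MPI_Send(buf, (int)len, MPI_BYTE, dest, 0 /* tag */, MPI_COMM_WORLD);
    }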

Future Plan
HA-PACS/TCA will be constructed in the fall of 2013 with a PEACH2 board in each node. HA-PACS/TCA will be operated together with the HA-PACS base cluster, making HA-PACS a peta-scale class HPC cluster overall.

Figure 7. DMA bandwidth between components in a local node by PEACH2 (128 repetitions, measured within the driver). Axes: data size (bytes) vs. bandwidth (Mbytes/sec); series: CPU-to-CPU, CPU-to-GPU, GPU-to-CPU, GPU-to-GPU.


Summary of Evaluation
• PEACH2 achieves 90% of the theoretical peak of PCIe Gen2 x8, at 3.3 Gbytes/sec.
• GPU read performance is limited to about 800 Mbytes/sec, while GPU write performance is almost the same as CPU read/write.
• Ping-pong performance is nearly at its maximum by 128 Kbytes (see the measurement sketch below).
• Within a node, PEACH2 is 3.5x faster than CUDA p2p copy with UVA, and it is 6.8x faster than MVAPICH2 for short message sizes.
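
For context on Figs. 8 and 9, a ping-pong measurement typically times a send/echo pair and reports the message size divided by half the average round-trip time. The skeleton below is a generic sketch of that procedure, not the authors' benchmark code; comm_send()/comm_recv() are hypothetical wrappers for whichever path (PEACH2, MVAPICH2, or CUDA p2p copy) is being timed.

    /* Hedged sketch of a user-level ping-pong bandwidth measurement. */
    #include <stddef.h>
    #include <time.h>

    extern void comm_send(const void *buf, size_t len);      /* hypothetical */
    extern void comm_recv(void *buf, size_t len);            /* hypothetical */

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* Returns bandwidth in Mbytes/sec for one message size. The initiator sends
     * first and waits for the echo; the partner mirrors the order. */
    double pingpong_bw(void *buf, size_t len, int iters, int initiator)
    {
        double t0 = now_sec();
        for (int i = 0; i < iters; i++) {
            if (initiator) { comm_send(buf, len); comm_recv(buf, len); }
            else           { comm_recv(buf, len); comm_send(buf, len); }
        }
        double rtt = (now_sec() - t0) / iters;   /* average round-trip time   */
        return (double)len / (rtt / 2.0) / 1e6;  /* one-way Mbytes per second */
    }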

Figure 8. Bandwidth by ping-pong between two nodes (user level). Axes: data size (bytes) vs. bandwidth (Mbytes/sec); series: CPU-to-CPU, CPU-to-GPU, GPU-to-GPU.

Figure 9. GPU-to-GPU latency and bandwidth between two nodes in comparison with CUDA p2p copy and MVAPICH2. Two panels: latency (usec) vs. data size (bytes) and bandwidth (Mbytes/sec) vs. data size (bytes); series: CUDA (no-UVA), CUDA (UVA), MVAPICH2 (2 nodes), MVAPICH2 (IPC=1), MVAPICH2 (IPC=0), PEACH2.