28
11 September 2018 Yuichiro Ajima, Takahiro Kawashima, Takayuki Okamoto, Naoyuki Shida, Kouichi Hirai, Toshiyuki Shimizu, Shinya Hiramoto, Yoshiro Ikeda, Takahide Yoshikawa, Kenji Uchida, Tomohiro Inoue Fujitsu Limited The Tofu Interconnect D Copyright 2018 FUJITSU LIMITED 0 September 11th, 2018, IEEE Cluster 2018

The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

11 September 2018

Yuichiro Ajima, Takahiro Kawashima, Takayuki Okamoto,

Naoyuki Shida, Kouichi Hirai, Toshiyuki Shimizu, Shinya Hiramoto,

Yoshiro Ikeda, Takahide Yoshikawa, Kenji Uchida, Tomohiro Inoue

Fujitsu Limited

The Tofu Interconnect D

Copyright 2018 FUJITSU LIMITED0September 11th, 2018, IEEE Cluster 2018

Page 2: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Introduction

The Tofu interconnect (Tofu1) for the K computer

Highly-scalable and fault-tolerant 6D mesh/torus network

The Tofu interconnect 2 (Tofu2) for FX100 machines

The Tofu Interconnect D (TofuD) for the post-K machine

High “density” of node: integrate more resources into a smaller node

Fault resilient of network: “dynamic” packet slicing for packet transfer

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

2009 2012 20152003 2021

InfiniBandDTU

K computer

FX10 FX100FX1HPC2500

Post-K

Tofu2Tofu1 TofuD

1

Page 3: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

6D Mesh/Torus Network

Virtual 3D-Torus Rank-mapping

Implementations

Communication Functions

Tofu Barrier

Networks of Recent World-class Systems

Features of the Tofu interconnect family

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018 2

Page 4: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

6D Mesh/Torus Network

Six coordinate axes: X, Y, Z, A, B, C

X, Y, Z: the size varies according to the system configuration

A, B, C: the size is fixed to 2×3×2

Tofu stands for “torus fusion”: (X, Y, Z)×(A, B, C)

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

Z

XC

A

B

X×Y×Z×2×3×2

Y

3

Page 5: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Virtual 3D-Torus Rank-mapping

A rank-mapping option for topology-awareness

A 3D torus rank can be mapped to a 6D submesh even if there is an unavailable node

This fault tolerance contributes to the system availability

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

6D

submesh

0

1110 1

89 2

73 5

64

B Y0

7

1

6

2

5

3

4A

X

07

16

25

34

C

Z

4

Page 6: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Copyright 2018 FUJITSU LIMITED

Tofu1: implemented as an interconnect controller (ICC) chip

4 Tofu network interfaces (TNIs) and 80 lanes of signals for the network

Tofu2: integrated into a processor chip

The number of signal lanes for the network decreased to 40

Considering the balance with 128 signal lanes for memory

Implementations

September 11th, 2018, IEEE Cluster 2018

PCIe

TNI0

TNI1

TNI2

TNI3

To

fu N

etw

ork

Rou

terc c c c c c c c

c c c c c c c c

c

c c c c c c c c

c c c c c c c c

c

HMC HMC HMC HMC

HMC HMC HMC HMC

4 lan

es ×

10 p

ort

s

SPARC64 XIfxC M G

C M G Tofu2

PCIe

TNI0

TNI1

TNI2

TNI3

Tofu

Netw

ork

Route

r

c8 lanes ×

10 p

ort

s

ICC

Tofu1

SPARC64 VIIIfx

DIMMDIMMDIMMDIMMDIMMDIMMDIMMDIMM

c c

c c c

c c

5

Page 7: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Copyright 2018 FUJITSU LIMITED

Remote direct memory access (RDMA)

Directly accesses process memory on remote node

RDMA Put transfers data to remote process memory

RDMA Get transfers data from remote process memory

RDMA Atomic modifies a shared variable in remote process memory

Low latency features

Direct Descriptor: feeds communication commands from CPU registers

Cache Injection (since Tofu2): places received data into a CPU cache

Tofu Barrier

Offload engine for collective communications such as synchronization

Communication Functions of TNI

September 11th, 2018, IEEE Cluster 2018 6

Page 8: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Tofu Barrier (1)

Barrier gate (BG) is a hard-wired communication engine

Waits for two signals from other BGs and transmits two signals

Barrier channel (BCH) is an interface of Tofu barrier

Each BCH is fixedly bound to a start-and-end point BG

Tofu barrier can execute an arbitrary communication algorithm

Recursive-doubling algorithm uses log2(n) of BGs in each process

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

Process 0

Process 1

Process 2

Process 3

Process 4

Process 5

Process 6

Process 7

Process 8

Process 9

Barrier Channel

Barrier GateRecursive Doubling7

Page 9: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Tofu Barrier (2)

Reduce-broadcast algorithm uses a maximum of 5 BGs in each process

In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier

8 BCHs and 64 BGs per node

Up to 8 communicators per node can use Tofu barrier simultaneously

Intra-node synchronization is recommended to be performed using software to reduce consumption of BCH and BG

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

Process 0

Process 1

Process 2

Process 3

Process 4

Process 5

Process 6

Process 7

Process 8

Process 9Barrier Channel

Barrier GateReduce-Broadcast Tree

8

Page 10: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Networks of Recent World-class Systems

All systems have the same order of bisection bandwidth

No significant performance difference in global data exchange

Torus networks have higher total injection bandwidth

Topology-aware communication such as nearest-neighbor data exchange results in higher performance

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

System Network Total Injection

Bandwidth (PB/s)

Bisection

Bandwidth (TB/s)

Blue Gene/Q Torus (5D) 1.97 49

K Computer Mesh/Torus (6D)

Virtual Torus (3D)

1.66 46

34

Sunway TaihuLight Tapered Fat-Tree 0.51 70

Piz Daint Dragonfly 0.07 36

Summit Fat-Tree 0.12 115

Oakforest-PACS Fat-Tree 0.10 102

40X

36X

7.3X

2.0X

1.0X

1.0X

9

Page 11: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Higher-density Node Configuration

Link Configuration and Injection Bandwidth

Packaging

Dynamic Packet Slicing

Increased Tofu Barrier Resources

The Design of TofuD

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018 10

Page 12: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Higher-density Node Configuration

The CPU is smaller and the off-chip channels are halved

The number of 3D-stacked memories was halved from 8 to 4

Each Tofu link was reduced from 4 lanes to 2 lanes

More resources are integrated into the CPU

The number of CPU Memory Groups (NUMA nodes) doubled from 2 to 4

The number of Tofu Network Interfaces increased from 4 to 6

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

PCIe

TNI0

TNI1

TNI2

TNI3

To

fu N

etw

ork

Rou

terc c c c c c c c

c c c c c c c c

c

c c c c c c c c

c c c c c c c c

c

HMC HMC HMC HMC

HMC HMC HMC HMC

4 lan

es ×

10 p

ort

s

SPARC64 XIfxC M G

C M G Tofu2

PCIe

TNI0

NOC

cccc

c ccc

ccccc

cccc

cccc

ccc

cc

HBM2

cccc

c ccc

cc

ccc

HBM2

HBM2

cccc

cccc

ccc

cc

HBM2

TNI1

TNI2

TNI3

TNI4

TNI5

lan

es ×

10

po

rts

2la

ne

s ×

10

po

rts

To

fu N

etw

ork

Rou

ter

A64FXC M G C M G

C M G C M G TofuD

11

Page 13: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Copyright 2018 FUJITSU LIMITED

Data transfer rate increased from 25 Gbps to 28 Gbps

Link bandwidth reduced from 12.5 GB/s to 6.8 GB/s

TofuD simultaneously transmits in 6 directions

Increased from 4 directions in the case of Tofu1 and Tofu2

Total injection bandwidth per node is 40.8 GB/s

Approximately, twice that of Tofu1 or 80% that of Tofu2

Link Configuration and Injection Bandwidth

September 11th, 2018, IEEE Cluster 2018

Tofu1 Tofu2 TofuD

Data rate (Gbps) 6.25 25.78125 28.05

Number of signal lanes per link 8 4 2

Link bandwidth (GB/s) 5.0 12.5 6.8

Number of TNIs per node 4 4 6

Injection bandwidth per node (GB/s) 20 50 40.8

12

Page 14: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Packaging – CPU Memory Unit of Post-K

Two CPUs connected with C-axis

X×Y×Z×A×B×C = 1×1×1×1×1×2

Two or three active optical cable (AOC) cages on the board

Each cable bundles two lanes of signals from each of the two CPUs

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

CPU

CPU

AOC (X)

AOC (Y)

AOC (Z)

AOC

AOC

13

Page 15: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Packaging – Rack Structure of Post-K

Rack

8 shelves

192 CMUs or 384 CPUs

Shelf

24 CMUs or 48 CPUs

X×Y×Z×A×B×C = 1×1×4×2×3×2

Top or bottom half of rack

4 shelves

X×Y×Z×A×B×C = 2×2×4×2×3×2

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

Rack

Shelves

14

Page 16: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Dynamic Packet Slicing – Split Mode

An upper layer in TofuD slices packets for each signal lane

Each signal lane of TofuD has an independent physical layer

In the ordinary multi-lane transmission, the physical layer has media-independent interface and hides the number of signal lanes

For virtual cut-through packet transfer, the routing header is copied to both slices of the packet

This normal operation mode is called split mode

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

Slice 0

Slice 1

Packet Routing

Header

Slice 0

Slice 1

15

Page 17: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Dynamic Packet Slicing – Duplicate Mode

The upper layer duplicates packets when the error rate is high

This fall-down mode is called duplicate mode

The link can recover to the split mode

Each lane is never disconnected independently

The error rates of both lanes are continuously monitored and fed back

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

Packet

Packet

Packet Routing

Header

Slice 0

Slice 1

Error rate feedback

16

Page 18: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Increased Tofu Barrier Resources

The number of resources significantly increased

All 6 TNIs of TofuD have Tofu barrier

This change intended to support synchronization between CMGs

The ratio of the BCHs to BGs was increased from 1:8 to 1:3

Assuming an increase in the usage of the reduce-broadcast tree algorithm

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

Tofu1 Tofu2 TofuD

Node

Number of BCHs 8 8 96

Number of BGs 64 64 288

Number of TNIs 4 4 6

Number of CMGs 1 2 4

TNINumber of BCHs 8 8 16

Number of BGs 64 64 48

17

Page 19: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Evaluation Environment

Put Latencies

Latency Breakdown

Put Throughputs

Injection Rates

Tofu Barrier

Performance Evaluations

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018 18

Page 20: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Evaluation Environment

TofuD

Evaluated by hardware emulators using the production RTL codes

• The simulation models were system-level and included multiple nodes

• Simulated processors executed test programs

• The test programs directly accessed the TofuD hardware

Results were measured from simulation waveforms

Tofu1 and Tofu2

Evaluations used real machines

• Real processors executed test programs

• The test programs used low-level communication libraries

Results were measured using the processor’s cycle counter

Only latency breakdowns were obtained from simulation waveforms

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018 19

Page 21: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Put Latencies

8B Put transfer between nodes on the same board

The low-latency features were used

Tofu2 reduced the Put latency by 0.20 μs from that of Tofu1

The cache injection feature contributed to this reduction

TofuD reduced the Put latency by 0.22 μs from that of Tofu2

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

Communication settings Latency

Tofu1 Descriptor on main memory 1.15 µs

Direct Descriptor 0.91 µs

Tofu2 Cache injection OFF 0.87 µs

Cache injection ON 0.71 µs

TofuD To/From far CMGs 0.54 µs

To/From near CMGs 0.49 µs

0.20 µs

0.22 µs

20

Page 22: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Latency Breakdown

The overhead increase in Tofu2 has been reducedCopyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

0

100

200

300

400

500

600

700

800

900

1000

Tofu1 Tofu2 TofuD

La

ten

cy (

ns

ec

)

Rx CPU

Rx Host bus

Rx TNI

Packet Transfer

Tx TNI

Tx Host bus

Tx CPU

Cache injection

Tx Optimization

Rx Optimization

Increased Overhead in Physical Layer

Overhead Reduced

21

Page 23: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Put Throughputs

One-way Put transfer between nodes on the same board

Measured saturated throughput values at message sizes over 1 MiB

All measured efficiencies exceed 90%

The Tofu interconnect family has high bandwidth efficiency

The maximum packet size is large enough to encapsulate an IP packet

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

Put throughput Efficiency

Tofu1 4.76 GB/s 95 %

Tofu2 11.46 GB/s 92 %

TofuD 6.35 GB/s 93 %

22

Page 24: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Injection Rates per Node

Simultaneous Put transfers to multiple nearest-neighbor nodes

Tofu1 and Tofu2 used 4 TNIs, and TofuD used 6 TNIs

The injection rate of TofuD was approximately 83% that of Tofu2

The efficiencies of Tofu1 were lower than 90%

Because of a bottleneck in the bus that connects CPU and ICC

The efficiencies of Tofu2 and TofuD exceeded 90 %

Integration into the processor chip removed the bottleneck

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

Injection rate Efficiency

Tofu1 (K) 15.0 GB/s 77 %

Tofu1 (FX10) 17.6 GB/s 88 %

Tofu2 45.8 GB/s 92 %

TofuD 38.1 GB/s 93 %

23

Page 25: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Tofu Barrier – Test Programs

The test program synchronized multiple BCHs in a node

Executed in one processor core to simplify the waveform analysis

The test programs used the following algorithms;

The reduce-broadcast tree algorithm for intra-TNI synchronization

The recursive doubling algorithm for inter-TNI synchronization

Simple estimations were also calculated

Accumulated logic circuit delays of BCH (0.48 μs) and BG (0.13 μs)

Considered only the number of communication stagesCopyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

Number of BCHs 1 4 8 16 48

Number of used TNIs 1 4 6 6 6

Max. number of BCHs per TNI 1 1 2 3 8

Max. number of BGs per TNI 2 2 5 9 24

Number of communication stages 2 2 4 6 9

24

Page 26: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Tofu Barrier – Results

The simple estimate results were too low

Missing consideration of the serialization of BCHs/BGs processing

The modified estimates were consistent with the evaluation results

BCHs need to be allocated in a round-robin manner to avoid sharing a TNI

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018

0.0

1.0

2.0

3.0

4.0

5.0

1 4 16 64

Late

ncy (

μse

c)

Number of BCHs per node

Estimate considering serialization

Evaluation Result

Simple Estimate

un

de

restim

ate

25

Page 27: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Summary

TofuD is developed for the post-K machine

TofuD is designed to achieve high-density nodes and enhanced resilience with dynamic packet slicing

The design of TofuD

Node and link configurations

CMU and rack packaging

Dynamic Packet Slicing

Increased Tofu Barrier Resources

The evaluation results of TofuD

Latency was 0.49 μs, which was reduced by 0.22 μs from that of Tofu2

Throughput was 6.35 GB/s and the efficiency exceeded 90%

Injection rate was 38.1 GB/s, which was approximately 83% that of Tofu2

The evaluation results showed that it is necessary to allocate BCHs without sharing a TNI

Copyright 2018 FUJITSU LIMITEDSeptember 11th, 2018, IEEE Cluster 2018 26

Page 28: The Tofu Interconnect Dတတတတတတတတ - Fujitsu...Reduce-broadcast algorithm uses a maximum of 5 BGs in each process In Tofu1 and Tofu2, only TNI #0 had a Tofu barrier 8

Copyright 2018 FUJITSU LIMITED