Appro Supercomputer Solutions
Steven Lyness, VP HPC Solutions Engineering
Appro and Tsukuba University Accelerator Cluster Collaboration
Company Overview Appro Celebrates 20 Years of HPC Success….
About Appro Over 20 Years of Experience
Moving Forward….
1991 – 2000: OEM Server Manufacturer
2001 – 2007: Branded Servers & Cluster Solutions Manufacturer
2007 – 2012: End-to-End Supercomputer Solutions
• Over 2 PFLOPS (peak) across just five Top100 systems added to the Top500 in November
• Variety of technologies: −Intel, AMD, NVIDIA
−Multiple server form factors
−Infiniband and GigE
−Fat Tree and 3D Torus
• Excellent Linpack efficiency on non-optimized Sandy Bridge (SB) systems
−85.5% Fat Tree
−83% - 85% 3D Torus
Appro on Top 500
Appro Milestones: Installations in 2012
Site / Peak Performance:
− Los Alamos (LANL): > 1.8 PFLOPS
− Sandia (SNL): > 1.2 PFLOPS
− Livermore (LLNL): > 1.5 PFLOPS
− Japan (Tsukuba, Kyoto): > 1 PFLOPS
• HA-PACS (Highly Accelerated Parallel Advanced system for
Computational Sciences)
Apr. 2011 – Mar. 2014, 3-year project
Project Leader: Prof. M. Sato (Director, CCS, Univ. of Tsukuba)
• Develop next generation GPU system : 15 members
Project Office for Exascale Computing System Development
(Leader: Prof. T. Boku)
GPU cluster based on Tightly Coupled Accelerators architecture
• Develop large scale GPU applications : 15 members
Project Office for Exascale Computational Sciences
(Leader: Prof. M. Umemura)
Elementary Particle Physics, Astrophysics, Bioscience, Nuclear/Quantum
Physics, Global Environmental Science, High Performance Computing
About University of Tsukuba HA-PACS Project
:: Problem Definition
University of Tsukuba - HA-PACS Project
• Many technology discussions to determine the KEY requirements:
Fixed budget
High Availability
Latest Processor / High Flops
1:2 CPU:Accelerator Ratio
High Bandwidth to the Accelerator
High bandwidth, low latency interconnect
Apps Could take advantage of “more than QDR IB”
High IO Bandwidth to storage
“Easy to Manage”
2012 GTC Conference
Solution Keys
Fixed Budget Considerations
Need to find a balance between:
Performance - Flops, bandwidth (memory, IO)
Capacity (CPU Qty, GPU Qty, Memory per core, IO, Storage)
Availability Features
Ease of Management / Supportability
Architecture needed: High Availability
Nodes (PS, Fans)
IPC networks (Ex. InfiniBand)
Service Networks (Provisioning and Management)
What Appro Brings to NWS
Challenge: Create a Solution with High Availability
− Redundant power supplies
− Redundant hot swap fan trays
− Redundant Hot swap disk drives
− Redundant Networks
Solution: Appro Xtreme-X™ Supercomputer, flagship product line using the GreenBlade™ sub-rack component used for the DoE TLCC2 project
Expand to add support for new custom blade nodes
Meeting Key Requirements
:: Appro Xtreme-X™ Supercomputer
Solution Architecture
Unified scalable cluster architecture that can
be provisioned and managed as a stand-alone
supercomputer.
Improved power & cooling efficiency to
dramatically lower total cost of ownership
Offers high performance and high availability
features with lower latency and higher
bandwidth.
Appro HPC Software Stack - Complete HPC
Cluster Software tools combined with the
Appro Cluster Engine™ (ACE) Management
Software including the following capabilities:
System Management
Network Management
Server Management
Cluster Management
Storage Management
Optimal Performance
Meeting Key Requirements
Peak Performance CPU Contribution
Sandy Bridge-EP 2.6 GHz E5-2670 Processor (332 GFlops per node)
GPU Contribution
665 GFlops per NVIDIA M2090
Four (4) M2090s per node, or 2.66 TFlops per node
Combined peak performance is 3 TFlops per node
Two hundred sixty-eight (268) nodes provide 802 TFlops
Accelerator Performance DEDICATED PCI-e Gen3 X16 for each NVIDIA GPU
Uses Gen2 so we have up to 8 GB/s per GPU available
IO Performance: 2 x QDR (Mellanox CX3) - up to 4 GB/s per link (on a PCI-e Gen3 X8 bus)
GigE for Operations networks
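As a sanity check, the peak numbers above can be reproduced with a few lines of arithmetic (figures are the peak values quoted on this slide, not sustained performance):

```python
# Peak-performance arithmetic for one HA-PACS node and the full system,
# using the per-component numbers quoted on this slide.
CPU_NODE_GFLOPS = 332.8          # 2x Sandy Bridge-EP E5-2670 @ 2.6 GHz
GPU_GFLOPS = 665.0               # NVIDIA M2090, double precision
GPUS_PER_NODE = 4
NODES = 268

node_peak = CPU_NODE_GFLOPS + GPUS_PER_NODE * GPU_GFLOPS   # GFLOPS
system_peak_tflops = NODES * node_peak / 1000.0

print(f"node peak:   {node_peak:.1f} GFLOPS (~3 TFLOPS)")
print(f"system peak: {system_peak_tflops:.0f} TFLOPS")     # ~802 TFLOPS
```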
Up to 4x 2P GB812X blades
− Expandability for HDD, SSD, GPU, MIC
Six Cooling Fan Units
− Hot swappable & redundant
Up to six 1600W power supplies
− Platinum-rated; 95%+ efficient
− Hot swappable & redundant
Supports one or two (redundant) iSCB platform-manager modules with
enhanced management capabilities
− Active & dynamic fan control
− Power monitoring
− Remote power control
− Integrated console server
Appro GreenBlade™ Sub-Rack With Accelerator Expansion Blades
Appro Confidential and Proprietary
Appro GreenBlade™ Subrack
• Server Board
−Increased memory footprint (2 DPC)
−Provides access to two (2) PCI-e Gen3 X16 PER SOCKET
• Provides for increased IO capability
−QDR or FDR InfiniBand on the motherboard
−Internal RAID Adapter on Gen3 bus
• Up to two (2) 2.5” Hard drives
NOTE: Can run diskless/stateless thanks to the Appro Cluster Engine, but local scratch was needed
iSCB Modules
Challenge: Create a server node with
− Latest Generation of processors: Need for flops AND IO capacity
− HIGH bandwidth to the Accelerators
− High Memory capacity
Solution: High Bandwidth Intel Sandy Bridge-EP for CPU and the NVIDIA Tesla for GPU
Worked with Intel® EPSD early on to design a motherboard
− Washington Pass (S2600WP) Motherboard with:
Dual Sandy Bridge-EP (E5-2600 series) sockets
Expose four (4) PCI-e Gen3 X16 for Accelerator Connectivity
Expose one (1) PCI-e Gen3 X8 for Expansion slot/IO
Two (2) DIMMS Per channel (16 DIMMS total)
− 2U form factor for fit and air flow/cooling
Server Node Design
Meeting Key Requirements
[Block diagram: two Sandy Bridge-EP sockets linked by QPI, each with 4 channels of 1,600 MHz DDR3 (51.2 GB/sec per socket); four PCI-e Gen3 x16 links to 4 x NVIDIA M2090; PCI-e Gen3 x8 to dual QDR IB; Patsburg PCH (DMI/ESI) with dual GbE, BMC and BIOS.]
Intel® EPSD S2600WP Motherboard
Meeting Key Requirements
GreenBlade Node Design
[Node connections: QDR InfiniBand ports 0 and 1; GigE cluster management / operations networks (primary and secondary); HDD0 and HDD1.]
:: Network Availability
Meeting Key Requirements
Challenge: Provide cost-effective redundant networks to eliminate/reduce failures (improve MTTI)
Solution: − Build the system with redundant operations Ethernet networks
Redundant on-board GigE each with access to IPMI
Redundant iSCB Modules for baseboard management, node control and monitoring
− Build system with redundant InfiniBand networks
DUAL QDR for price/performance
Selected Mellanox due to Gen3 X8 support (dual port adapter)
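A rough calculation shows why Gen3 x8 support matters for a dual-port QDR adapter. The coding efficiencies below are the standard PCIe/InfiniBand line codes; real links carry further protocol overhead, so these are upper bounds:

```python
# Back-of-envelope PCIe vs. InfiniBand data rates (GB/s), counting only
# line-coding overhead; all figures are approximate upper bounds.
def link_gbs(gtransfers_per_s, coding_efficiency, lanes):
    # GT/s per lane x coding efficiency x lanes, divided by 8 bits/byte
    return gtransfers_per_s * coding_efficiency * lanes / 8

gen2_x8 = link_gbs(5.0, 8 / 10, 8)        # PCIe Gen2 x8: 4.0 GB/s
gen3_x8 = link_gbs(8.0, 128 / 130, 8)     # PCIe Gen3 x8: ~7.88 GB/s
qdr_port = link_gbs(10.0, 8 / 10, 4)      # one QDR 4X port: 4.0 GB/s
dual_qdr = 2 * qdr_port                   # both ports active: 8.0 GB/s

# A Gen2 x8 slot would cap the dual-port HCA at one port's worth of
# bandwidth; Gen3 x8 can feed both ports at close to full rate.
print(gen2_x8, gen3_x8, dual_qdr)
```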
:: Operations Networking
Meeting Key Requirements
[Topology diagram: management and login nodes connect through a 10GigE switch to the external network; each group of three racks (Rack 1 … Rack N) has a sub-management node (GreenBlade™ GB812X) and 48-port leaf switches feeding the compute nodes over GbE.]
:: Ease of Use
Meeting Key Requirements
Challenge • Need the system to install quickly to get into production
• Most sites have limited people resources
• Need to keep the system running and doing science
Solution • Appro HPC Software Stack
− Tested and Validated
− Full stack from HW layer to Application layer
− Allows for quick bring up of a cluster
Appro HPC Software Stack
User Applications
Compilers: Intel® Cluster Studio, PGI (PGI CDK), GNU, PathScale
Message Passing: MVAPICH2, OpenMPI, Intel® MPI (Intel® Cluster Studio)
Job Scheduling: Grid Engine, PBS Pro, SLURM
Performance Monitoring: HPCC, IOR, netperf, Perfctr, PAPI/IPM
Cluster Monitoring: ACE™ (iSCB and OpenIPMI)
Console Mgmt: ACE™, ConMan
Remote Power Mgmt: PowerMan
Provisioning: ACE™
Virtual Clusters: Appro Cluster Engine (ACE™)
OS: Linux (Red Hat, CentOS, SuSE)
Storage / File Systems: NFS 3.x, Lustre, PanFS, Local FS (ext3, ext4, XFS)
Appro Xtreme-X™ Supercomputer – Building Blocks
Appro Turn-Key Integration & Delivery Services: HW and SW integration, pre-acceptance testing, dismantle, packing and shipping
Appro HPC Professional Services: On-site installation services and/or customized services
:: Summary
Appro Key Advantages
• Partnering with Key technology partners to offer cutting-edge
integrated solutions:
− Performance
Storage IOR
Networking Bandwidth, latencies and message rates
− Features
High Availability (high MTBF, redundant power supplies)
Ease of Management
− Flexibility
− Price /Performance
− Training Programs
Pre-Sales (Sell everything it does and ONLY that)
Installation and Tuning
Post Install Support
Appro Corporate Presentation
Turn-Key Solution Summary
Appro Cluster Engine™ (ACE) Management Software Suite
Capability Computing
Hybrid Computing
Capacity Computing
Data Intensive
Computing
Appro Xtreme-X™ Supercomputer addressing 4 HPC Workload Configurations
Appro HPC Software Stack
Turn-Key Integration & Delivery Services
- Node, rack, switch, interconnect, cable, network, storage, software, burn-in - Pre-acceptance testing, performance validation, dismantle, packing and shipping
Appro HPC Professional Services - On-site Installation services and/or Customized services
Appro Xtreme-X™ Supercomputer
Appro Supercomputer Solutions
Questions?
Steve Lyness, VP HPC Solutions Engineering
Ask Now or see us at Table #54
Learn More at www.appro.com
Taisuke Boku
Center for Computational Sciences
University of Tsukuba [email protected]
HA-PACS Next Step for Scientific Frontier
by Accelerated Computing
2012/05/15
GTC2012, San Jose
Project plan of HA-PACS
HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences): accelerating critical problems in various scientific fields at the Center for Computational Sciences, University of Tsukuba
− The target application fields will be partially limited
− Current target: QCD, Astro, QM/MM (quantum mechanics / molecular mechanics, for life science)
Two parts − HA-PACS base cluster:
for development of GPU-accelerated code for target fields, and performing product-run of them
− HA-PACS/TCA: (TCA = Tightly Coupled Accelerators)
for elementary research on new technology for accelerated computing
Our original communication system based on PCI-Express named “PEARL”, and a prototype communication chip named “PEACH2”
GPU Computing: current trend of HPC
GPU clusters in TOP500 on Nov. 2011 − 2nd Tianhe-1A (天河) (Rpeak=4.70 PFLOPS)
− 4th Nebulae (星雲) (Rpeak=2.98 PFLOPS)
− 5th TSUBAME2.0 (Rpeak=2.29 PFLOPS)
− (1st K Computer Rpeak=11.28 PFLOPS)
Features − high peak performance / cost ratio
− high peak performance / power ratio
− large-scale applications with GPU acceleration don't yet run in production on GPU clusters ⇒ our first target is to develop large-scale applications accelerated by GPUs in real computational sciences
Problems of GPU Cluster
Problems of GPGPU for HPC − Data I/O performance limitation
Ex) GPGPU: PCIe gen2 x16
Peak Performance: 8GB/s (I/O) ⇔ 665 GFLOPS (NVIDIA M2090)
− Memory size limitation Ex) M2090: 6GByte vs CPU: 4 – 128
GByte
− Communication between accelerators: no direct path (external) ⇒ communication latency via CPU becomes large
Ex) GPGPU: GPU mem ⇒ CPU mem ⇒ (MPI) ⇒ CPU mem ⇒ GPU mem
Research on direct communication between GPUs is required
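The staged path above can be illustrated with a simple, hypothetical cost model. The latencies and bandwidths below are assumed round numbers for illustration, not measurements from HA-PACS:

```python
# Illustrative cost model (not measured data) for the staged GPU-to-GPU
# path described above: GPU mem -> CPU mem -> (MPI) -> CPU mem -> GPU mem.
# Each hop costs a fixed latency plus size/bandwidth.
def hop_time(size_gb, latency_s, bw_gbs):
    return latency_s + size_gb / bw_gbs

def staged_transfer(size_gb):
    d2h = hop_time(size_gb, 10e-6, 6.0)   # device->host copy (assumed numbers)
    mpi = hop_time(size_gb, 2e-6, 4.0)    # MPI over one QDR rail (assumed)
    h2d = hop_time(size_gb, 10e-6, 6.0)   # host->device copy (assumed)
    return d2h + mpi + h2d

def direct_transfer(size_gb):
    return hop_time(size_gb, 2e-6, 4.0)   # hypothetical direct GPU-GPU link

# Small messages are dominated by the extra per-hop latencies, which is
# why direct GPU-to-GPU communication is needed for strong scaling.
for size in (1e-6, 1e-3):                 # 1 KB and 1 MB, expressed in GB
    print(size, staged_transfer(size), direct_transfer(size))
```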
Another target is to develop a direct communication system between external GPUs as a feasibility study for future accelerated computing
Project Formation
HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)
Apr. 2011 – Mar. 2014, 3-year project
Project Leader: Prof. M. Sato (Director, CCS, Univ. of Tsukuba)
Develop next generation GPU system : 15 members
Project Office for Exascale Computing System Development (Leader: Prof. T. Boku)
GPU cluster based on Tightly Coupled Accelerators architecture
Develop large scale GPU applications : 15 members
Project Office for Exascale Computational Sciences (Leader: Prof. M. Umemura)
Elementary Particle Physics, Astrophysics, Bioscience, Nuclear/Quantum Physics, Global Environmental Science, High Performance Computing
HA-PACS base cluster (Feb. 2012)
HA-PACS base cluster
Front view
Side view
HA-PACS base cluster
Rear view of one blade chassis with 4 blades
Front view of 3 blade chassis
Rear view of Infiniband switch and cables (yellow=fibre, black=copper)
HA-PACS: base cluster (computation node)
Computation node totals: 3 TFLOPS peak per node
− CPU: 20.8 GFLOPS x 16 cores = 332.8 GFLOPS (AVX, 2.6 GHz x 8 flop/clock)
− GPU: 665 GFLOPS x 4 = 2,660 GFLOPS
− CPU memory: (16 GB, 12.8 GB/s) x 8 = 128 GB, 102.4 GB/s
− GPU memory: (6 GB, 177 GB/s) x 4 = 24 GB, 708 GB/s
− PCIe to each GPU: 8 GB/s
Intel Xeon E5 (SandyBridge-EP) x 2
− 8 cores/socket (16 cores/node) with 2.6 GHz
− AVX (256-bit SIMD) on each core ⇒ peak perf./core = 2.6 GHz x 4 (SIMD) x 2 (add+mul) = 20.8 GFLOPS ⇒ peak perf./socket = 166.4 GFLOPS ⇒ peak perf./node = 332.8 GFLOPS
− Each socket supports up to 40 lanes of PCIe gen3 ⇒ great performance to connect multiple GPUs without an I/O performance bottleneck ⇒ the current NVIDIA M2090 supports just PCIe gen2, but the next generation (Kepler) will support PCIe gen3
− Four M2090s can be connected to two Sandy Bridge-EPs while still leaving two PCIe gen3 x8 links ⇒ InfiniBand QDR x 2
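The per-socket figure follows from clock rate x SIMD width x issue width x core count; a minimal check:

```python
# Sandy Bridge-EP AVX peak: 256-bit SIMD holds 4 doubles, and separate
# add and multiply ports give 8 DP flops per clock per core.
GHZ = 2.6
SIMD_DP = 4          # 256-bit AVX / 64-bit double
ISSUE = 2            # one add + one multiply per cycle
CORES = 8

per_core = GHZ * SIMD_DP * ISSUE          # 20.8 GFLOPS
per_socket = per_core * CORES             # 166.4 GFLOPS
per_node = per_socket * 2                 # 332.8 GFLOPS
print(per_core, per_socket, per_node)
```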
HA-PACS: base cluster unit (CPU)
HA-PACS: base cluster unit (GPU)
NVIDIA M2090 x 4
− Number of processor core: 512
− Processor core clock: 1.3 GHz
− DP 665 GFLOPS, SP 1331GFLOPS
− PCI Express gen2 ×16 system interface
− Board power dissipation: <= 225 W
− Memory clock: 1.85 GHz, size: 6GB with ECC, 177GB/s
− Shared/L1 Cache: 64KB, L2 Cache: 768KB
HA-PACS: base cluster unit (blade node)
[Blade node, front and rear views: 2x 2.6 GHz 8-core Sandy Bridge-EP, 4x NVIDIA Tesla M2090, 1x PCIe slot for HCA, 2x 2.5" HDD; the 8U enclosure holds 4 nodes, 3 hot-swappable PSUs and 6 hot-swappable fans.]
Basic performance data
MPI pingpong
− 6.4 GB/s (N1/2= 8KB)
− with dual rail Infiniband QDR (Mellanox ConnectX-3)
− actually FDR for HCA and QDR for switch
PCIe benchmark (Device -> Host memory copy), aggregated perf. for 4 GPUs simultaneously
− 24 GB/s (N1/2= 20KB)
− PCIe gen2 x16 x4, theoretical peak = 8 GB/s x4 = 32 GB/s
Stream (memory)
− 74.6 GB/s
− theoretical peak = 102.4 GB/s
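Expressed as fractions of their theoretical peaks, the measurements above all land at roughly 70-80% efficiency:

```python
# Measured vs. theoretical-peak bandwidth from the slide, as efficiency.
benchmarks = {
    "MPI pingpong (dual-rail QDR)":  (6.4, 8.0),
    "PCIe D->H, 4 GPUs aggregated":  (24.0, 32.0),
    "STREAM (node memory)":          (74.6, 102.4),
}
for name, (measured, peak) in benchmarks.items():
    print(f"{name}: {measured}/{peak} GB/s = {100 * measured / peak:.1f}%")
```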
PCIe Host:Device communication performance
Slower start on Host->Device compared with Device->Host
HA-PACS Application (1): Elementary Particle Physics
Multi-scale physics: investigate hierarchical properties via direct construction of nuclei in lattice QCD; GPU used to solve large sparse linear systems of equations
Finite temperature and density: phase analysis of QCD at finite temperature and density; GPU used to perform matrix-matrix products of dense matrices
[Figure: quark → proton/neutron → nucleus; expected QCD phase diagram]
HA-PACS Application (2): Astrophysics
(A) Collisional N-body Simulation: computations of the accelerations of particles and their time derivatives (jerks) are time-consuming. Direct (brute-force) calculations of accelerations and jerks are required to achieve the required numerical accuracy; they are computed on GPU.
− Globular clusters: fossil objects as a clue to investigate the primordial universe; formation of the most primordial objects, formed more than 10 billion years ago
− Massive black holes in galaxies: understanding their formation; numerical simulations of complicated gravitational interactions between stars and multiple black holes in galaxy centers
(B) Radiation Transfer: calculation of the physical effects of photons emitted by stars and galaxies on the surrounding matter. So far poorly investigated due to its huge computational cost, though it is of critical importance in the formation of stars and galaxies. Computations of the radiation intensity and the resulting chemical reactions based on ray-tracing methods can be highly accelerated on GPUs owing to their high concurrency.
− First stars and re-ionization of the universe: understanding the formation of the first stars and the subsequent re-ionization of the universe
− Accretion disks around black holes: study of the high-temperature regions around black holes
HA-PACS Application (3): Bioscience
DNA-protein complex (macroscale MD); reaction mechanisms (QM/MM-MD) with a QM region of > 100 atoms
GPU acceleration: direct Coulomb interactions (Gromacs, NAMD, Amber); 2-electron integrals
HA-PACS Application (4)
Other advanced researches on HPC Division in CCS
− XcalableMP-dev (XMP-dev): an easy and simple programming language to support distributed-memory & GPU-accelerated computing for large-scale computational sciences
− G8 NuFuSE (Nuclear Fusion Simulation for Exascale) project platform for porting Plasma Simulation Code with GPU technology
− Climate simulation especially for LES (Large Eddy Simulation) for cloud-level resolution on city-model size simulation
− Any other collaboration ...
HA-PACS: TCA (Tightly Coupled Accelerator)
TCA: Tightly Coupled Accelerator
− Direct connection between accelerators (GPUs)
− Using PCIe as a communication link between accelerators
Most acceleration devices and other I/O devices are connected by PCIe as PCIe end-points (slave devices)
An intelligent PCIe device logically enables an end-point device to communicate directly with other end-point devices
PEARL: PCI Express Adaptive and Reliable Link
− We already developed such PCIe device (PEACH, PCI Express Adaptive Communication Hub) on JST-CREST project “low power and dependable network for embedded system”
− It enables direct connection between nodes by PCIe Gen2 x4 link
⇒ Improving PEACH for HPC to realize TCA
PEACH
PEACH: PCI-Express Adaptive Communication Hub
An intelligent PCI-Express communication switch to use PCIe link directly for node-to-node interconnection
The edge of a PEACH PCIe link can be connected to any peripheral device, including GPUs
Prototype PEACH chip − 4-port PCI-E gen.2 with x4 lane / port
− PCI-E link edge control feature: “root complex” and “end points” are automatically switched (flipped) according to the connection handling
− Other fault-tolerant (reliability) function is implemented: “flip network link” to allow single link fault
In HA-PACS/TCA prototype development, we will enhance the current PEACH chip ⇒ PEACH2
HA-PACS/TCA (Tightly Coupled Accelerator)
Enhanced version of PEACH
⇒ PEACH2 − x4 lanes -> x8 lanes
− hardwired on main data path and PCIe interface fabric
[Diagram: two nodes, each with CPUs, memory, GPUs and an IB HCA on PCIe; PEACH2 in each node links the nodes' PCIe fabrics directly, bypassing the IB switch.]
True GPU-direct
− current GPU clusters require 3-hop communication (3-5 memory copies)
− for strong scaling, an inter-GPU direct communication protocol is needed for lower latency and higher throughput
Implementation of PEACH2: ASIC⇒FPGA
FPGA-based implementation − today's advanced FPGAs allow a PCIe
hub with multiple ports
− currently gen2 x8 lanes x 4 ports are available ⇒ gen3 will be available soon (?)
− easy modification and enhancement
− fits to standard (full-size) PCIe board
− internal multi-core general purpose CPU with programmability is available ⇒ easily split hardwired/firmware partitioning on certain level on control layer
Controlling PEACH2 for GPU communication protocol
− collaboration with NVIDIA for information sharing and discussion
− based on CUDA4.0 device to device direct memory copy protocol
HA-PACS/TCA Node Cluster (NC)
[Diagram: a Node Cluster of 16 nodes: 64 GPUs (G), 32 CPUs (C), one PEACH2 per node on a PEARL ring network, GPU communication over PCIe, one Infiniband link per node; CPU: Xeon E5, GPU: Kepler. Multiple Node Clusters are joined by an Infiniband network.]
4 NCs with 16 nodes, or 8 NCs with 8 nodes = 360 TFLOPS extension to the base cluster
− High-speed GPU-GPU communication by PEACH2 within an NC (PCI-E gen2 x8 = 5 GB/s/link)
− Infiniband QDR (x2) for NC-to-NC communication (4 GB/s/link)
PEARL/PEACH2 variation (1)
[Diagram: dual-socket node (QPI); a PCIe switch connects the 4 GPUs (Gen3 x16), the IB HCA (Gen3 x8) and PEACH2 (Gen2 x8) to the CPUs.]
Option 1:
− Performance of IB and PEARL can be compared evenly
− Additional latency from the PCIe switch
PEARL/PEACH2 variation (2)
[Diagram: alternative PCIe lane assignment for the 4 GPUs (Gen3 x16), IB HCA (Gen3 x8) and PEACH2 (Gen2 x8) on a dual-socket node.]
Option 2:
− Requires only 72 lanes in total
− Asymmetric connection among the 3 blocks of GPUs
PEACH2 prototype board for TCA
[Board photo: FPGA (Altera Stratix IV GX530); two external PCIe link connectors (one more on the daughter board); PCIe edge connector to the host server; daughter-board connector; power regulators for the FPGA.]
Summary
HA-PACS consists of two elements: the HA-PACS base cluster, for application development, and HA-PACS/TCA, for elementary study of advanced technology for direct communication among accelerating devices (GPUs)
The HA-PACS base cluster started operation in Feb. 2012 with 802 TFLOPS peak performance (Linpack results will come in June 2012; we also expect a good score on the Green500)
The prototype FPGA implementation of PEACH2 was finished in Mar. 2012 and will be enhanced into the final version over the following 6 months
HA-PACS/TCA, with at least 300 TFLOPS of additional performance, will be installed around Mar. 2013