Low Electronic ConsumptionTitle GPUスパコンTSUBAMEによるペタスケール物理シミュレーション Author 青木尊之 (東京工業大学・学術国際情報センター

Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology

GP GPUGP GPU

GPUスパコンTSUBAMEによるペタスケール物理シミュレーションGPUスパコンTSUBAMEによる

ペタスケール物理シミュレーション

1

東京工業大学学術国際情報センター

青木尊之


GP GPUGP GPU

2

What is GPU? What is GPU?

High PerformanceHigh Performance

InexpensiveInexpensive

Low Electronic ConsumptionLow Electronic Consumption

Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology

GP GPUGP GPU

MEXT MinisterAward(2012)

10TF

100TF

1PF

10PF

100PF

2007 2009 2011 2013

TSUBAME1.1 1.2

2.0

K‐Computer10.5 PF

2.5

４位 5位 5位 14位

7位 9位 14位

TSUBAME SupercomputerTSUBAME Supercomputer

25位 30位41位 64位

17PFSingle precision

Graph 500No.3 (2011)

TSUBAME 2.0

49.5TF

109.7TF163.2TF

2287.6TF

No.1 in Japan

Wire


GP GPUGP GPU

4

TOP 500 Ranking 2013 November

MIC Xeon Phi

GPU K20X

MIC Xeon Phi

GPU K20X

GPU K20X


GP GPUGP GPUGPU: Fermi core → Kepler coreGPU: Fermi core → Kepler core

5

TSUBAME 2.0NVIDIA Tesla M2050×4224

TSUBAME 2.5NVIDIA Tesla K20X×4224


GP GPUGP GPU

Compute Node(2 CPUs, 3 GPUs)

Performance: 4.08 TFLOPSMemory: 58.0GB(CPU)

+18 GB(GPU)

Rack (30 nodes)

Performance: 122 TFLOPSMemory: 2.28 TB

System (58 racks)1442 nodes: 2952 CPU sockets,

4264 GPUs

Performance: 224.7 TFLOPS (CPU) ※ Turbo boost5.562 PFLOPS (GPU)

Total: 5.7 PFLOPS (double precision)17.1 PFLOPS (single precision)

Memory: 116 TB

TSUBAME 2.5TSUBAME 2.5


GP GPUGP GPU

Green500 Ranking 2013 NovemberGreen500 Ranking 2013 November


GP GPUGP GPUCUDACUDA

(Compute Unified Device Architecture)

8

■ GPU Computing (GPGPU) framework

C/C++ , FORTRAN Language + GPGPU extension(Compiler)Runtime APILibraries, Documentation, sample programs

CUDA is available for NVIDIA GPUs

■ Currently CUDA 5.5 is released.


GP GPUGP GPUBandwidth BottlenecksBandwidth Bottlenecks

GPU

PCI Express 2.0 x1664 Gbps (8 GB/sec)

Memory

VRAM

DDR3-133332 GB/sec Tesla K20X

InfiniBand QDR32 Gbps (4 GB/sec)

1000BASE-T0.125 GB/sec

9

Memory Interface 384bit250 GB/sec

■ Memory Hierarchy■ Memory Hierarchy

CPU


GP GPUGP GPU

Domain decompositionDomain decomposition

10

x

2D decomposition

sub-domain #1

GPU

CPU(2) CPU → CPU

(1) GPU→CPU (3) CPU→GPUPCI Express (cudaMemcpy)

MPI

sub-domain #2


GP GPUGP GPU

Overlapping between Computation and Communication

Overlapping between Computation and Communication

SendBuffer MPI RecBufferSync

Stream 2

Stream 1

Asynchronous Data transfer Asynchronous Data transfer

11 Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology

GP GPUGP GPU

Japanese

Weather NewsJapanese

Weather News

気象庁（http://www.jma.go.jp/jma/index.html）


GP GPUGP GPUWeather PredictionWeather Prediction

Meso-scale Atmosphere Model:Cloud Resolving Non-hydrostatic modelCompressible equation taking consideration of sound waves.

Next Generation

a few kmMeso-scale

2000 kmTyphoon Tornado, Down burst

Heavy Rain

Collaboration: Japan Meteorological Agency


GP GPUGP GPUAtmosphere ModelAtmosphere Model

Dynamical Process:Full 3-D Navior-Stokes Equation

Physical Process:Cloud Physics, Moist, Solar Radiation, Condensation, Latent heat release, Chemical Process, Boundary Layer

“Parameterization” including sin, cos, exp, …in empirical rules.

Fgruuuu

)(21 P

t


GP GPUGP GPU

*J. Michalakes, and M. Vachharajani: GPU Acceleration of Numerical Weather Prediction. Parallel Processing Letters Vol. 18 No. 4. World Scientific. Dec. 2008. pp. 531—548

**John C. Linford, John Michalakes, Manish Vachharijani, and Adrian Sandu. Multi-core acceleration of chemical kinetics for simulation and prediction, proceedings of the 2009 ACM/IEEE conference on supercomputing (SC'09), ACM, 2009.

WRF GPU ComputingWRF GPU ComputingWRF (Weather Research and Forecast)

WSM5 (WRF Single Moment 5-tracer) Microphysics*Represents condensation, precipitation and thermodynamic effects of latent heat release

1 % of lines of code, 25 % of elapsed time ⇒ 20 x boost in microphysics (1.2 - 1.3 x overall improvement）

WRF-Chem** provides the capability to simulate chemistry and aerosols from cloud scales to regional ⇒ x 8.5 increase

15

GPU

Dynamics PhysicsAccelerator Approach

Initial condition output

CPU


GP GPUGP GPUFull-GPU Implementation: ASUCAFull-GPU Implementation: ASUCA

16

Full GPU Approach

GPU

Dynamics PhysicsInitial condition

output

CPU

J. Ishida, C. Muroi, K. Kawano, Y. Kitamura, Development of a new nonhydrostaticmodel “ASUCA” at JMA, CAS/JSC WGNE Reserch Activities in Atomospheric and Oceanic Modelling.


GP GPUGP GPU

17

Rewrite from Scratch

Original code at JMA

z,x,y (k,i,j)-ordering

FortranFortran

Changingarray order

C/C++C/C++

x,z,y (i,k,j)-ordering

CUDACUDA

GPU code

x,z,y (i,k,j)-ordering

Introducing many optimizations, overlapping the computation with the communication, kernel fuse, re-ordering kernel, . . .

1 Year by one Ph.D student

Entire Porting Fortran to CUDAEntire Porting Fortran to CUDA


GP GPUGP GPU

18

TSUBAME 2.0 (1 GPU)TSUBAME 2.0 (1 GPU)

50x speedup compared to 1 CPU core

12x speedup to compared to XeonX5670 1 socket


GP GPUGP GPU

19

TSUBAME 2.0 Weak ScalingTSUBAME 2.0 Weak Scaling

145.0 TflopsSingle precision

76.1 TflopsDoublele precision

Fermi core TeslaM2050

3990 GPU

■ SC’10 Technical PaperBest Student Award finalist


GP GPUGP GPU

20

ASUCA Typhoon Simulation5km-horizontal resolution 479×466×48 ASUCA Typhoon Simulation5km-horizontal resolution 479×466×48


GP GPUGP GPU

21

ASUCA Typhoon Simulation500m-horizontal resolution 4792×4696×48Using 437 GPUs

ASUCA Typhoon Simulation500m-horizontal resolution 4792×4696×48Using 437 GPUs


GP GPUGP GPU

Air flow in a wide area of Tokyo by using Lattice Boltzmann Method

Air flow in a wide area of Tokyo by using Lattice Boltzmann Method

22


GP GPUGP GPU

23

Lattice Boltzmann MethodLattice Boltzmann Method eq

iiiii ffftf

1e

uuueue 2

242 2

32931

cccwf iii

eqi

12

3

4

5

6

78

9

1112

1314

15

16

17

18

10

Collision step: Streaming step:

Strongly Memory Bound Problem:

relaxation time

: kinetic viscosity (SGS)


GP GPUGP GPU

LES (Large-Eddy Simulation)LES (Large-Eddy Simulation)

Molecular viscosity andEddy viscosity

Energy spectrum

SGSSGSGSGS

Relaxation time for LES model


GP GPUGP GPUSGS ModelSGS Model

Dynamic Smagorinsky model

Coherent-Structure Smagorinsky model

○ Simpleinaccurate for the flow with wall boundaryemperical tuning for the constant model coefficient

○ applicable to wall boundarycomplicated calculationaverage process over the wide area→ not available for complex shaped body→ not suitable for large-scale problem

model coefficient○ applicable to wall boundary○ model coefficient is locally

determined.

Smagorinsky model

→ model coefficient determined by the second invariant ofthe velocity gradient tensor

*H.Kobayashi, Phys. Fluids.17, (2005).*H.Kobayashi, Phys. Fluids.17, (2005).


GP GPUGP GPUComputational AreaComputational Area

Major part of TokyoIncluding Shnjuku-ku, Chiyoda-ku, Minato-ku, Meguro-ku, Chuou-ku,

10km×10km

Building Data:Pasco Co. Ltd.TDM 3D

26

Map ©2012 Google, ZENRIN

Shinjyuku Tokyo

Shinagawa

Shibuya


GP GPUGP GPUWeak ScalabilityWeak Scalability

28

600 TFLOPSon 4000 GPUs


GP GPUGP GPUTSUBAME 2.0 → 2.5 TSUBAME 2.0 → 2.5

Number of GPUs

Perf

orm

ance

[TFL

OPS

]

▲ TSUBAME 2.5 (overlap)● TSUBAME 2.0 (overlap)

TSUBAME 2.51142 TFLOPS(3968 GPUs)

288 GFLOPS/GPU

TSUBAME 2.0149 Tflops(1000 GPUs)149 GFlops / GPU

×1.93


GP GPUGP GPU

30

DriVer: BMW-AudiLehrstuhl für Aerodynamik und StrömungsmechanikTechnische Universität München


GP GPUGP GPU


GP GPUGP GPUDevelopment New MaterialsDevelopment New Materials

Material Microstructure

Dendritic Growth

Mechanical Structure

Improvement of fuel efficiency by reducing the weight of transportation

Developing lightweight strengthening material by controlling microstructure

Low-carbon society


GP GPUGP GPUPhase-Field ModelPhase-Field Model

The phase-field model is derived from non-equilibrium statistical physics and f = 0 represents the phase A and f = 1 for phase B.

Phase-field

0

1

Phase A

diffusive interfacewith finite thickness

Phase B


GP GPUGP GPUAl-Si: Binary AlloyAl-Si: Binary Alloy

Time evolution of the phase-field （Allen-Cahn equation)

Time evolution of the condensation: c


GP GPUGP GPU

35

Finite Difference MethodFinite Difference Method

kji ,,19 points to solve kjic ,,7 points to solve

1 kz

kz

1 kz

Phase Field : Condensation : c


GP GPUGP GPU

Requirement for Peta-scale Computing

Requirement for Peta-scale Computing

36

2D computation×1000 large-scale computation

Previous works

3D computation

Single dendrite ～ mm scale

on TSUBAME 2.0


GP GPUGP GPU

Domain decompositionDomain decomposition

37

x

2D decomposition

sub-domain #1

GPU

CPU(2) CPU → CPU

(1) GPU→CPU (3) CPU→GPUPCI Express (cudaMemcpy)

MPI

sub-domain #2


GP GPUGP GPU

Weak scalingWeak scaling

38

single precision

4096 x 6400 x 128004000 (40 x 100) GPUs16,000 CPU cores

GPU-Only

Hybrid-YZ

Hybrid-Y

（No overlapping）

（y boundary by CPU）

（y,z boundary by CPU）

□

◯

▲

• Mesh size: 4096 x160x128/GPU NVIDIA Tesla M2050 card / Intel Xeon X5670 2.93 GHz on TSUBAME

2.0

Hybrid-Y method 2.0000045 PFlops

GPU: 1.975 PFlopsCPU: 24.69 TFlops

Efficiency 44.5%(2.000 PFlops / 4.497 PFlops)


GP GPUGP GPU

TSUBAME 2.0 → 2.5 性能向上TSUBAME 2.0 → 2.5 性能向上

Mesh size of a subdomain(1GPU + 4 CPU cores): 4096 x 162 x 130

TSUBAME 2.5 3.406 PFLOPS

TSUBAME 2.02.000 PFLOPS4,096 x 6,480 x 13,000

4,096 x 5,022 x 16,640

×1.75

(4,000 GPUs+16,000 CPU cores)

(3,968 GPUs+15,872 CPU cores)


GP GPUGP GPU

41

Dendrite AnalysisDendrite Analysis


GP GPUGP GPU

Contact interaction

Normal direction

Tangential direction

SpringViscosity

Viscosity

Friction

Spring

ijijij xkxF

Simulation for Granular MaterialsSimulation for Granular MaterialsDEM (Distinct Element Method)DEM (Distinct Element Method)


GP GPUGP GPU

De-fragmentation of GPU Memory Load balance among GPUs to keep parallel efficiency

Many particles

Few particles

DEM Simulation on Multi-GPU DEM Simulation on Multi-GPU


GP GPUGP GPU

Particle search on GPU to move boundary

New boundary line

1th search

2th searchOld boundary line

Particles Across BoundariesParticles Across Boundaries

Node1 Node2


GP GPUGP GPU

1 1 1 1 1

1 1 132

Some particles are sent to neighbors.

De-fragmentationDe-fragmentation

Re-numbering is executed on CPU.

1 1 1 1 1

Some particles newly come.

1 1 132 1 1

Node1


GP GPUGP GPU

Every 2000

Without sorting

Every 100 Every 1000 Every 20

Optimization of De-fragmentation frequencyOptimization of De-fragmentation frequency

Computational time increase with linear by re-numbering.

Time variation


GP GPUGP GPU

Computational field are divided into 64 sub-domains.Single GPU is assigned to each sub-domain.

Benchmark TestBenchmark Test


GP GPUGP GPU

Neighbor Particle ListNeighbor Particle List

Local domain0 6

3

0

0 6 3

NULL

87 percent of memory usage is reduced compared to regular neighbor list.

Linked-list methodLinked-list method

Granular Convey

64 GPUs, 5 million particles

粉体の撹拌シミュレーション

64 GPUs, 5 million particles


GP GPUGP GPUGas-Liquid Two-Phase FlowGas-Liquid Two-Phase Flow

■ Navier-Stokes solver：Fractional Step■ Time integration：3rd TVD Runge-Kutta■ Advection term：5th WENO■ Diffusion term：4th FD■ Poisson：MG-BiCGstab■ Surface tension：CSF model■ Surface capture：CLSVOF(THINC + Level-Set)

Particle Methodex. SPH

Mesh Method (Surface Capture)

Low accuracy< 106-7 particles

High accuracy > 108-9 mesh points

not splash

Numerical noise and unphysical oscillationCopyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology

GP GPUGP GPU

: Level-Set function(distance function)

: Heaviside function

The Level-Set methods (LSM) use the signed distancefunction to capture the interface. The interface isrepresented by the zero-level set (zero-contour).

Re-initialization for Level-Set function

Fig. Takehiro Himeno, et. Al., JSME, 65-635,B(1999),pp2333-2340

Level-Set method (LSM)Level-Set method (LSM)

Advantage : Curvature calculation, Interface boundaryDrawback : Volume conservation


GP GPUGP GPU

Continuous Surface Force(CSF) model by Brackbill, Kothe and Zemach (1991)

Continuous Surface Force(CSF) model by Brackbill, Kothe and Zemach (1991)

Surface tension represented by volume force

Surface tensionforce

Normal vector

Curvature

Approximate delta function

The interfacial surface force is transformed to a volume force in the region near the interface via a delta function


GP GPUGP GPU

Anti-diffusive Interface CaptureAnti-diffusive Interface Capture

・VOF(volume of fluid) type interface capturing method・Flux from tangent of hyperbola function・Semi-Lagrangian time integration

[ Xiao, etal, Int. J. Numer. Meth. Fluid. 48(2005)1023 ]

・1D implementation can be applied to 2D & 3D → Simple

・Finite Volume like usage＊ THINC is the method how to compute flux

→ 3 krenel (x, y, z) can be fused to 1 kernel. Merit in memory R/W

THINC (tangent of hyperbola for interface capturing) Scheme


GP GPUGP GPU

55

Sparse Matrix SolverSparse Matrix Solver

Ax = b for

××

××

××

××

××

××

××

××

××

××

××

××

××

××

××

××

××

××

××

×

××

××

××

××

××

××

××

××

××

×

××

××

××

××

××

××

×

×

×

××

××

××

××

××

××

××

××

××

××

××

××

××

××

××

×

×

×

×

×

×

×

×

Krylov sub-space methods:CG, BiCGStab, GMRes, , ,

Pre-conditioner:Incomplete Cholesky, ILU, MG, AMG, Block Diagonal Jacobi

Non-zero Packing:CRS → ELL, JDL

tp

u1


GP GPUGP GPU

56

BiCGStab + MGBiCGStab + MGCollaboration with Mizuho Information & Research Institute0k 0

100 AxbMpr Set

for k = 0; k < N; k++;

k

kk ApMr

rr1

0

0

,, kkkk ApMrq 1

kk

kkk AqMAqM

AqMq11

1

,,

kkkkkk qpxx 1

kkkk AqMqr 11

bbrr ,, 211 kk

kk

kk ApMr

rr1

0

10

,,

kkkkkk ApMprp 111

if exit;

loop end


GP GPUGP GPU

57

MG V-CycleMG V-Cyclen Step n+1 Step

0G

1G

2G

3G

4G

000 SfL Poisosn Eq.

111 RvL Correction Eq.

111 RvL

222 RvL 222 RvL

333 RvL

444 RvL

Restriction

Restriction

Restriction

Restriction Prolongation

333 RvL

Prolongation

Prolongation

Prolongation

Smoother : Red & Black ILUAF ： Fine MatrixAC ： Corse MatrixR : RestrictionP : Prolongation

AC = RAFP

Milk Crown

Drop on dry floor

60


GP GPUGP GPUIndustrial Appl. Steering OilIndustrial Appl. Steering Oil

62

Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology 63

SUMMARYSUMMARY

■ GPU Green computing : less electrical powerto get application results.

■ Peta-scale GPU applications successfully running on TSUBAME2.0/2.5

・ Meso-scale weather prediction ・ Lattice Boltzmann Method・ Gas-Liquid Two-phase flow … on going・ Phase-Field model for developing new

materials

Documents

Low Electronic ConsumptionTitle GPUスパコンTSUBAMEによるペタスケール物理シミュレーション Author 青木 尊之 (東京工業大学・学術国際情報センター

Low Electronic ConsumptionTitle GPUスパコンTSUBAMEによるペタスケール物理シミュレーション Author 青木尊之 (東京工業大学・学術国際情報センター