Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
GPUスパコンTSUBAMEによるペタスケール物理シミュレーションGPUスパコンTSUBAMEによる
ペタスケール物理シミュレーション
1
東京工業大学 学術国際情報センター
青 木 尊 之
Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
2
What is GPU? What is GPU?
High PerformanceHigh Performance
InexpensiveInexpensive
Low Electronic ConsumptionLow Electronic Consumption
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
MEXT MinisterAward(2012)
10TF
100TF
1PF
10PF
100PF
2007 2009 2011 2013
TSUBAME1.1 1.2
2.0
K‐Computer10.5 PF
2.5
4位 5位 5位 14位
7位 9位 14位
TSUBAME SupercomputerTSUBAME Supercomputer
25位 30位41位 64位
17PFSingle precision
Graph 500No.3 (2011)
TSUBAME 2.0
49.5TF
109.7TF163.2TF
2287.6TF
No.1 in Japan
Wire
Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
4
TOP 500 Ranking 2013 November
MIC Xeon Phi
GPU K20X
MIC Xeon Phi
GPU K20X
GPU K20X
Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUGPU: Fermi core → Kepler coreGPU: Fermi core → Kepler core
5
TSUBAME 2.0NVIDIA Tesla M2050×4224
TSUBAME 2.5NVIDIA Tesla K20X×4224
Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Compute Node(2 CPUs, 3 GPUs)
Performance: 4.08 TFLOPSMemory: 58.0GB(CPU)
+18 GB(GPU)
Rack (30 nodes)
Performance: 122 TFLOPSMemory: 2.28 TB
System (58 racks)1442 nodes: 2952 CPU sockets,
4264 GPUs
Performance: 224.7 TFLOPS (CPU) ※ Turbo boost5.562 PFLOPS (GPU)
Total: 5.7 PFLOPS (double precision)17.1 PFLOPS (single precision)
Memory: 116 TB
TSUBAME 2.5TSUBAME 2.5
Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Green500 Ranking 2013 NovemberGreen500 Ranking 2013 November
Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUCUDACUDA
(Compute Unified Device Architecture)
8
■ GPU Computing (GPGPU) framework
C/C++ , FORTRAN Language + GPGPU extension(Compiler)Runtime APILibraries, Documentation, sample programs
CUDA is available for NVIDIA GPUs
■ Currently CUDA 5.5 is released.
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUBandwidth BottlenecksBandwidth Bottlenecks
GPU
PCI Express 2.0 x1664 Gbps (8 GB/sec)
Memory
VRAM
DDR3-133332 GB/sec Tesla K20X
InfiniBand QDR32 Gbps (4 GB/sec)
1000BASE-T0.125 GB/sec
9
Memory Interface 384bit250 GB/sec
■ Memory Hierarchy■ Memory Hierarchy
CPU
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Domain decompositionDomain decomposition
10
x
2D decomposition
sub-domain #1
GPU
CPU(2) CPU → CPU
(1) GPU→CPU (3) CPU→GPUPCI Express (cudaMemcpy)
MPI
sub-domain #2
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Overlapping between Computation and Communication
Overlapping between Computation and Communication
SendBuffer MPI RecBufferSync
Stream 2
Stream 1
Asynchronous Data transfer Asynchronous Data transfer
11 Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Japanese
Weather NewsJapanese
Weather News
気象庁(http://www.jma.go.jp/jma/index.html)
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUWeather PredictionWeather Prediction
Meso-scale Atmosphere Model:Cloud Resolving Non-hydrostatic modelCompressible equation taking consideration of sound waves.
Next Generation
a few kmMeso-scale
2000 kmTyphoon Tornado, Down burst
Heavy Rain
Collaboration: Japan Meteorological Agency
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUAtmosphere ModelAtmosphere Model
Dynamical Process:Full 3-D Navior-Stokes Equation
Physical Process:Cloud Physics, Moist, Solar Radiation, Condensation, Latent heat release, Chemical Process, Boundary Layer
“Parameterization” including sin, cos, exp, …in empirical rules.
Fgruuuu
)(21 P
t
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
*J. Michalakes, and M. Vachharajani: GPU Acceleration of Numerical Weather Prediction. Parallel Processing Letters Vol. 18 No. 4. World Scientific. Dec. 2008. pp. 531—548
**John C. Linford, John Michalakes, Manish Vachharijani, and Adrian Sandu. Multi-core acceleration of chemical kinetics for simulation and prediction, proceedings of the 2009 ACM/IEEE conference on supercomputing (SC'09), ACM, 2009.
WRF GPU ComputingWRF GPU ComputingWRF (Weather Research and Forecast)
WSM5 (WRF Single Moment 5-tracer) Microphysics*Represents condensation, precipitation and thermodynamic effects of latent heat release
1 % of lines of code, 25 % of elapsed time ⇒ 20 x boost in microphysics (1.2 - 1.3 x overall improvement)
WRF-Chem** provides the capability to simulate chemistry and aerosols from cloud scales to regional ⇒ x 8.5 increase
15
GPU
Dynamics PhysicsAccelerator Approach
Initial condition output
CPU
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUFull-GPU Implementation: ASUCAFull-GPU Implementation: ASUCA
16
Full GPU Approach
GPU
Dynamics PhysicsInitial condition
output
CPU
J. Ishida, C. Muroi, K. Kawano, Y. Kitamura, Development of a new nonhydrostaticmodel “ASUCA” at JMA, CAS/JSC WGNE Reserch Activities in Atomospheric and Oceanic Modelling.
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
17
Rewrite from Scratch
Original code at JMA
z,x,y (k,i,j)-ordering
FortranFortran
Changingarray order
C/C++C/C++
x,z,y (i,k,j)-ordering
CUDACUDA
GPU code
x,z,y (i,k,j)-ordering
Introducing many optimizations, overlapping the computation with the communication, kernel fuse, re-ordering kernel, . . .
1 Year by one Ph.D student
Entire Porting Fortran to CUDAEntire Porting Fortran to CUDA
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
18
TSUBAME 2.0 (1 GPU)TSUBAME 2.0 (1 GPU)
50x speedup compared to 1 CPU core
12x speedup to compared to XeonX5670 1 socket
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
19
TSUBAME 2.0 Weak ScalingTSUBAME 2.0 Weak Scaling
145.0 TflopsSingle precision
76.1 TflopsDoublele precision
Fermi core TeslaM2050
3990 GPU
■ SC’10 Technical PaperBest Student Award finalist
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
20
ASUCA Typhoon Simulation5km-horizontal resolution 479×466×48 ASUCA Typhoon Simulation5km-horizontal resolution 479×466×48
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
21
ASUCA Typhoon Simulation500m-horizontal resolution 4792×4696×48Using 437 GPUs
ASUCA Typhoon Simulation500m-horizontal resolution 4792×4696×48Using 437 GPUs
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Air flow in a wide area of Tokyo by using Lattice Boltzmann Method
Air flow in a wide area of Tokyo by using Lattice Boltzmann Method
22
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
23
Lattice Boltzmann MethodLattice Boltzmann Method eq
iiiii ffftf
1e
uuueue 2
242 2
32931
cccwf iii
eqi
12
3
4
5
6
78
9
1112
1314
15
16
17
18
10
Collision step: Streaming step:
Strongly Memory Bound Problem:
relaxation time
: kinetic viscosity (SGS)
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
LES (Large-Eddy Simulation)LES (Large-Eddy Simulation)
Molecular viscosity andEddy viscosity
Energy spectrum
SGSSGSGSGS
Relaxation time for LES model
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUSGS ModelSGS Model
Dynamic Smagorinsky model
Coherent-Structure Smagorinsky model
○ Simpleinaccurate for the flow with wall boundaryemperical tuning for the constant model coefficient
○ applicable to wall boundarycomplicated calculationaverage process over the wide area→ not available for complex shaped body→ not suitable for large-scale problem
model coefficient○ applicable to wall boundary○ model coefficient is locally
determined.
Smagorinsky model
→ model coefficient determined by the second invariant ofthe velocity gradient tensor
*H.Kobayashi, Phys. Fluids.17, (2005).*H.Kobayashi, Phys. Fluids.17, (2005).
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUComputational AreaComputational Area
Major part of TokyoIncluding Shnjuku-ku, Chiyoda-ku, Minato-ku, Meguro-ku, Chuou-ku,
10km×10km
Building Data:Pasco Co. Ltd.TDM 3D
26
Map ©2012 Google, ZENRIN
Shinjyuku Tokyo
Shinagawa
Shibuya
27 Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUWeak ScalabilityWeak Scalability
28
600 TFLOPSon 4000 GPUs
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUTSUBAME 2.0 → 2.5 TSUBAME 2.0 → 2.5
Number of GPUs
Perf
orm
ance
[TFL
OPS
]
▲ TSUBAME 2.5 (overlap)● TSUBAME 2.0 (overlap)
TSUBAME 2.51142 TFLOPS(3968 GPUs)
288 GFLOPS/GPU
TSUBAME 2.0149 Tflops(1000 GPUs)149 GFlops / GPU
×1.93
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
30
DriVer: BMW-AudiLehrstuhl für Aerodynamik und StrömungsmechanikTechnische Universität München
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
31 Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUDevelopment New MaterialsDevelopment New Materials
Material Microstructure
Dendritic Growth
Mechanical Structure
Improvement of fuel efficiency by reducing the weight of transportation
Developing lightweight strengthening material by controlling microstructure
Low-carbon society
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUPhase-Field ModelPhase-Field Model
The phase-field model is derived from non-equilibrium statistical physics and f = 0 represents the phase A and f = 1 for phase B.
Phase-field
0
1
Phase A
diffusive interfacewith finite thickness
Phase B
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUAl-Si: Binary AlloyAl-Si: Binary Alloy
Time evolution of the phase-field (Allen-Cahn equation)
Time evolution of the condensation: c
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
35
Finite Difference MethodFinite Difference Method
kji ,,19 points to solve kjic ,,7 points to solve
1 kz
kz
1 kz
Phase Field : Condensation : c
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Requirement for Peta-scale Computing
Requirement for Peta-scale Computing
36
2D computation×1000 large-scale computation
Previous works
3D computation
Single dendrite ~ mm scale
on TSUBAME 2.0
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Domain decompositionDomain decomposition
37
x
2D decomposition
sub-domain #1
GPU
CPU(2) CPU → CPU
(1) GPU→CPU (3) CPU→GPUPCI Express (cudaMemcpy)
MPI
sub-domain #2
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Weak scalingWeak scaling
38
single precision
4096 x 6400 x 128004000 (40 x 100) GPUs16,000 CPU cores
GPU-Only
Hybrid-YZ
Hybrid-Y
(No overlapping)
(y boundary by CPU)
(y,z boundary by CPU)
□
◯
▲
• Mesh size: 4096 x160x128/GPU NVIDIA Tesla M2050 card / Intel Xeon X5670 2.93 GHz on TSUBAME
2.0
Hybrid-Y method 2.0000045 PFlops
GPU: 1.975 PFlopsCPU: 24.69 TFlops
Efficiency 44.5%(2.000 PFlops / 4.497 PFlops)
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
TSUBAME 2.0 → 2.5 性能向上TSUBAME 2.0 → 2.5 性能向上
Mesh size of a subdomain(1GPU + 4 CPU cores): 4096 x 162 x 130
TSUBAME 2.5 3.406 PFLOPS
TSUBAME 2.02.000 PFLOPS4,096 x 6,480 x 13,000
4,096 x 5,022 x 16,640
×1.75
(4,000 GPUs+16,000 CPU cores)
(3,968 GPUs+15,872 CPU cores)
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
41
Dendrite AnalysisDendrite Analysis
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Contact interaction
Normal direction
Tangential direction
SpringViscosity
Viscosity
Friction
Spring
ijijij xkxF
Simulation for Granular MaterialsSimulation for Granular MaterialsDEM (Distinct Element Method)DEM (Distinct Element Method)
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
De-fragmentation of GPU Memory Load balance among GPUs to keep parallel efficiency
Many particles
Few particles
DEM Simulation on Multi-GPU DEM Simulation on Multi-GPU
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Particle search on GPU to move boundary
New boundary line
1th search
2th searchOld boundary line
Particles Across BoundariesParticles Across Boundaries
Node1 Node2
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
1 1 1 1 1
1 1 132
Some particles are sent to neighbors.
De-fragmentationDe-fragmentation
Re-numbering is executed on CPU.
1 1 1 1 1
Some particles newly come.
1 1 132 1 1
Node1
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Every 2000
Without sorting
Every 100 Every 1000 Every 20
Optimization of De-fragmentation frequencyOptimization of De-fragmentation frequency
Computational time increase with linear by re-numbering.
Time variation
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Computational field are divided into 64 sub-domains.Single GPU is assigned to each sub-domain.
Benchmark TestBenchmark Test
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Neighbor Particle ListNeighbor Particle List
Local domain0 6
3
0
0 6 3
NULL
87 percent of memory usage is reduced compared to regular neighbor list.
Linked-list methodLinked-list method
Granular Convey
64 GPUs, 5 million particles
粉体の撹拌シミュレーション
64 GPUs, 5 million particles
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUGas-Liquid Two-Phase FlowGas-Liquid Two-Phase Flow
■ Navier-Stokes solver:Fractional Step■ Time integration:3rd TVD Runge-Kutta■ Advection term:5th WENO■ Diffusion term:4th FD■ Poisson:MG-BiCGstab■ Surface tension:CSF model■ Surface capture:CLSVOF(THINC + Level-Set)
Particle Methodex. SPH
Mesh Method (Surface Capture)
Low accuracy< 106-7 particles
High accuracy > 108-9 mesh points
not splash
Numerical noise and unphysical oscillationCopyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
: Level-Set function(distance function)
: Heaviside function
The Level-Set methods (LSM) use the signed distancefunction to capture the interface. The interface isrepresented by the zero-level set (zero-contour).
Re-initialization for Level-Set function
Fig. Takehiro Himeno, et. Al., JSME, 65-635,B(1999),pp2333-2340
Level-Set method (LSM)Level-Set method (LSM)
Advantage : Curvature calculation, Interface boundaryDrawback : Volume conservation
Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Continuous Surface Force(CSF) model by Brackbill, Kothe and Zemach (1991)
Continuous Surface Force(CSF) model by Brackbill, Kothe and Zemach (1991)
Surface tension represented by volume force
Surface tensionforce
Normal vector
Curvature
Approximate delta function
The interfacial surface force is transformed to a volume force in the region near the interface via a delta function
Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
Anti-diffusive Interface CaptureAnti-diffusive Interface Capture
・VOF(volume of fluid) type interface capturing method・Flux from tangent of hyperbola function・Semi-Lagrangian time integration
[ Xiao, etal, Int. J. Numer. Meth. Fluid. 48(2005)1023 ]
・1D implementation can be applied to 2D & 3D → Simple
・Finite Volume like usage* THINC is the method how to compute flux
→ 3 krenel (x, y, z) can be fused to 1 kernel. Merit in memory R/W
THINC (tangent of hyperbola for interface capturing) Scheme
Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
55
Sparse Matrix SolverSparse Matrix Solver
Ax = b for
××
××
××
××
××
××
××
××
××
××
××
××
××
××
××
××
××
××
××
×
××
××
××
××
××
××
××
××
××
×
××
××
××
××
××
××
×
×
×
××
××
××
××
××
××
××
××
××
××
××
××
××
××
××
×
×
×
×
×
×
×
×
Krylov sub-space methods:CG, BiCGStab, GMRes, , ,
Pre-conditioner:Incomplete Cholesky, ILU, MG, AMG, Block Diagonal Jacobi
Non-zero Packing:CRS → ELL, JDL
tp
u1
Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
56
BiCGStab + MGBiCGStab + MGCollaboration with Mizuho Information & Research Institute0k 0
100 AxbMpr Set
for k = 0; k < N; k++;
k
kk ApMr
rr1
0
0
,, kkkk ApMrq 1
kk
kkk AqMAqM
AqMq11
1
,,
kkkkkk qpxx 1
kkkk AqMqr 11
bbrr ,, 211 kk
kk
kk ApMr
rr1
0
10
,,
kkkkkk ApMprp 111
if exit;
loop end
Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPU
57
MG V-CycleMG V-Cyclen Step n+1 Step
0G
1G
2G
3G
4G
000 SfL Poisosn Eq.
111 RvL Correction Eq.
111 RvL
222 RvL 222 RvL
333 RvL
444 RvL
Restriction
Restriction
Restriction
Restriction Prolongation
333 RvL
Prolongation
Prolongation
Prolongation
Smoother : Red & Black ILUAF : Fine MatrixAC : Corse MatrixR : RestrictionP : Prolongation
AC = RAFP
Milk Crown
Drop on dry floor
60
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
GP GPUGP GPUIndustrial Appl. Steering OilIndustrial Appl. Steering Oil
62
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology 63
SUMMARYSUMMARY
■ GPU Green computing : less electrical powerto get application results.
■ Peta-scale GPU applications successfully running on TSUBAME2.0/2.5
・ Meso-scale weather prediction ・ Lattice Boltzmann Method・ Gas-Liquid Two-phase flow … on going・ Phase-Field model for developing new
materials