Stratified Magnetohydrodynamics Accelerated Using GPUs: SMAUG
The Sheffield Advanced Code
• The Sheffield Advanced Code (SAC) is a novel fully non-linear MHD code, based on the Versatile Advection Code (VAC), designed for simulations of linear and non-linear wave propagation in gravitationally strongly stratified magnetised plasma. Shelyag, S.; Fedun, V.; Erdélyi, R., Astronomy and Astrophysics, Volume 486, Issue 2, 2008, pp. 655-662.
Full Perturbed MHD Equations for Stratified Media
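As a hedged sketch of the formulation behind these equations, following the SAC paper cited above (Shelyag et al. 2008): each variable is split into a static background part (subscript b) and a perturbed part, e.g.

    \rho = \rho_b + \tilde{\rho}, \qquad e = e_b + \tilde{e}, \qquad \mathbf{B} = \mathbf{B}_b + \tilde{\mathbf{B}},

so that, for example, the continuity equation is solved for the perturbed density:

    \frac{\partial \tilde{\rho}}{\partial t} + \nabla \cdot \left[ (\rho_b + \tilde{\rho})\, \mathbf{v} \right] = 0.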
Numerical Diffusion
• Central differencing can generate numerical instabilities
• It is difficult to find solutions for shocked systems
• We define a hyperviscosity parameter as the ratio of the third-order to the first-order forward difference of a variable
• By tracking the evolution of the hyperviscosity we can identify numerical noise and apply smoothing where necessary (see the sketch below)
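A minimal C sketch of this ratio for a 1D array w (hypothetical names; the actual SMAUG hyperviscosity computation also involves reductions to find grid maxima):

    #include <math.h>

    /* Ratio of the third-order to the first-order forward difference of w
       at index i; large values flag grid-scale numerical noise.
       The caller must ensure i+3 is in range. */
    double hyperviscosity_ratio(const double *w, int i)
    {
        double d3 = fabs(w[i+3] - 3.0 * w[i+2] + 3.0 * w[i+1] - w[i]);
        double d1 = fabs(w[i+1] - w[i]);
        return (d1 > 0.0) ? d3 / d1 : 0.0;
    }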
Why MHD Using GPUs?
[Figure: nine-point stencil F(i±1, j±1) centred on F(i, j), used in the central-difference time update of F]
• Excellent scaling with GPUs, but:
• Central differencing requires numerical stabilisation
• Stabilisation with GPUs is trickier; it requires:
  – A reduction/maximum routine
  – An additional and larger mesh
• Consider a simplified 2D problem
• Solving the flux equation, as sketched below:
  – Derivative using central differencing
  – Time step using Runge-Kutta
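A minimal sketch in C of this scheme, assuming a simple advection-type flux equation dF/dt = -(vx dF/dx + vy dF/dy); SMAUG's actual equations and integrator are more elaborate, and all names here are illustrative:

    #include <string.h>

    #define N 200   /* grid size (illustrative) */

    /* Right-hand side at interior point (i, j), central differences. */
    static double rhs(double F[N][N], int i, int j,
                      double vx, double vy, double dx, double dy)
    {
        double dFdx = (F[i+1][j] - F[i-1][j]) / (2.0 * dx);
        double dFdy = (F[i][j+1] - F[i][j-1]) / (2.0 * dy);
        return -(vx * dFdx + vy * dFdy);
    }

    /* One two-stage (midpoint) Runge-Kutta step; Fmid is scratch storage. */
    void rk2_step(double F[N][N], double Fmid[N][N],
                  double vx, double vy, double dx, double dy, double dt)
    {
        memcpy(Fmid, F, sizeof(double) * N * N);   /* keep boundary values */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                Fmid[i][j] = F[i][j] + 0.5 * dt * rhs(F, i, j, vx, vy, dx, dy);
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                F[i][j] += dt * rhs(Fmid, i, j, vx, vy, dx, dy);
    }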
Halo Messaging
• Each processor has a "ghost" layer
  – Used in calculation of the update
  – Obtained from the neighbouring left and right processors
  – Pass the top and bottom layers to the neighbouring processors, where they become the neighbours' ghost layers
• Distribute the rows over the processors, N/nproc rows per processor
  – Every processor stores all N columns
• SMAUG-MPI implements messaging using a 2D halo model for 2D problems and a 3D halo model for 3D problems (see the topology sketch below)
• Consider a 2D model: for simplicity, distribute the layers over a line of processes
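One plausible way to set up the process grid and neighbour ranks for such a halo model, sketched with MPI's Cartesian topology helpers (illustrative only; the slides do not specify how SMAUG builds its process grid):

    #include <mpi.h>

    /* Build a 2D process grid and find halo-exchange neighbours.
       Call after MPI_Init; non-existent neighbours come back as
       MPI_PROC_NULL, which MPI send/recv calls silently ignore. */
    void make_grid(MPI_Comm *cart, int *left, int *right, int *down, int *up)
    {
        int dims[2] = {0, 0}, periods[2] = {0, 0}, nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Dims_create(nprocs, 2, dims);          /* e.g. 4x4 for 16 ranks */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, cart);
        MPI_Cart_shift(*cart, 0, 1, left, right);  /* x-direction neighbours */
        MPI_Cart_shift(*cart, 1, 1, down, up);     /* y-direction neighbours */
    }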
[Figure: rows 1 to N, plus ghost rows 0 and N+1, distributed over Processors 1-4; each processor sends its top and bottom layers (p1min, p2max, etc.) to its neighbours and receives the corresponding layers back as its ghost layers]
MPI Implementation
• Based on the halo messaging technique employed in the SAC code:

void exchange_halo(vector v) {
    gather halo data from v into gpu_buffer1
    cudaMemcpy(host_buffer1, gpu_buffer1, ...);
    MPI_Isend(host_buffer1, ..., destination, ...);
    MPI_Irecv(host_buffer2, ..., source, ...);
    MPI_Waitall(...);
    cudaMemcpy(gpu_buffer2, host_buffer2, ...);
    scatter halo data from gpu_buffer2 to halo regions in v
}
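Filling in the pseudocode above, a hedged runnable sketch of one such exchange for a single neighbour pair, assuming the halo has already been packed into contiguous device buffers (buffer and parameter names are illustrative, not SMAUG's):

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Host-staged halo exchange: device -> host, MPI over host buffers,
       then host -> device. 'count' is the number of doubles per halo. */
    void exchange_halo_staged(double *gpu_send, double *gpu_recv,
                              double *host_send, double *host_recv,
                              int count, int dest, int src, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        cudaMemcpy(host_send, gpu_send, count * sizeof(double),
                   cudaMemcpyDeviceToHost);
        MPI_Isend(host_send, count, MPI_DOUBLE, dest, 0, comm, &reqs[0]);
        MPI_Irecv(host_recv, count, MPI_DOUBLE, src, 0, comm, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        cudaMemcpy(gpu_recv, host_recv, count * sizeof(double),
                   cudaMemcpyHostToDevice);
    }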
Halo Messaging with GPU Direct

void exchange_halo(vector v) {
    gather halo data from v into gpu_buffer1
    MPI_Isend(gpu_buffer1, ..., destination, ...);
    MPI_Irecv(gpu_buffer2, ..., source, ...);
    MPI_Waitall(...);
    scatter halo data from gpu_buffer2 to halo regions in v
}

• Simpler, faster call structure
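The same exchange under a CUDA-aware MPI build (a sketch, with the same illustrative names as above): the device pointers go straight into the MPI calls and both staging copies disappear.

    #include <mpi.h>

    /* GPU-direct halo exchange: MPI reads from and writes to device
       memory directly, so no host buffers are needed. */
    void exchange_halo_direct(double *gpu_send, double *gpu_recv,
                              int count, int dest, int src, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        MPI_Isend(gpu_send, count, MPI_DOUBLE, dest, 0, comm, &reqs[0]);
        MPI_Irecv(gpu_recv, count, MPI_DOUBLE, src, 0, comm, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }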
Progress with MPI Implementation
• Successfully running two-dimensional models under GPU direct
  – Wilkes GPU cluster at The University of Cambridge
  – N8 GPU Facility, Iceberg
• The 2D MPI version is verified
• Currently optimising communications performance under GPU direct
• The 3D MPI version is already implemented but still requires testing
Orszag-Tang Test
[Figure: 200x200 model at t = 0.1 s, 0.26 s, 0.42 s and 0.58 s]
A Model of Wave Propagation in the Magnetised Solar Atmosphere
• The model features a flux tube with a torsional driver in a fully stratified quiet solar atmosphere based on VALIIIC
• The grid size is 128x128x128, representing a box in the solar atmosphere of dimensions 1.5x2x2 Mm
• The flux tube has a magnetic field strength of 1000 G
• The driver amplitude is 200 km/s
Timing for Orszag-Tang Using SAC/SMAUG with Different Architectures

[Figure: time for 100 iterations (seconds) versus grid dimension (up to 5000), comparing NVIDIA M2070, NVIDIA K20, Intel E5 2670 8c, NVIDIA K40, K20(2x2) and K20(4x4)]
Performance Results (Hyperdiffusion disabled)

Grid size (number of GPUs)   With GPU direct (s)   Without GPU direct (s)
1000x1000 (1)                31.54                 31.5
1000x1000 (2x2)              11.28                 11.19
1000x1000 (4x4)              12.89                 13.7
2044x2044 (2x2)              41.3                  41.32
2044x2044 (4x4)              42.4                  43.97
4000x4000 (4x4)              77.37                 77.44
8000x8000 (8x8)              63.3                  61.7
8000x8000 (10x10)            41.9                  41.0

• Timings in seconds for 100 iterations (Orszag-Tang test)
Performance Results (With Hyperdiffusion enabled)

Grid size (number of GPUs)   Without GPU direct (s)
2044x2044 (2x2)              184.1
2044x2044 (4x4)              199.89
4000x4000 (4x4)              360.71
8000x8000 (8x8)              253.8
8000x8000 (10x10)            163.6

• Timings in seconds for 100 iterations (Orszag-Tang test)
Conclusions
• We have demonstrated that we can successfully compute large problems by distributing them across multiple GPUs
• For 2D problems the performance of messaging with and without GPU direct is similar
  – This is expected to change when 3D models are tested
• It is likely that much of the communications overhead arises from the routines used to transfer data within GPU memory
  – Performance enhancements are possible through modification of the application architecture
• Further work is needed with larger models for comparison with the x86 MPI implementation
• The algorithm has been implemented in 3D; testing of 3D models will be undertaken over the forthcoming weeks