This is the talk I gave at the 2nd International Symposium “Computer Simulations on GPU” (SimGPU 2013)
Determination of line tension
in the 3D Ising model on GPUs
Benjamin Block, Tobias Preis, David Winter, Suam Kim,
Peter Virnau, Kurt Binder
University of Mainz, Institute for Physics
SimGPU 2013
Topics Covered
1. Ising Model on GPU
2. Line Tension Estimation
Ising Model
Ordered state, random state, and the transition between them
Spins + nearest-neighbor interaction
Monte Carlo
Perform successive spin flips!
Probability: Metropolis criterion
Inherently serial... but
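The serial scheme named on this slide can be sketched as follows. This is a minimal illustration, assuming a 2D square lattice with periodic boundaries and coupling J = 1; the function name `metropolis_sweep` and the lattice layout are choices made for this example, not taken from the talk.

```python
import math
import random


def metropolis_sweep(spins, L, beta, rng, J=1.0):
    """One serial sweep over an L x L lattice with periodic boundaries.

    spins is a flat list of +1/-1 values (hypothetical layout for this sketch).
    Returns the number of accepted spin flips.
    """
    accepted = 0
    for i in range(L):
        for j in range(L):
            s = spins[i * L + j]
            # Sum over the four nearest neighbors.
            nb = (spins[((i + 1) % L) * L + j] + spins[((i - 1) % L) * L + j]
                  + spins[i * L + (j + 1) % L] + spins[i * L + (j - 1) % L])
            dE = 2.0 * J * s * nb  # energy change if spin s is flipped
            # Metropolis criterion: accept with probability min(1, exp(-beta * dE)).
            if dE <= 0.0 or rng.random() < math.exp(-beta * dE):
                spins[i * L + j] = -s
                accepted += 1
    return accepted


rng = random.Random(0)
L = 8
spins = [1] * (L * L)
# Very low temperature: flipping a spin in the fully ordered state is rejected.
n_cold = metropolis_sweep(spins, L, beta=50.0, rng=rng)
```

Each decision reads neighbors that may have been changed earlier in the same sweep, which is why the plain algorithm is serial.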
GPU Implementation
• GPUs: massively parallel processing
• Architecture-specific optimization
• Multi-GPU implementation
T. Preis, P. Virnau, W. Paul, J. J. Schneider: GPU Accelerated Monte Carlo Simulation of the 2D and 3D Ising Model, J. Comp. Phys., 228 (2009)
Parallelization of Lattice Updates
Idea: Update non-interacting domains in parallel
Checkerboard Update
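A checkerboard update can be sketched as below. This is an illustration under assumptions (2D lattice, even edge length L, periodic boundaries, J = 1); the helper names are hypothetical. Sites of one color only have neighbors of the other color, so all sites of one color could be handled by independent GPU threads. The serial loop here only stands in for that parallel step.

```python
import math
import random


def color_of(i, j):
    """Checkerboard color (0 or 1) of site (i, j)."""
    return (i + j) % 2


def neighbors(i, j, L):
    """Four nearest neighbors with periodic boundaries."""
    return [((i + 1) % L, j), ((i - 1) % L, j), (i, (j + 1) % L), (i, (j - 1) % L)]


def update_color(spins, L, beta, color, rng, J=1.0):
    """Metropolis update of every site of one color.

    Each decision reads only sites of the other color, so the loop order
    does not matter and the iterations are independent of each other.
    """
    for i in range(L):
        for j in range(L):
            if color_of(i, j) != color:
                continue
            s = spins[i][j]
            nb = sum(spins[a][b] for a, b in neighbors(i, j, L))
            dE = 2.0 * J * s * nb
            if dE <= 0.0 or rng.random() < math.exp(-beta * dE):
                spins[i][j] = -s


def checkerboard_sweep(spins, L, beta, rng):
    """Full lattice sweep: first one color, then the other."""
    assert L % 2 == 0, "periodic checkerboard needs an even edge length"
    update_color(spins, L, beta, 0, rng)
    update_color(spins, L, beta, 1, rng)


L = 4
lattice = [[1] * L for _ in range(L)]
checkerboard_sweep(lattice, L, beta=0.0, rng=random.Random(0))
```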
Reduce slow memory access
Idea: Store spins in 128-bit (uint4) chunks in global memory
• Access 128 spins with one memory lookup
• One thread extracts the spins into local thread memory (registers) for computation
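The storage idea can be illustrated on the host side as follows. This sketch assumes a bit value of 1 for spin up and 0 for spin down, and models a uint4 as a tuple of four unsigned 32-bit integers; the function names and the bit order are assumptions for this example.

```python
WORD_BITS = 32
WORDS_PER_CHUNK = 4  # a uint4 holds four 32-bit words, 128 bits in total


def pack_chunk(bits):
    """Pack 128 spin bits (0 or 1) into a tuple of four 32-bit words."""
    assert len(bits) == WORD_BITS * WORDS_PER_CHUNK
    words = []
    for w in range(WORDS_PER_CHUNK):
        word = 0
        for b in range(WORD_BITS):
            word |= (bits[w * WORD_BITS + b] & 1) << b
        words.append(word)
    return tuple(words)


def unpack_chunk(chunk):
    """Extract the 128 spin bits again, as a thread would do in registers."""
    return [(chunk[w] >> b) & 1
            for w in range(WORDS_PER_CHUNK)
            for b in range(WORD_BITS)]


example_bits = [(k * 7) % 3 == 0 and 1 or 0 for k in range(128)]
example_chunk = pack_chunk(example_bits)
```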
Update scheme
1. Extract chunk (uint4) in thread
2. Perform computations (draw random number, evaluate Metropolis criterion)
3. Build update pattern
4. Old spins XOR update pattern = new spins (uint4)
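The last step of the update scheme can be sketched like this. For brevity the chunk is modeled as a single Python integer used as a bit field instead of a uint4, and the flip decisions are passed in as a list; both are simplifications made for this example.

```python
def build_update_pattern(decisions):
    """Set bit k of the pattern if spin k was accepted for flipping."""
    pattern = 0
    for k, flip in enumerate(decisions):
        if flip:
            pattern |= 1 << k
    return pattern


def apply_update(old_spins, pattern):
    """XOR flips exactly the spins whose pattern bit is set."""
    return old_spins ^ pattern


old = 0b1010
pattern = build_update_pattern([True, False, False, True])
new = apply_update(old, pattern)
```

One XOR and one write then update the whole chunk, so global memory is touched once per chunk rather than once per spin.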
Multispin Coding?
• Multiple spins are coded in one memory unit (128 spins in 128 bits)
• Computation is not done on the encoded spins in parallel, but serially within each chunk
• Multispin coding algorithms designed for CPUs were not efficient on GPUs
Why?
Multispin Coding
Array of spins (1 bit = 1 spin)
In advance:
• Pool of random patterns
MC step:
• Neighbors (bitwise)
• Judgement function (for each energy level)
• Select one pattern randomly from the pool
• Construct update pattern
• Array of spins XOR update pattern = spins for next step
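The CPU-style scheme above can be sketched as follows. This uses one common multispin-coding layout as an assumption: bit k of every word belongs to an independent replica k of a 2D lattice with J = 1, so one word operation treats 64 replicas at once. The talk may use a different layout; the names `make_pool`, `flip_mask` and `mc_step` are hypothetical.

```python
import random

WIDTH = 64
MASK = (1 << WIDTH) - 1


def make_pool(p, size, rng):
    """Precompute `size` random words whose bits are 1 with probability p."""
    pool = []
    for _ in range(size):
        word = 0
        for b in range(WIDTH):
            if rng.random() < p:
                word |= 1 << b
        pool.append(word)
    return pool


def flip_mask(s, nbrs, r4, r8):
    """Bitwise judgement function for all replicas of one site.

    s: spin word of the site, nbrs: the four neighbor words,
    r4 / r8: random patterns with bit probability exp(-4*beta*J) / exp(-8*beta*J).
    """
    a = [(s ^ n) & MASK for n in nbrs]  # 1 where a neighbor is antiparallel
    any_anti = a[0] | a[1] | a[2] | a[3]
    two_plus = ((a[0] & a[1]) | (a[0] & a[2]) | (a[0] & a[3])
                | (a[1] & a[2]) | (a[1] & a[3]) | (a[2] & a[3]))
    none = ~any_anti & MASK      # energy level dE = 8J
    one = any_anti & ~two_plus   # energy level dE = 4J
    # dE <= 0 (two or more antiparallel neighbors) is always accepted.
    return two_plus | (one & r4) | (none & r8)


def mc_step(s, nbrs, pool4, pool8, rng):
    """Select pooled patterns at random, build the update pattern, XOR it in."""
    r4 = rng.choice(pool4)
    r8 = rng.choice(pool8)
    return s ^ flip_mask(s, nbrs, r4, r8)
```

The pools are filled once before the simulation starts, and `rng.choice` is the random lookup into that precomputed memory on every step.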
Downsides of Pooling
• Impairs quality of simulation (the smaller the
pool the less random)
• Low flexibility (external fields...)
• Relies on a lot of precomputation and random
memory lookups (GPU killer)
Performance
Results from 2011, 2D Ising
GPU: NVIDIA Tesla S1070
CPU: Intel i7 (2.67 GHz, 1 core)
[Bar chart comparing: CPU simple, CPU multispin coding, GPU simple, GPU optimized]
Speedup annotations on the chart: 8x (still one core!), ~ 20x, ~ 200x
Simulation on multiple GPUs
Spread spin lattice over many GPUs in different machines
Exchange border information between machines via MPI
[Figure: simulation domains per GPU, border arrays]
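The border exchange can be illustrated without any message passing. In this sketch the domains are plain dictionaries inside one process and the "exchange" is a list copy; in a real multi-GPU run each copy would be a message between machines. The row-wise decomposition and all names are assumptions for this example.

```python
def split_rows(lattice, n_domains):
    """Split a lattice row-wise into equally sized domains with empty borders."""
    rows = len(lattice)
    assert rows % n_domains == 0
    h = rows // n_domains
    return [{"interior": [row[:] for row in lattice[d * h:(d + 1) * h]],
             "top": None, "bottom": None}
            for d in range(n_domains)]


def exchange_borders(domains):
    """Fill each domain's border arrays from its periodic neighbors."""
    n = len(domains)
    for r, dom in enumerate(domains):
        up = domains[(r - 1) % n]
        down = domains[(r + 1) % n]
        dom["top"] = up["interior"][-1][:]      # last row of the domain above
        dom["bottom"] = down["interior"][0][:]  # first row of the domain below


# Rows are labeled by their index so the borders are easy to inspect.
lattice = [[r] * 4 for r in range(8)]
domains = split_rows(lattice, 4)
exchange_borders(domains)
```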
Multi-GPU Performance
Measure: Single spin flips per GPU
Communication overhead: bottleneck for small system sizes
• 64 GPUs: 256 GB video memory
• Enough for a lattice of 800,000 x 800,000 spins
• One lattice sweep: 3 seconds on pre-Fermi (S1070) hardware
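The numbers on this slide can be checked with a short calculation. It assumes 1 bit per spin, as in the storage scheme described earlier, and counts only the raw spin storage.

```python
def lattice_bytes(side, bits_per_spin=1):
    """Raw storage for a side x side lattice, in bytes."""
    return side * side * bits_per_spin // 8


side = 800_000
n_gpus = 64
sweep_seconds = 3.0

n_spins = side * side                        # 6.4e11 spins
mem_gb = lattice_bytes(side) / 1e9           # raw spin storage in GB
flips_per_gpu_per_s = n_spins / sweep_seconds / n_gpus

print(f"spin storage: {mem_gb:.0f} GB of 256 GB")
print(f"spin flips per GPU per second: {flips_per_gpu_per_s:.2e}")
```

Under this assumption the spins alone need 80 GB, which fits into the 256 GB of the 64 GPUs, and a 3 second sweep corresponds to roughly 3.3 billion single spin flips per GPU per second.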
OpenCL?
Platform independence
Kernels
Idea: Hide language differences in macros
Macros expand to different expressions on each platform
• CUDA (Driver API)
• OpenCL
• Host C
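The macro idea can be sketched as a small source generator. The macro names `KERNEL`, `GLOBAL_PTR` and `GLOBAL_ID`, the kernel `flip_all` and the host variable `host_thread_id` are hypothetical and only illustrate the approach; the platform expressions they expand to (`__global__`, `blockIdx.x * blockDim.x + threadIdx.x`, `__kernel`, `__global`, `get_global_id(0)`) are standard CUDA and OpenCL constructs.

```python
PLATFORM_MACROS = {
    "cuda": {
        "KERNEL": 'extern "C" __global__ void',
        "GLOBAL_PTR": "",
        "GLOBAL_ID": "(blockIdx.x * blockDim.x + threadIdx.x)",
    },
    "opencl": {
        "KERNEL": "__kernel void",
        "GLOBAL_PTR": "__global",
        "GLOBAL_ID": "get_global_id(0)",
    },
    "host_c": {
        "KERNEL": "void",
        "GLOBAL_PTR": "",
        # Hypothetical loop variable supplied by a host-side driver loop.
        "GLOBAL_ID": "host_thread_id",
    },
}

# One kernel source shared by all platforms (hypothetical example kernel).
KERNEL_SOURCE = """
KERNEL flip_all(GLOBAL_PTR unsigned int *spins)
{
    unsigned int i = GLOBAL_ID;
    spins[i] ^= 0xFFFFFFFFu;
}
"""


def build_source(platform):
    """Prepend the platform's #define lines to the shared kernel source."""
    defines = "".join(f"#define {name} {body}\n"
                      for name, body in PLATFORM_MACROS[platform].items())
    return defines + KERNEL_SOURCE


opencl_source = build_source("opencl")
cuda_source = build_source("cuda")
```

The kernel body is written once; only the preamble differs per platform.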
Initialization
• Initialize
• Load “Device Programs” (Kernels) from source
• Create data containers that take care of data transfers
• Run kernel with parameters
• Use data on host
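A data container of this kind could look roughly like the following sketch. The device is simulated by a second Python list, and the class and method names are hypothetical; the point is only that copies happen inside the container when a side asks for the data, not in user code.

```python
class DataContainer:
    """Keeps a host copy and a simulated device copy and syncs them lazily."""

    def __init__(self, values):
        self._host = list(values)
        self._device = None
        self._host_dirty = True      # host holds data the device has not seen
        self._device_dirty = False   # device holds data the host has not seen
        self.copies = 0              # number of transfers, for illustration

    def device_view(self):
        """Data as seen by a kernel; uploads first if the host copy is newer."""
        if self._host_dirty:
            self._device = list(self._host)  # stands in for a host-to-device copy
            self._host_dirty = False
            self.copies += 1
        self._device_dirty = True  # assume the kernel may write to the data
        return self._device

    def host_view(self):
        """Data for read access on the host; downloads first if needed."""
        if self._device_dirty:
            self._host = list(self._device)  # stands in for a device-to-host copy
            self._device_dirty = False
            self.copies += 1
        return self._host


def run_kernel(kernel, container, *params):
    """Run a 'kernel' (here a plain function) on the container's device data."""
    kernel(container.device_view(), *params)


def flip(data, mask):
    """Example kernel: XOR every element with mask."""
    for k in range(len(data)):
        data[k] ^= mask


spins = DataContainer([0, 1, 0])
run_kernel(flip, spins, 1)
result = spins.host_view()
```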
Cross platform performance
3D Ising example
• CPU: i7 Nehalem
• Nvidia: GeForce GTX 580
• AMD: HD 6970
Results
• Downside: lowest common denominator (CUDA has a lot more features by now)
• No explicit copying needed (the containers' job)
• In our case: OpenCL was 10% slower on the NVIDIA card (GeForce GTX 580)
• Slower on a comparable AMD card (Radeon HD 6970)
• Take this with a grain of salt
Nucleation
Nucleation phenomena
• Nucleation is important in materials research, atmospheric science, etc.
Phase 1 (most spins up) → Phase 2 (most spins down)
Induced by nuclei!
Heterogeneous Nucleation
Wall-attached droplet = simulation in the Ising model
Young's equation
Winter D., Virnau P., Binder K., PRL Volume 103 Issue 22 (2009)
Free Energy of Droplet
H = 0, Θ = 90°
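For orientation, the classical theory of heterogeneous nucleation relates the barrier of a wall-attached spherical-cap droplet to the homogeneous barrier through a factor that depends only on the contact angle. The sketch below evaluates that textbook factor; it is not taken from the slides, and the function name is chosen for this example.

```python
import math


def heterogeneous_factor(theta_deg):
    """Classical spherical-cap factor f(theta) = (1 - cos t)^2 (2 + cos t) / 4.

    The heterogeneous barrier is f(theta) times the homogeneous barrier.
    """
    c = math.cos(math.radians(theta_deg))
    return (1.0 - c) ** 2 * (2.0 + c) / 4.0


f_90 = heterogeneous_factor(90.0)
```

At Θ = 90° the factor is 1/2: the droplet is a hemisphere and, in this classical picture, its barrier is half that of a free droplet.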
Line Contribution
A different method...
Antiperiodic boundary conditions force and stabilize an interface
Surface field H > 0, which tilts the interface
Angle is limited by geometry...
Flatten geometry
Flattened geometry in dimension X allows for stronger tilt
[Figure: simulation box with edge lengths Lx, Ly, Lz]
Boundary Condition Implementation
• Simulate one extra chunk in each dimension
• Periodic: Exchange borders
• APBC: Read, XOR 1, Write
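The two boundary rules can be sketched in one dimension. Here a row carries one extra cell at each end and every cell is a single spin bit; with packed chunks the same step would XOR with an all-ones mask instead of 1. The layout and the function name are assumptions for this example.

```python
def fill_border_cells(cells, antiperiodic=False):
    """cells = [left border, s_0, ..., s_(n-1), right border], each entry 0 or 1.

    Periodic: copy the spin from the opposite edge.
    Antiperiodic (APBC): read the same spin, XOR with 1, write.
    """
    flip = 1 if antiperiodic else 0
    cells[0] = cells[-2] ^ flip   # left border from the last interior spin
    cells[-1] = cells[1] ^ flip   # right border from the first interior spin
    return cells


periodic = fill_border_cells([0, 1, 0, 0, 0])
antiperiodic = fill_border_cells([0, 1, 0, 0, 0], antiperiodic=True)
```

The update kernel can then treat border cells like any other neighbor, so the interior code needs no special cases for the boundary.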
Thermodynamic integration
• Vary box size in all dimensions
• Measure free energies of surfaces by integration over magnetization
• Expressions can be derived for the free energy differences in each dimension
[Equations (1), (2), (3), shown together with Young's equation]
• Combination of the first two expressions allows extraction of the line tension
• These can be combined into an expression for the line tension
[Combined equation from (1), (2), (3)]
Putting it together
Kim et al. (2011)
T = 3.0
[Figure: density profile, side view and top view]
3D system: 56 x 120 x 120 spins
Conclusion
• Direct method to measure line tension for tilted
surfaces
• Our first real-world use of the Ising model on GPUs
• Optimization is important (CPU and GPU) for
fair comparison
• Platform independence is possible (useful?)
• The Ising model is a good candidate for parallel
processing on GPU clusters
Publications
• Monte Carlo Test of the Classical Theory for Heterogeneous Nucleation Barriers
Winter, D., Virnau, P., Binder, K., Phys. Rev. Lett. 103, 22 (2009)
• Multi-GPU Accelerated Multi-Spin Monte Carlo Simulations of the 2D Ising Model
Block, B., Virnau, P., Preis, T., Computer Physics Communications, Volume 181, Issue 9 (2010)
• Monte Carlo Methods for Estimating Interfacial Free Energies and Line Tensions
Binder, K., Block, B., Das, S. K., Virnau, P., Winter, D., J. Stat. Phys. (2011)
• Platform independent, efficient implementation of the Ising model on parallel acceleration devices
Block, B. J., Eur. Phys. J. Spec. Top. (2012)