Determination of line tension in the 3D Ising model on GPUs

Benjamin Block, Tobias Preis, David Winter, Suam Kim,

Peter Virnau, Kurt Binder

University of Mainz, Institute for Physics

SimGPU 2013

Topics Covered

1. Ising Model on GPU

2. Line Tension Estimation

Ising Model

Ordered → Random transition

+ nearest-neighbor interaction

Monte Carlo

Perform successive spin flips!

Probability: Metropolis criterion
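A minimal sketch of successive single-spin flips with the Metropolis criterion, assuming a small periodic 2D lattice with coupling J = 1 and no external field. The function names, lattice size, and inverse temperature `beta` are illustrative choices, not taken from the talk.

```python
import math
import random

def metropolis_sweep(spins, L, beta, rng, J=1.0):
    """Attempt L*L sequential single-spin flips on a periodic L x L lattice."""
    for _ in range(L * L):
        x, y = rng.randrange(L), rng.randrange(L)
        s = spins[x][y]
        # Sum of the four nearest neighbors (periodic boundaries).
        nb = (spins[(x + 1) % L][y] + spins[(x - 1) % L][y]
              + spins[x][(y + 1) % L] + spins[x][(y - 1) % L])
        dE = 2.0 * J * s * nb  # energy change if this spin is flipped
        # Metropolis criterion: always accept downhill moves,
        # accept uphill moves with probability exp(-beta * dE).
        if dE <= 0 or rng.random() < math.exp(-beta * dE):
            spins[x][y] = -s

def magnetization(spins):
    """Mean spin value of the lattice."""
    n = sum(len(row) for row in spins)
    return sum(sum(row) for row in spins) / n

rng = random.Random(1)
L = 16
spins = [[1] * L for _ in range(L)]  # start fully ordered
for _ in range(50):
    metropolis_sweep(spins, L, beta=1.0, rng=rng)
m_cold = magnetization(spins)  # stays close to 1 well below the critical temperature
```

Each flip depends on the current state of its neighbors, which is why the plain algorithm is serial.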

Inherently serial... but

GPU Implementation

• GPUs: massively parallel processing

T. Preis, P. Virnau, W. Paul, J. J. Schneider: GPU Accelerated Monte Carlo Simulation of the 2D and 3D Ising Model, J. Comp. Phys. 228 (2009)

• Architecture specific optimization

• Multi GPU implementation

Parallelization of Lattice Updates

Idea: Update non-interacting domains in parallel

Checkerboard Update
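A small sketch of why the checkerboard decomposition permits parallel updates, assuming a periodic 2D lattice with even side length and one pre-drawn random number per site (both are assumptions for this illustration). Sites of one color only read sites of the other color, so the visiting order within a color cannot change the result.

```python
import math
import random

def checkerboard_half_sweep(spins, L, beta, color, uniforms, reverse=False):
    """Update every site with (x + y) % 2 == color using the Metropolis criterion."""
    sites = [(x, y) for x in range(L) for y in range(L) if (x + y) % 2 == color]
    if reverse:
        sites.reverse()  # different visiting order, same outcome expected
    for x, y in sites:
        s = spins[x][y]
        # All four neighbors have the opposite color and are not modified here.
        nb = (spins[(x + 1) % L][y] + spins[(x - 1) % L][y]
              + spins[x][(y + 1) % L] + spins[x][(y - 1) % L])
        dE = 2 * s * nb
        if dE <= 0 or uniforms[x][y] < math.exp(-beta * dE):
            spins[x][y] = -s

rng = random.Random(7)
L = 8  # even, so the two colors tile the periodic lattice consistently
start = [[rng.choice((-1, 1)) for _ in range(L)] for _ in range(L)]
uniforms = [[rng.random() for _ in range(L)] for _ in range(L)]

a = [row[:] for row in start]
b = [row[:] for row in start]
checkerboard_half_sweep(a, L, 0.4, 0, uniforms)
checkerboard_half_sweep(b, L, 0.4, 0, uniforms, reverse=True)
same_result = (a == b)
```

On a GPU the loop over `sites` would be replaced by one thread per site or per chunk. The sequential loop here only demonstrates the order independence.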

Reduce slow memory access


Idea: Store spins in 128-bit (uint4) chunks

Figure: uint4 blocks in global memory, one thread per block

Access 128 spins with one memory lookup

Extract spins into local thread memory (registers) for computation
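The storage idea can be sketched with Python integers standing in for a 128-bit chunk. The bit convention (bit set = spin up) and the helper names are assumptions for illustration. The split into four 32-bit words mirrors the four components of a uint4.

```python
CHUNK_BITS = 128

def pack_spins(spins):
    """Pack 128 spins (+1 / -1) into one integer; bit i set means spin i is up."""
    if len(spins) != CHUNK_BITS:
        raise ValueError("expected exactly 128 spins")
    word = 0
    for i, s in enumerate(spins):
        if s == 1:
            word |= 1 << i
    return word

def unpack_spins(word):
    """Extract the 128 spins again, as a thread would do into registers."""
    return [1 if (word >> i) & 1 else -1 for i in range(CHUNK_BITS)]

def as_uint4(word):
    """Split the chunk into four 32-bit components (x, y, z, w)."""
    mask = 0xFFFFFFFF
    return tuple((word >> (32 * k)) & mask for k in range(4))

example_spins = [1 if i % 3 == 0 else -1 for i in range(CHUNK_BITS)]
chunk = pack_spins(example_spins)
components = as_uint4(chunk)
```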

Update scheme

1. Extract uint4 chunk in thread

2. Perform computations in thread (draw random number, evaluate Metropolis criterion)

3. Build update pattern

4. Old spins XOR update pattern = new spins (written back as uint4)
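A hedged sketch of the update scheme for one chunk: the spins are processed one after another inside the chunk, accepted flips are collected in an update pattern, and a single XOR produces the new chunk. The neighbor sums are passed in as a plain list (a hypothetical input here; on the lattice they would come from neighboring chunks), and the bit convention (bit set = spin up) is an assumption.

```python
import math
import random

CHUNK_BITS = 128

def update_chunk(old_bits, neighbor_sums, beta, rng):
    """Return (new_bits, pattern) for one 128-spin chunk."""
    pattern = 0
    for i in range(CHUNK_BITS):
        s = 1 if (old_bits >> i) & 1 else -1
        dE = 2 * s * neighbor_sums[i]  # energy change for flipping spin i
        if dE <= 0 or rng.random() < math.exp(-beta * dE):
            pattern |= 1 << i          # mark spin i for flipping
    return old_bits ^ pattern, pattern  # old spins XOR update pattern = new spins

rng = random.Random(3)
all_up = (1 << CHUNK_BITS) - 1

# All neighbors down: every flip lowers the energy and is accepted.
flipped, pattern_a = update_chunk(all_up, [-4] * CHUNK_BITS, beta=0.5, rng=rng)

# All neighbors up at very low temperature: flips are practically never accepted.
kept, pattern_b = update_chunk(all_up, [4] * CHUNK_BITS, beta=50.0, rng=rng)
```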

Multispin Coding?

• Multiple spins are coded in one memory unit (128 spins in 128 bit)

• Computation is not done on encoded spins in parallel but serially in each chunk

• Multispin coding algorithms designed for CPUs were not efficient on GPU

Why?

Multispin Coding

Array of spins (1 bit = 1 spin)

In advance: pool of random patterns

MC step:

• Neighbors (bitwise)

• Judgement function (for each energy level)

• Select one pattern randomly from the pool

• Construct update pattern

• Spins XOR update pattern = spins for next step

Downsides of Pooling

• Impairs quality of simulation (the smaller the pool, the less random)

• Low flexibility (external fields...)

• Relies on a lot of precomputation and random memory lookups (GPU killer)

Performance

Results from 2011, 2D Ising

GPU: NVIDIA Tesla S1070

CPU: Intel i7 (2.67 GHz, 1 core)

Bars compared: CPU simple, CPU multispin coding, GPU simple, GPU optimized

Speedup annotations: 8x (still one core!), ~20x, ~200x



Simulation on multiple GPUs

Spread spin lattice over many GPUs in different machines

Exchange border information between machines via MPI

Figure: simulation domains per GPU, border arrays

Multi-GPU Performance

Measure: Single spin flips per GPU

Communication overhead: bottleneck for small system sizes

• 64 GPUs: 256 GB video memory

• Enough for a lattice of 800,000 x 800,000 spins

• One lattice sweep: 3 seconds on pre-Fermi (S1070) hardware
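The numbers above can be checked with a short back-of-the-envelope calculation. It assumes one bit per spin (as in the 128-bit chunk storage) and 1 GB = 10^9 bytes. Buffers for borders and random number state are ignored, so the real memory footprint would be larger.

```python
n_gpus = 64
total_memory_gb = 256
memory_per_gpu_gb = total_memory_gb / n_gpus   # 4 GB per GPU

side = 800_000
n_spins = side * side                           # 6.4e11 spins
lattice_gb = n_spins / 8 / 1e9                  # assumption: 1 bit per spin

sweep_seconds = 3.0
flips_per_second = n_spins / sweep_seconds      # whole cluster
flips_per_gpu_per_ns = flips_per_second / n_gpus / 1e9

print(f"lattice needs about {lattice_gb:.0f} GB of {total_memory_gb} GB")
print(f"about {flips_per_gpu_per_ns:.1f} spin flips per ns and GPU")
```

Under these assumptions the bare lattice occupies about 80 GB, and one sweep in 3 seconds corresponds to roughly 3.3 spin flips per nanosecond and GPU.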

OpenCL?

Platform independence

Kernels

Idea: Hide language differences in macros

Macros expand to different expressions on each platform

• CUDA (Driver API)

• OpenCL

• Host C

Initialization

• Initialize

• Load “Device Programs” (Kernels) from source

• Create Data Containers that take care of data

Run kernel with parameters

Use data on host

Cross platform performance

CPU: i7 Nehalem

Nvidia: GeForce GTX 580

AMD: HD 6970

3D Ising

Example

Results

• Downside: Lowest common denominator (CUDA has a lot more features by now)

• No explicit copying needed (the containers' job)

• In our case: OpenCL was 10% slower on NVIDIA card (GeForce GTX 580)

• slower on comparable AMD card (Radeon HD 6970)

• Take this with a grain of salt

Nucleation

Nucleation phenomena

• Nucleation is important in materials research, the atmosphere, etc.

Nucleation

Phase 1 → Phase 2: induced by nuclei!

Phase 1: most spins up. Phase 2: most spins down.

Heterogeneous Nucleation

Wall-attached droplet = simulation in the Ising model

Winter D., Virnau P., Binder K., PRL Volume 103 Issue 22 (2009)

Young

Free Energy of Droplet

H = 0, Θ = 90°

Line Contribution

A different method...


Surface field H > 0, which tilts the interface

Antiperiodic boundary conditions force and stabilize an interface

Angle is limited by geometry...

Flatten geometry

Flattened geometry in dimension X allows for stronger tilt

Box dimensions: Lx, Ly, Lz

Boundary Condition Implementation

Simulate one extra chunk in each dimension

Periodic: Exchange borders

APBC: Read, XOR 1, Write
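A one-dimensional sketch of the two boundary treatments, assuming each chunk is a 128-bit integer and that "XOR 1" means flipping every spin bit of the copied border chunk. The layout with one ghost chunk at each end of a row and the function name are assumptions for illustration.

```python
MASK128 = (1 << 128) - 1

def fill_ghost_chunks(row, antiperiodic):
    """Fill the ghost chunks row[0] and row[-1] from the real chunks row[1:-1].

    Periodic: ghosts are plain copies of the opposite border.
    Antiperiodic (APBC): read the border, flip all spin bits via XOR, write.
    """
    flip = MASK128 if antiperiodic else 0
    row[0] = row[-2] ^ flip   # left ghost comes from the last real chunk
    row[-1] = row[1] ^ flip   # right ghost comes from the first real chunk
    return row

first, last = 0b1011, 0b0110
periodic_row = fill_ghost_chunks([0, first, last, 0], antiperiodic=False)
apbc_row = fill_ghost_chunks([0, first, last, 0], antiperiodic=True)
```

With flipped ghost spins, the two ends of the box favor opposite magnetization, which is what forces an interface into the system.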

Thermodynamic integration

• Vary box size in all dimensions

• Measure free energies of surfaces by integration over magnetization

• Expressions (1), (2), (3) can be derived for the free energy differences in each dimension

Young's Equation

• Combination of the first two expressions allows extraction of the line tension

• Expressions (1), (2), (3) can be combined into an expression for the line tension

Putting it together

Kim et al. (2011)

T = 3.0

Figure panels: side view, top view, density profile

3D system: 56 x 120 x 120 spins

Conclusion

• Direct method to measure line tension for tilted surfaces

• Our first real-world use of the Ising model on GPUs

• Optimization is important (CPU and GPU) for fair comparison

• Platform independence is possible (useful?)

• The Ising model is a good candidate for parallel processing on GPU clusters

Publications

• Monte Carlo Test of the Classical Theory for Heterogeneous Nucleation Barriers

Winter, D., Virnau, P., Binder, K., Phys. Rev. Lett. 103, 22 (2009)

• Multi-GPU Accelerated Multi-Spin Monte Carlo Simulations of the 2D Ising model

Block, B., Virnau, P., Preis, T., Computer Physics Communications, Volume 181, Issue 9 (2010)

• Monte Carlo Methods for Estimating Interfacial Free Energies and Line Tensions

Binder, K., Block, B., Das, S. K., Virnau, P., Winter, D., J. Stat. Phys. (2011)

• Platform independent, efficient implementation of the Ising model on parallel acceleration devices

Block, B. J., Eur. Phys. J. Spec. Top. (2012)
