Memory Access Patterns for Cellular Automata Using GPGPUs
By: James M. Balasalle
Agenda
- Background Information
- Different Patterns and Techniques
- Results
- Case Study: Surface Water Flow
- Conclusions
- Questions
Background Info: Parallel Processing
How is parallel processing related to Moore's Law?
- Supercomputers
- Multicore CPUs
- Interconnected, independent machines: clusters, MPI
- Grid computing
- GPUs
Background Info: Cellular Automata
A cellular automaton (CA) is a discrete mathematical model used to calculate the global behavior of a complex system using (ideally) simple local rules.
- Usually a grid-based model of states
- Values are determined by local neighbors
- Wide range of applications
Background Info: Conway’s Game of Life
The Game of Life, showing several well-known patterns: crab, gliders, etc.
Background Info: Conway's Game of Life
- Cellular automaton; each cell has two states: alive and dead
- Next state is based on the surrounding 8 neighbors
- Alive cell: with 2 or 3 live neighbors it stays alive, else it dies
- Dead cell: with exactly 3 live neighbors it comes alive, else it stays dead
- Simple rules lead to complex patterns
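The rules above can be sketched as a plain CPU reference implementation; this is a minimal illustration (toroidal wrap-around and the function name `step` are assumptions, not from the deck), not the thesis's GPU kernel.

```cpp
#include <vector>

// One Game of Life generation on a toroidal grid (CPU reference sketch).
// Cells hold 0 (dead) or 1 (alive); the rules are exactly those on the slide.
std::vector<int> step(const std::vector<int>& grid, int w, int h) {
    std::vector<int> next(grid.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int live = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0) continue;
                    int nx = (x + dx + w) % w, ny = (y + dy + h) % h;
                    live += grid[ny * w + nx];
                }
            int alive = grid[y * w + x];
            // Alive: survive on 2 or 3 live neighbors; dead: born on exactly 3.
            next[y * w + x] = alive ? (live == 2 || live == 3) : (live == 3);
        }
    }
    return next;
}
```

A horizontal "blinker" of three live cells becomes vertical after one step, which makes a handy sanity check for any GPU port.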
Background Info: SIFT
Scale Invariant Feature Transform
- Calculation of robust features in an image
- Features can then be used to identify images or portions of an image
- Widely used in computer vision applications
From: http://acmechimera.blogspot.com/2008/03/paper-review-distinctive-image-features.html
Background Info: SIFT
SIFT is a pipeline of successive operations:
1. Initial keypoint detection
2. Keypoint refinement, edge removal
3. Keypoint orientation calculation
4. Keypoint descriptor creation
Background Info: SIFT
Focus is on Step 1: initial keypoint detection
- Scale space creation: successive Gaussian blurring and downsampling
- Difference of Gaussians (DoG), adjacent in scale space
- Local extrema detection in the DoG
- Resulting extrema are initial candidate keypoints
Nvidia GPUs
- External coprocessor card, connected to the system bus
- Manages its own DRAM store
- Made up of one or more Streaming Multiprocessors (SMs)
- Each SM contains:
  - 8 processing cores
  - 16 KB of on-chip cache / storage
  - 2 Special Function Units for transcendentals, etc.
Nvidia GPUs
Memory regions:
- Global memory: non-cached memory, similar to RAM for the CPU
- Shared memory: user-managed, on-chip cache
- Texture memory: alternative access path to global memory, with hardware calculations supported
- Constant memory: immutable cached memory store
Patterns and Techniques
Two broad categories:
- Resource utilization
  - Different memory regions
  - Memory alignment and coalescence
  - Maximizing bus usage
- Overhead reduction
  - Instruction reduction
  - Arithmetic intensity
Patterns and Techniques
Global memory
- Conditional logic to handle boundary cells vs. a memory halo
- The halo achieves an 18% speed increase
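The halo idea can be sketched on the host: pad the grid with a one-cell border of zeros so every interior cell has all 8 neighbors in memory and the kernel needs no boundary conditionals. This is an illustrative sketch only; the helper names (`halo_index`, `with_halo`) are mine, not the thesis's.

```cpp
#include <vector>

// (x, y) are interior coordinates in [0, w) x [0, h); the stored row is
// w + 2 cells wide, and row 0 / column 0 belong to the zero-filled halo.
int halo_index(int x, int y, int w) {
    return (y + 1) * (w + 2) + (x + 1);
}

// Copy a w x h grid into a (w + 2) x (h + 2) buffer with a halo of zeros.
std::vector<int> with_halo(const std::vector<int>& grid, int w, int h) {
    std::vector<int> padded((w + 2) * (h + 2), 0);  // halo cells stay 0
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            padded[halo_index(x, y, w)] = grid[y * w + x];
    return padded;
}
```

With the halo in place, a neighbor read like `padded[halo_index(x - 1, y - 1, w)]` is always in bounds, trading a little extra memory for branch-free kernel code.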
Patterns and Techniques
Shared vs. global memory
- Utilize the faster on-chip cache for frequently requested data
- Shared memory is 30% faster
Patterns and Techniques
Aligned memory:
- Align data on a 64- or 128-byte boundary, achieved by padding each row
- For a half-warp, coalescence reduces the number of requests from 16 to 1 (or 2)
- 8% performance improvement
- Could require significant host CPU processing
Coalescence: when all memory access requests for a half-warp are aggregated into a single request.
Patterns and Techniques
Memory region shape
- The minimum bus transaction is 32 bytes, even for 4-byte requests
- Some halo cells are unaligned; minimizing these is 16% faster
Patterns and Techniques
Moving into overhead reduction and arithmetic-intensity-focused techniques.
Index calculations, performed by every thread:
unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
int idx = row * width + col;
- Approximately 15 total instructions to compute idx
- For 1M data elements, 15,000,000 instructions are devoted to index calculation
Patterns and Techniques
- Calculate 2 (or more) elements per thread
- Calculate the first index using ~15 instructions
- Calculate the second index, relative to the first, in a single add instruction
- For 1M elements, 8,000,000 instructions: a 46% reduction
- 44% performance improvement over aligned memory
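The two-elements-per-thread indexing can be sketched on the CPU; the parameter names stand in for CUDA's `blockIdx`/`threadIdx`/`blockDim`, and the choice of offset (one block-width over) is an assumption for illustration.

```cpp
struct Pair { int first; int second; };

// Full index calculation (~15 instructions on the GPU) for element one,
// then a single add for element two, e.g. the cell one block-width to the
// right. block_* / thread_* mirror CUDA's built-in thread coordinates.
Pair two_element_indices(int block_y, int block_x, int thread_y, int thread_x,
                         int block_dim_y, int block_dim_x, int width) {
    int row = block_y * block_dim_y + thread_y;
    int col = block_x * block_dim_x + thread_x;
    int idx = row * width + col;     // full calculation, done once
    int idx2 = idx + block_dim_x;    // second element: one add instruction
    return {idx, idx2};
}
```

Amortizing the expensive calculation over two elements is where the quoted 46% instruction reduction comes from.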
Patterns and Techniques
Arithmetic intensity: the ratio of actual computation to memory loads and index calculations
- Multiple elements per thread
- Multi-generation implementations
- Data packing / interleaving
Patterns and Techniques
Multi-generational kernel
- Compute 2 generations in a single kernel launch
- Reduces total index calculations
- Reduces total memory loads
- Uses shared memory for temporary storage
Patterns and Techniques
Multi-generational kernel
- Results are poor
- The instruction count is the limiting factor: index calculations!
Patterns and Techniques
Multi-generational kernel thread allocations
- One thread per effective element
  - Results in many threads loading multiple elements, and computing multiple elements for each generation
  - Each load and computation requires index calculations
- One thread for each element required to be loaded
  - Not implemented; future work
SIFT Results
Gaussian blur
- Implemented as a non-separable convolution: multiply a square matrix by each element and its neighbors
- The square matrix is the result of the Gaussian function
- The data elements are the pixel values of the image in question
- The 2-element kernel is faster, by approximately 37%
- Improvement is due to instruction reduction
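The non-separable convolution described above can be sketched as a CPU reference: each output pixel is the weighted sum of its k x k neighborhood. Border handling (clamping here) and the function name are illustrative assumptions; the weights would come from sampling the 2D Gaussian.

```cpp
#include <vector>

// Naive 2D convolution of a w x h image with a k x k kernel (k odd),
// clamping coordinates at the image borders.
std::vector<float> convolve2d(const std::vector<float>& img, int w, int h,
                              const std::vector<float>& kern, int k) {
    int r = k / 2;
    std::vector<float> out(img.size(), 0.0f);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx) {
                    int sy = y + ky - r, sx = x + kx - r;
                    if (sy < 0) sy = 0; if (sy >= h) sy = h - 1;  // clamp edges
                    if (sx < 0) sx = 0; if (sx >= w) sx = w - 1;
                    acc += kern[ky * k + kx] * img[sy * w + sx];
                }
            out[y * w + x] = acc;
        }
    return out;
}
```

The four nested loads per output pixel make clear why this stage rewards the shared-memory and multi-element techniques discussed earlier.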
SIFT Results
Difference of Gaussians
- Simply subtract the results of the blurring kernel
- The kernel is extremely simple: more index calculations than effective operations
- The kernel utilizes data packing
- Too simple to measure
SIFT Results
Extrema detection
- Each element compares itself to its neighbors
- Minimum and maximum values are extrema
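The comparison can be sketched as follows. Note this is a simplified 2D, 8-neighbor version for illustration; full SIFT compares each DoG sample against 26 neighbors across three adjacent scales. The function name and interior-only contract are assumptions.

```cpp
#include <vector>

// A DoG sample is a candidate keypoint if it is strictly smaller or strictly
// larger than all 8 of its in-plane neighbors. Caller guarantees (x, y) is an
// interior cell, matching the slide's note that bounds checking is built in.
bool is_extremum(const std::vector<float>& dog, int w, int x, int y) {
    float v = dog[y * w + x];
    bool isMin = true, isMax = true;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            float n = dog[(y + dy) * w + (x + dx)];
            if (n <= v) isMin = false;
            if (n >= v) isMax = false;
        }
    return isMin || isMax;
}
```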
SIFT Results
Extrema detection
- The 2-element kernel is fastest
- The rectangular kernel is not effective, since the algorithm has built-in bounds checking
Case Study: Surface Water Flow
- Based on the Master's thesis by Jay Parsons
- Using a digital elevation map, determine the amount and location of water during and after a rain event
- Built upon a CA model that uses the elevation difference between cells to determine where water flows
Case Study: Surface Water Flow
Sample output
Case Study: Surface Water Flow
Initial steps:
- Port from Java to C++; gain understanding
- Create a baseline implementation for timing comparisons
- Initial GPU implementation
- Application of techniques
Case Study: Surface Water Flow
Problem: during the processing of one cell, state values of its neighbors were updated
- A design decision to make calculation of incoming water easier
- Complicates the CA implementation
- Push vs. pull methods
Case Study: Surface Water Flow
Modify the implementation to simplify the CA rules
- New value is: current value - outgoing volume + incoming volume
- Incoming volume is more difficult to calculate
- Dramatic improvement: 3.6x speedup
  - Reduced instruction count
  - Better usage of shared memory
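The pull-style update above can be sketched on a 1D strip of cells: each cell reads its neighbor's outflow toward it and applies new = current - outgoing + incoming. The specific flow rule here (move half the surface-height difference downhill, capped by available water) is a stand-in of mine, not the thesis's actual CA rule.

```cpp
#include <vector>

// Pull-method water update over a 1D strip: each cell i may send water to
// cell i+1 when its water surface (elevation + depth) is higher.
std::vector<double> pull_step(const std::vector<double>& water,
                              const std::vector<double>& elev) {
    int n = (int)water.size();
    // outflow_right(i): volume cell i sends to cell i+1 this step
    auto outflow_right = [&](int i) {
        if (i + 1 >= n) return 0.0;
        double drop = (elev[i] + water[i]) - (elev[i + 1] + water[i + 1]);
        double give = drop > 0 ? drop / 2 : 0.0;
        return give < water[i] ? give : water[i];  // can't send more than held
    };
    std::vector<double> next(n);
    for (int i = 0; i < n; ++i) {
        double outgoing = outflow_right(i);
        double incoming = (i > 0) ? outflow_right(i - 1) : 0.0;  // pull, not push
        next[i] = water[i] - outgoing + incoming;
    }
    return next;
}
```

Because each cell recomputes its neighbor's outflow rather than having neighbors write into it, no cell mutates another's state mid-generation, which is what makes the rule GPU-friendly.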
Case Study: Surface Water Flow
Recap
What worked
- Shared memory, memory alignment, 2-element processing, rectangular regions
What didn't work
- Multi-generation kernels: more investigation needed
Future work
- Data packing
- Texture memory
Observations
- Balance between instruction-bound and memory-bound
- Strict CA rules help performance and implementation
- Powerful analysis tools are required
- Compromises: shared memory, 2-element processing, rectangular regions
Conclusion
- GPUs are a great platform for cellular automata, and for other problems that exhibit spatial locality
- The techniques presented have a real, measurable impact
- Straightforward implementation
- Applicable to a wide range of problems
- A worthwhile area of research
Questions??