Memory Access Patterns For Cellular Automata Using GPGPUs By: James M. Balasalle


Page 1: Memory Access Patterns For Cellular Automata Using GPGPUs

Memory Access Patterns For Cellular Automata

Using GPGPUs

By: James M. Balasalle

Page 2: Memory Access Patterns For Cellular Automata Using GPGPUs

Agenda

Background Information
Different Patterns and Techniques
Results
Case Study: Surface Water Flow
Conclusions
Questions

Page 3: Memory Access Patterns For Cellular Automata Using GPGPUs

Background Info: Parallel Processing

How is parallel processing related to Moore’s Law?

Supercomputers
Multicore CPUs
Interconnected, independent machines: clusters, MPI
Grid computing
GPUs

Page 4: Memory Access Patterns For Cellular Automata Using GPGPUs

Background Info: Cellular Automata

A cellular automaton (CA) is a discrete mathematical model used to calculate the global behavior of a complex system using (ideally) simple local rules.

Usually a grid-based model of states
Values are determined by local neighbors
Wide range of applications

Page 5: Memory Access Patterns For Cellular Automata Using GPGPUs

Background Info: Conway’s Game of Life

The Game of Life, showing several well-known patterns: crab, gliders, etc.

Page 6: Memory Access Patterns For Cellular Automata Using GPGPUs

Background Info: Conway's Game of Life

A cellular automaton
Each cell has two states: alive and dead
The next state is based on the surrounding 8 neighbors
Alive cell: 2 or 3 live neighbors means stay alive, else die
Dead cell: exactly 3 live neighbors means come alive, else stay dead
Simple rules lead to complex patterns (a minimal kernel sketch follows below)
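
As an illustration of how these rules map onto one GPU thread per cell, here is a minimal global-memory sketch of a Game of Life step in CUDA. It is not the thesis's code; names such as curr and next are illustrative.

    __global__ void lifeStep(const unsigned char* curr, unsigned char* next,
                             int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col >= width || row >= height) return;

        // Count live cells among the 8 surrounding neighbors, skipping
        // neighbors that fall outside the grid.
        int live = 0;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                if (dx == 0 && dy == 0) continue;
                int r = row + dy, c = col + dx;
                if (r >= 0 && r < height && c >= 0 && c < width)
                    live += curr[r * width + c];
            }

        // An alive cell survives with 2 or 3 live neighbors; a dead cell
        // comes alive with exactly 3 live neighbors.
        unsigned char alive = curr[row * width + col];
        next[row * width + col] = alive ? (live == 2 || live == 3) : (live == 3);
    }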

Page 7: Memory Access Patterns For Cellular Automata Using GPGPUs

Background Info: SIFT

Scale Invariant Feature Transform
Calculation of robust features in an image
Features can then be used to identify images or portions of an image
Widely used in Computer Vision applications

From: http://acmechimera.blogspot.com/2008/03/paper-review-distinctive-image-features.html

Page 8: Memory Access Patterns For Cellular Automata Using GPGPUs

Background Info: SIFT

SIFT is a pipeline of successive operations

Initial keypoint detection
Keypoint refinement, edge removal
Keypoint orientation calculation
Keypoint descriptor creation

Page 9: Memory Access Patterns For Cellular Automata Using GPGPUs

Background Info: SIFT

Focus is on Step 1: initial keypoint detection
Scale space creation: successive Gaussian blurring and downsampling
Difference of Gaussians, adjacent in scale space
Local extrema detection in the DoG
Resulting extrema are the initial candidate keypoints

Page 10: Memory Access Patterns For Cellular Automata Using GPGPUs

Nvidia GPUs

External coprocessor card, connected to the system bus
Manages its own DRAM store
Made up of one or more Streaming Multiprocessors (SMs)
Each SM contains:
  8 processing cores
  16 KB of on-chip cache / storage
  2 Special Function Units for transcendentals, etc.

Page 11: Memory Access Patterns For Cellular Automata Using GPGPUs

Nvidia GPUs

Memory regions:
Global memory: non-cached memory, similar to the CPU's RAM
Shared memory: user-managed, on-chip cache
Texture memory: alternative access path to global memory, with hardware-supported calculations
Constant memory: immutable, cached memory store

Page 12: Memory Access Patterns For Cellular Automata Using GPGPUs

Patterns and Techniques

Two broad categories:
Resource utilization
  Different memory regions
  Memory alignment and coalescence
  Maximizing bus usage
Overhead reduction
  Instruction reduction
  Arithmetic intensity

Page 13: Memory Access Patterns For Cellular Automata Using GPGPUs

Patterns and Techniques

Global memory
Conditional logic to handle boundary cells vs. a memory halo (see the halo sketch below)
The halo achieves an 18% speed increase
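
A minimal sketch of the halo variant, assuming the grid is allocated as (width + 2) x (height + 2) with a one-cell border of permanently dead cells; with the border in place, the interior kernel needs no per-neighbor boundary checks. The layout and names are illustrative, not taken from the thesis.

    __global__ void lifeStepHalo(const unsigned char* curr, unsigned char* next,
                                 int paddedWidth, int width, int height)
    {
        // The +1 shifts thread coordinates past the halo border.
        int col = blockIdx.x * blockDim.x + threadIdx.x + 1;
        int row = blockIdx.y * blockDim.y + threadIdx.y + 1;
        if (col > width || row > height) return;

        int idx = row * paddedWidth + col;
        // No conditional logic per neighbor: halo cells are always valid reads.
        int live = curr[idx - paddedWidth - 1] + curr[idx - paddedWidth] + curr[idx - paddedWidth + 1]
                 + curr[idx - 1]                                         + curr[idx + 1]
                 + curr[idx + paddedWidth - 1] + curr[idx + paddedWidth] + curr[idx + paddedWidth + 1];

        next[idx] = curr[idx] ? (live == 2 || live == 3) : (live == 3);
    }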

Page 14: Memory Access Patterns For Cellular Automata Using GPGPUs

Patterns and Techniques

Shared vs. global memory
Utilize the faster on-chip cache for frequently requested data (sketched below)
Shared memory is 30% faster
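
A minimal sketch of the shared-memory variant, assuming the halo-padded layout above and grid dimensions that are exact multiples of the tile size. Each block stages its tile, plus a one-cell border, into on-chip shared memory so neighbor reads hit the cache rather than global memory. TILE and the kernel name are illustrative.

    #define TILE 16

    __global__ void lifeStepShared(const unsigned char* curr, unsigned char* next,
                                   int paddedWidth)
    {
        __shared__ unsigned char tile[TILE + 2][TILE + 2];

        int col = blockIdx.x * TILE + threadIdx.x + 1;   // +1 skips the halo border
        int row = blockIdx.y * TILE + threadIdx.y + 1;
        int tx  = threadIdx.x + 1;
        int ty  = threadIdx.y + 1;

        // Stage the block's tile, plus its one-cell border, into shared memory.
        tile[ty][tx] = curr[row * paddedWidth + col];
        if (threadIdx.x == 0)        tile[ty][0]        = curr[row * paddedWidth + col - 1];
        if (threadIdx.x == TILE - 1) tile[ty][TILE + 1] = curr[row * paddedWidth + col + 1];
        if (threadIdx.y == 0)        tile[0][tx]        = curr[(row - 1) * paddedWidth + col];
        if (threadIdx.y == TILE - 1) tile[TILE + 1][tx] = curr[(row + 1) * paddedWidth + col];
        if (threadIdx.x == 0 && threadIdx.y == 0)
            tile[0][0] = curr[(row - 1) * paddedWidth + col - 1];
        if (threadIdx.x == TILE - 1 && threadIdx.y == 0)
            tile[0][TILE + 1] = curr[(row - 1) * paddedWidth + col + 1];
        if (threadIdx.x == 0 && threadIdx.y == TILE - 1)
            tile[TILE + 1][0] = curr[(row + 1) * paddedWidth + col - 1];
        if (threadIdx.x == TILE - 1 && threadIdx.y == TILE - 1)
            tile[TILE + 1][TILE + 1] = curr[(row + 1) * paddedWidth + col + 1];
        __syncthreads();

        // All neighbor reads now come from the on-chip cache.
        int live = tile[ty - 1][tx - 1] + tile[ty - 1][tx] + tile[ty - 1][tx + 1]
                 + tile[ty][tx - 1]                        + tile[ty][tx + 1]
                 + tile[ty + 1][tx - 1] + tile[ty + 1][tx] + tile[ty + 1][tx + 1];

        next[row * paddedWidth + col] = tile[ty][tx] ? (live == 2 || live == 3) : (live == 3);
    }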

Page 15: Memory Access Patterns For Cellular Automata Using GPGPUs

Patterns and Techniques

Aligned memory:
Align data on a 64- or 128-byte boundary
Achieved by padding each row (see the allocation sketch below)
For a half-warp, coalescence reduces the number of requests from 16 to 1 (or 2)
8% performance improvement
Could possibly require significant host CPU processing

Coalescence: when all memory access requests for a half-warp are aggregated into a single request.
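
One way to get this row padding without hand-computing the pitch is cudaMallocPitch, which rounds each row up to an alignment-friendly boundary. This is a hedged sketch of that approach; the dimensions are illustrative and this is not necessarily how the thesis allocates its grids.

    int width = 1024, height = 1024;   // grid dimensions (illustrative)
    size_t pitchBytes = 0;
    unsigned char* d_grid = NULL;

    // cudaMallocPitch pads every row so that each row starts on an aligned
    // boundary; the padded row width (in bytes) is returned in pitchBytes.
    cudaMallocPitch((void**)&d_grid, &pitchBytes, width * sizeof(unsigned char), height);

    // Kernels then index rows by the pitch rather than the logical width,
    // so the loads of a half-warp fall into one (or two) aligned segments:
    //   const unsigned char* rowPtr = d_grid + row * pitchBytes;
    //   unsigned char cell = rowPtr[col];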

Page 16: Memory Access Patterns For Cellular Automata Using GPGPUs

Patterns and Techniques

Memory region shape
The minimum bus transaction is 32 bytes, even for 4-byte requests
Some halo cells are unaligned; minimize these
16% faster

Page 17: Memory Access Patterns For Cellular Automata Using GPGPUs

Patterns and Techniques

Moving into overhead reduction and arithmetic intensity focused techniques

Index calculations, performed by every thread:

    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = row * width + col;

Approximately 15 total instructions to compute idx

For 1M data elements, 15,000,000 instructions devoted to index calculation

Page 18: Memory Access Patterns For Cellular Automata Using GPGPUs

Patterns and Techniques

Calculate 2 (or more) elements per thread (sketched below)
Calculate the first index using ~15 instructions
Calculate the second index, relative to the first, in a single add instruction
For 1M elements, 8,000,000 instructions; a 46% reduction
44% performance improvement over aligned memory
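
A minimal sketch of the two-elements-per-thread layout: each thread handles two horizontally adjacent cells, so the second index costs one add instead of a full recomputation. The Game of Life rule is reused here only for concreteness; the helper and names are illustrative.

    // Hypothetical per-cell rule (the Game of Life update from the earlier sketch).
    __device__ unsigned char lifeRule(const unsigned char* curr, int row, int col,
                                      int width, int height)
    {
        int live = 0;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                if (dx == 0 && dy == 0) continue;
                int r = row + dy, c = col + dx;
                if (r >= 0 && r < height && c >= 0 && c < width)
                    live += curr[r * width + c];
            }
        return curr[row * width + col] ? (live == 2 || live == 3) : (live == 3);
    }

    __global__ void lifeStepTwoPerThread(const unsigned char* curr, unsigned char* next,
                                         int width, int height)
    {
        // The launch uses half as many threads in x; each thread covers two columns.
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
        if (row >= height || col >= width) return;

        int idx  = row * width + col;   // full index calculation (~15 instructions)
        int idx2 = idx + 1;             // second element's index: a single add

        next[idx] = lifeRule(curr, row, col, width, height);
        if (col + 1 < width)
            next[idx2] = lifeRule(curr, row, col + 1, width, height);
    }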

Page 19: Memory Access Patterns For Cellular Automata Using GPGPUs

Patterns and Techniques

Arithmetic intensity: ratio of actual computation to memory loads and index calculations

Multiple elements per thread
Multi-generation implementations
Data packing / interleaving

Page 20: Memory Access Patterns For Cellular Automata Using GPGPUs

Patterns and Techniques

Multi-generational kernel
Compute 2 generations in a single kernel launch
Reduces total index calculations
Reduces total memory loads
Uses shared memory for temporary storage

Page 21: Memory Access Patterns For Cellular Automata Using GPGPUs

Patterns and Techniques

Multi-generational kernel
Results are poor
Instruction count is the limiting factor: index calculations!

Page 22: Memory Access Patterns For Cellular Automata Using GPGPUs

Patterns and Techniques

Multi-generational kernel thread allocations
One thread per effective element
  Results in many threads loading multiple elements
  And computing multiple elements for each generation
  Each load and computation requires index calculations
One thread for each element required to be loaded
  Not implemented; future work

Page 23: Memory Access Patterns For Cellular Automata Using GPGPUs

SIFT Results

Gaussian blur
Implemented as a non-separable convolution (see the sketch below)
Multiply a square matrix by each element and its neighbors
The square matrix is the result of the Gaussian function
Data elements are the pixel values of the image in question

The 2-element kernel is faster, by approximately 37%
The improvement is due to instruction reduction
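
A minimal sketch of the non-separable form: every output pixel is a weighted sum of its full square neighborhood, with the Gaussian weights held in constant memory. KERNEL_RADIUS, d_kernel, and the clamped border handling are illustrative choices, not the thesis's exact implementation.

    #define KERNEL_RADIUS 2                                    // 5x5 Gaussian, for illustration
    #define KERNEL_WIDTH  (2 * KERNEL_RADIUS + 1)
    __constant__ float d_kernel[KERNEL_WIDTH * KERNEL_WIDTH];  // Gaussian weights

    __global__ void gaussianBlur(const float* in, float* out, int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col >= width || row >= height) return;

        // Non-separable: the whole square neighborhood is multiplied and summed,
        // with reads clamped at the image border.
        float sum = 0.0f;
        for (int ky = -KERNEL_RADIUS; ky <= KERNEL_RADIUS; ++ky)
            for (int kx = -KERNEL_RADIUS; kx <= KERNEL_RADIUS; ++kx) {
                int r = min(max(row + ky, 0), height - 1);
                int c = min(max(col + kx, 0), width - 1);
                float w = d_kernel[(ky + KERNEL_RADIUS) * KERNEL_WIDTH + (kx + KERNEL_RADIUS)];
                sum += w * in[r * width + c];
            }
        out[row * width + col] = sum;
    }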

Page 24: Memory Access Patterns For Cellular Automata Using GPGPUs

SIFT Results

Difference of Gaussians
Simply subtract the results of the blurring kernel
The kernel is extremely simple: more index calculations than effective operations (see the sketch below)
The kernel utilizes data packing
Too simple to measure
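
The imbalance between index calculations and effective operations is easy to see in a sketch of such a subtraction kernel (illustrative, without the data packing mentioned above):

    __global__ void differenceOfGaussians(const float* blurA, const float* blurB,
                                          float* dog, int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col >= width || row >= height) return;

        int idx = row * width + col;         // the index arithmetic dominates...
        dog[idx] = blurA[idx] - blurB[idx];  // ...this single effective subtraction
    }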

Page 25: Memory Access Patterns For Cellular Automata Using GPGPUs

SIFT Results

Extrema detection
Each element compares itself to its neighbors (sketched below)
Minimum and maximum values are extrema
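
A minimal sketch of that comparison, simplified to a single 2D DoG level and its 8 in-plane neighbors (full SIFT also compares against the adjacent scales); the flags output buffer is illustrative.

    __global__ void detectExtrema(const float* dog, unsigned char* flags,
                                  int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        // Built-in bounds check: skip the outermost ring of pixels.
        if (col < 1 || col >= width - 1 || row < 1 || row >= height - 1) return;

        float v = dog[row * width + col];
        bool isMin = true, isMax = true;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                if (dx == 0 && dy == 0) continue;
                float n = dog[(row + dy) * width + (col + dx)];
                isMin = isMin && (v < n);
                isMax = isMax && (v > n);
            }
        flags[row * width + col] = (isMin || isMax) ? 1 : 0;
    }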

Page 26: Memory Access Patterns For Cellular Automata Using GPGPUs

SIFT Results

Extrema detection
The 2-element kernel is fastest
The rectangular kernel is not effective, since the algorithm has built-in bounds checking

Page 27: Memory Access Patterns For Cellular Automata Using GPGPUs

Case Study: Surface Water Flow

Based on the master's thesis by Jay Parsons
Using a digital elevation map, determine the amount and location of water during and after a rain event
Built upon a CA model that uses the elevation difference between cells to determine where water flows

Page 28: Memory Access Patterns For Cellular Automata Using GPGPUs

Case Study: Surface Water Flow

Sample output

Page 29: Memory Access Patterns For Cellular Automata Using GPGPUs

Case Study: Surface Water Flow

Initial steps:
Port from Java to C++
Gain understanding
Create a baseline implementation for timing comparisons
Initial GPU implementation
Application of techniques

Page 30: Memory Access Patterns For Cellular Automata Using GPGPUs

Case Study: Surface Water Flow

Problem:
During the processing of one cell, the state values of its neighbors were updated
A design decision to make the calculation of incoming water easier
Complicates the CA implementation
Push vs. pull methods

Page 31: Memory Access Patterns For Cellular Automata Using GPGPUs

Case Study: Surface Water Flow

Modify the implementation, simplify the CA rules
New value is: current value - outgoing volume + incoming volume (see the sketch below)
Incoming volume is more difficult to calculate
Dramatic improvement: 3.6x speedup
Reduced instruction count
Better usage of shared memory
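
A schematic pull-style sketch of that update: each cell recomputes its own outflow and pulls in the portion of each neighbor's outflow directed back at it. The outflowTo helper below is a hypothetical stand-in for the thesis's elevation-based distribution rule, not the actual rule.

    // Hypothetical stand-in for the elevation-based rule: send a fraction of the
    // water proportional to the downhill head difference toward the neighbor.
    __device__ float outflowTo(const float* water, const float* elevation, int from, int to)
    {
        float head = (elevation[from] + water[from]) - (elevation[to] + water[to]);
        return head > 0.0f ? 0.125f * fminf(water[from], head) : 0.0f;
    }

    __global__ void waterStep(const float* water, const float* elevation,
                              float* nextWater, int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col >= width || row >= height) return;
        int idx = row * width + col;

        float outgoing = 0.0f, incoming = 0.0f;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                if (dx == 0 && dy == 0) continue;
                int r = row + dy, c = col + dx;
                if (r < 0 || r >= height || c < 0 || c >= width) continue;
                int n = r * width + c;
                outgoing += outflowTo(water, elevation, idx, n);  // what this cell sends out
                incoming += outflowTo(water, elevation, n, idx);  // what neighbors send here
            }

        // New value = current value - outgoing volume + incoming volume
        nextWater[idx] = water[idx] - outgoing + incoming;
    }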

Page 32: Memory Access Patterns For Cellular Automata Using GPGPUs

Case Study: Surface Water Flow

Page 33: Memory Access Patterns For Cellular Automata Using GPGPUs

Recap

What worked
  Shared memory, memory alignment, 2-element processing, rectangular regions
What didn't work
  Multi-generation kernels; more investigation needed
Future work
  Data packing
  Texture memory

Page 34: Memory Access Patterns For Cellular Automata Using GPGPUs

Observations

Balance between instruction-bound and memory-bound

Strict CA rules help performance and implementation

Powerful analysis tools are required
Compromises:
  Shared memory
  2-element processing
  Rectangular regions

Page 35: Memory Access Patterns For Cellular Automata Using GPGPUs

Conclusion

GPUs are a great platform for cellular automata

Also well-suited to other problems that exhibit spatial locality
The techniques presented have a real, measurable impact
Straightforward implementation
Applicable to a wide range of problems
A worthwhile area of research

Page 36: Memory Access Patterns For Cellular Automata Using GPGPUs

Questions??