Memory Access Patterns for Cellular Automata Using GPGPUs
By: James M. Balasalle
Agenda
- Background Information
- Different Patterns and Techniques
- Results
- Case Study: Surface Water Flow
- Conclusions
- Questions
Background Info: Parallel Processing
How is parallel processing related to Moore's Law?
- Supercomputers
- Multicore CPUs
- Interconnected, independent machines: clusters, MPI
- Grid computing
- GPUs
Background Info: Cellular Automata
A cellular automaton (CA) is a discrete mathematical model used to calculate the global behavior of a complex system using (ideally) simple local rules.
- Usually a grid-based model of states
- Values are determined by local neighbors
- Wide range of applications
Background Info: Conway’s Game of Life
The Game of Life, showing several well-known patterns: crab, gliders, etc.
Background Info: Conway's Game of Life
- Cellular automaton; each cell has two states: alive and dead
- Next state is based on the surrounding 8 neighbors
- Alive cell: with 2 or 3 live neighbors it stays alive, else it dies
- Dead cell: with exactly 3 live neighbors it comes alive, else it stays dead
- Simple rules lead to complex patterns
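The rules above can be sketched as a plain CPU reference implementation; this is a minimal illustration (toroidal wrap-around and the function name `step` are assumptions, not from the deck), not the thesis's GPU kernel.

```cpp
#include <vector>

// One Game of Life generation on a toroidal grid (CPU reference sketch).
// Cells hold 0 (dead) or 1 (alive); the rules are exactly those on the slide.
std::vector<int> step(const std::vector<int>& grid, int w, int h) {
    std::vector<int> next(grid.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int live = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0) continue;
                    int nx = (x + dx + w) % w, ny = (y + dy + h) % h;
                    live += grid[ny * w + nx];
                }
            int alive = grid[y * w + x];
            // Alive: survive on 2 or 3 live neighbors; dead: born on exactly 3.
            next[y * w + x] = alive ? (live == 2 || live == 3) : (live == 3);
        }
    }
    return next;
}
```

A horizontal "blinker" of three live cells becomes vertical after one step, which makes a handy sanity check for any GPU port.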
Background Info: SIFT
Scale Invariant Feature Transform
- Calculation of robust features in an image
- Features can then be used to identify images or portions of an image
- Widely used in computer vision applications
From: http://acmechimera.blogspot.com/2008/03/paper-review-distinctive-image-features.html
Background Info: SIFT
SIFT is a pipeline of successive operations:
1. Initial keypoint detection
2. Keypoint refinement, edge removal
3. Keypoint orientation calculation
4. Keypoint descriptor creation
Background Info: SIFT
Focus is on Step 1: initial keypoint detection
- Scale space creation: successive Gaussian blurring and downsampling
- Difference of Gaussians (DoG), adjacent in scale space
- Local extrema detection in the DoG
- Resulting extrema are initial candidate keypoints
Nvidia GPUs
- External coprocessor card, connected to the system bus
- Manages its own DRAM store
- Made up of one or more Streaming Multiprocessors (SMs)
- Each SM contains:
  - 8 processing cores
  - 16 KB of on-chip cache / storage
  - 2 Special Function Units for transcendentals, etc.
Nvidia GPUs
Memory regions:
- Global memory: non-cached memory, similar to RAM for the CPU
- Shared memory: user-managed, on-chip cache
- Texture memory: alternative access path to global memory, with hardware calculations supported
- Constant memory: immutable cached memory store
Patterns and Techniques
Two broad categories:
- Resource utilization
  - Different memory regions
  - Memory alignment and coalescence
  - Maximizing bus usage
- Overhead reduction
  - Instruction reduction
  - Arithmetic intensity
Patterns and Techniques
Global memory
- Conditional logic to handle boundary cells vs. a memory halo
- The halo achieves an 18% speed increase
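The halo idea can be sketched on the host: pad the grid with a one-cell border of zeros so every interior cell has all 8 neighbors in memory and the kernel needs no boundary conditionals. This is an illustrative sketch only; the helper names (`halo_index`, `with_halo`) are mine, not the thesis's.

```cpp
#include <vector>

// (x, y) are interior coordinates in [0, w) x [0, h); the stored row is
// w + 2 cells wide, and row 0 / column 0 belong to the zero-filled halo.
int halo_index(int x, int y, int w) {
    return (y + 1) * (w + 2) + (x + 1);
}

// Copy a w x h grid into a (w + 2) x (h + 2) buffer with a halo of zeros.
std::vector<int> with_halo(const std::vector<int>& grid, int w, int h) {
    std::vector<int> padded((w + 2) * (h + 2), 0);  // halo cells stay 0
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            padded[halo_index(x, y, w)] = grid[y * w + x];
    return padded;
}
```

With the halo in place, a neighbor read like `padded[halo_index(x - 1, y - 1, w)]` is always in bounds, trading a little extra memory for branch-free kernel code.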
Patterns and Techniques
Shared vs. global memory
- Utilize the faster on-chip cache for frequently requested data
- Shared memory is 30% faster
Patterns and Techniques
Aligned memory:
- Align data on a 64- or 128-byte boundary, achieved by padding each row
- For a half-warp, coalescence reduces the number of requests from 16 to 1 (or 2)
- 8% performance improvement
- Could require significant host CPU processing
Coalescence: when all memory access requests for a half-warp are aggregated into a single request.
Patterns and Techniques
Memory region shape
- The minimum bus transaction is 32 bytes, even for 4-byte requests
- Some halo cells are unaligned; minimizing these is 16% faster
Patterns and Techniques
Moving into overhead reduction and arithmetic-intensity-focused techniques.
Index calculations, performed by every thread:
unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
int idx = row * width + col;
- Approximately 15 total instructions to compute idx
- For 1M data elements, 15,000,000 instructions are devoted to index calculation
Patterns and Techniques
- Calculate 2 (or more) elements per thread
- Calculate the first index using ~15 instructions
- Calculate the second index, relative to the first, in a single add instruction
- For 1M elements, 8,000,000 instructions: a 46% reduction
- 44% performance improvement over aligned memory
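The two-elements-per-thread indexing can be sketched on the CPU; the parameter names stand in for CUDA's `blockIdx`/`threadIdx`/`blockDim`, and the choice of offset (one block-width over) is an assumption for illustration.

```cpp
struct Pair { int first; int second; };

// Full index calculation (~15 instructions on the GPU) for element one,
// then a single add for element two, e.g. the cell one block-width to the
// right. block_* / thread_* mirror CUDA's built-in thread coordinates.
Pair two_element_indices(int block_y, int block_x, int thread_y, int thread_x,
                         int block_dim_y, int block_dim_x, int width) {
    int row = block_y * block_dim_y + thread_y;
    int col = block_x * block_dim_x + thread_x;
    int idx = row * width + col;     // full calculation, done once
    int idx2 = idx + block_dim_x;    // second element: one add instruction
    return {idx, idx2};
}
```

Amortizing the expensive calculation over two elements is where the quoted 46% instruction reduction comes from.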
Patterns and Techniques
Arithmetic intensity: the ratio of actual computation to memory loads and index calculations
- Multiple elements per thread
- Multi-generation implementations
- Data packing / interleaving
Patterns and Techniques
Multi-generational kernel
- Compute 2 generations in a single kernel launch
- Reduces total index calculations
- Reduces total memory loads
- Uses shared memory for temporary storage
Patterns and Techniques
Multi-generational kernel
- Results are poor
- The instruction count is the limiting factor: index calculations!
Patterns and Techniques
Multi-generational kernel thread allocations
- One thread per effective element
  - Results in many threads loading multiple elements, and computing multiple elements for each generation
  - Each load and computation requires index calculations
- One thread for each element required to be loaded
  - Not implemented; future work
SIFT Results
Gaussian blur
- Implemented as a non-separable convolution: multiply a square matrix by each element and its neighbors
- The square matrix is the result of the Gaussian function
- The data elements are the pixel values of the image in question
- The 2-element kernel is faster, by approximately 37%
- Improvement is due to instruction reduction
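The non-separable convolution described above can be sketched as a CPU reference: each output pixel is the weighted sum of its k x k neighborhood. Border handling (clamping here) and the function name are illustrative assumptions; the weights would come from sampling the 2D Gaussian.

```cpp
#include <vector>

// Naive 2D convolution of a w x h image with a k x k kernel (k odd),
// clamping coordinates at the image borders.
std::vector<float> convolve2d(const std::vector<float>& img, int w, int h,
                              const std::vector<float>& kern, int k) {
    int r = k / 2;
    std::vector<float> out(img.size(), 0.0f);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx) {
                    int sy = y + ky - r, sx = x + kx - r;
                    if (sy < 0) sy = 0; if (sy >= h) sy = h - 1;  // clamp edges
                    if (sx < 0) sx = 0; if (sx >= w) sx = w - 1;
                    acc += kern[ky * k + kx] * img[sy * w + sx];
                }
            out[y * w + x] = acc;
        }
    return out;
}
```

The four nested loads per output pixel make clear why this stage rewards the shared-memory and multi-element techniques discussed earlier.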
SIFT Results
Difference of Gaussians
- Simply subtract the results of the blurring kernel
- The kernel is extremely simple: more index calculations than effective operations
- The kernel utilizes data packing
- Too simple to measure
SIFT Results
Extrema detection
- Each element compares itself to its neighbors
- Minimum and maximum values are extrema
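The comparison can be sketched as follows. Note this is a simplified 2D, 8-neighbor version for illustration; full SIFT compares each DoG sample against 26 neighbors across three adjacent scales. The function name and interior-only contract are assumptions.

```cpp
#include <vector>

// A DoG sample is a candidate keypoint if it is strictly smaller or strictly
// larger than all 8 of its in-plane neighbors. Caller guarantees (x, y) is an
// interior cell, matching the slide's note that bounds checking is built in.
bool is_extremum(const std::vector<float>& dog, int w, int x, int y) {
    float v = dog[y * w + x];
    bool isMin = true, isMax = true;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            float n = dog[(y + dy) * w + (x + dx)];
            if (n <= v) isMin = false;
            if (n >= v) isMax = false;
        }
    return isMin || isMax;
}
```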
SIFT Results
Extrema detection
- The 2-element kernel is fastest
- The rectangular kernel is not effective, since the algorithm has built-in bounds checking
Case Study: Surface Water Flow
- Based on the Master's thesis by Jay Parsons
- Using a digital elevation map, determine the amount and location of water during and after a rain event
- Built upon a CA model that uses the elevation difference between cells to determine where water flows
Case Study: Surface Water Flow
Sample output
Case Study: Surface Water Flow
Initial steps:
- Port from Java to C++; gain understanding
- Create a baseline implementation for timing comparisons
- Initial GPU implementation
- Application of techniques
Case Study: Surface Water Flow
Problem: during the processing of one cell, state values of its neighbors were updated
- A design decision to make calculation of incoming water easier
- Complicates the CA implementation
- Push vs. pull methods
Case Study: Surface Water Flow
Modify the implementation to simplify the CA rules
- New value is: current value - outgoing volume + incoming volume
- Incoming volume is more difficult to calculate
- Dramatic improvement: 3.6x speedup
  - Reduced instruction count
  - Better usage of shared memory
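The pull-style update above can be sketched on a 1D strip of cells: each cell reads its neighbor's outflow toward it and applies new = current - outgoing + incoming. The specific flow rule here (move half the surface-height difference downhill, capped by available water) is a stand-in of mine, not the thesis's actual CA rule.

```cpp
#include <vector>

// Pull-method water update over a 1D strip: each cell i may send water to
// cell i+1 when its water surface (elevation + depth) is higher.
std::vector<double> pull_step(const std::vector<double>& water,
                              const std::vector<double>& elev) {
    int n = (int)water.size();
    // outflow_right(i): volume cell i sends to cell i+1 this step
    auto outflow_right = [&](int i) {
        if (i + 1 >= n) return 0.0;
        double drop = (elev[i] + water[i]) - (elev[i + 1] + water[i + 1]);
        double give = drop > 0 ? drop / 2 : 0.0;
        return give < water[i] ? give : water[i];  // can't send more than held
    };
    std::vector<double> next(n);
    for (int i = 0; i < n; ++i) {
        double outgoing = outflow_right(i);
        double incoming = (i > 0) ? outflow_right(i - 1) : 0.0;  // pull, not push
        next[i] = water[i] - outgoing + incoming;
    }
    return next;
}
```

Because each cell recomputes its neighbor's outflow rather than having neighbors write into it, no cell mutates another's state mid-generation, which is what makes the rule GPU-friendly.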
Case Study: Surface Water Flow
Recap
What worked
- Shared memory, memory alignment, 2-element processing, rectangular regions
What didn't work
- Multi-generation kernels: more investigation needed
Future work
- Data packing
- Texture memory
Observations
- Balance between instruction-bound and memory-bound
- Strict CA rules help performance and implementation
- Powerful analysis tools are required
- Compromises: shared memory, 2-element processing, rectangular regions
Conclusion
- GPUs are a great platform for cellular automata, and for other problems that exhibit spatial locality
- The techniques presented have a real, measurable impact
- Straightforward implementation
- Applicable to a wide range of problems
- A worthwhile area of research
Questions??