
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

Yooseong Kim and Aviral Shrivastava
Compiler and Microarchitecture Laboratory, Arizona State University

DAC 2011


Outline

Introduction
Preliminaries
Motivating examples
CuMAPz approach
Experimental results and conclusions


Introduction

Currently, the computational power of Graphics Processing Units (GPUs) has reached the teraFLOP scale.

NVIDIA CUDA and OpenCL make GPGPU (General-Purpose computation on GPUs) programming much easier.

Performance is heavily affected by memory performance because of the large data sizes involved.


Introduction (cont.)

Shared memory is as fast as registers, and it is the only fast memory that supports both reads and writes.

Many factors affect memory performance: data reuse, global memory access coalescing, shared memory bank conflicts, and channel skew.

This work develops CuMAPz (CUDA Memory Access Pattern analyZer) to analyze the memory performance of CUDA programs.


Preliminaries

NVIDIA GPU architecture
Comparisons between CPU and GPU
CUDA programming
Memory coalescing
Execution of GPU threads


Architecture of the NVIDIA GTX 280

A collection of 30 multiprocessors, with 8 streaming processors each.

The 30 multiprocessors share one off-chip global memory. Access time: about 300 clock cycles

Each multiprocessor has an on-chip shared memory used by its 8 streaming processors. Access time: about 2 clock cycles


Architecture diagram


Some differences between GPU and CPU

GPU (NVIDIA GeForce 8800 GTX) vs. CPU (Intel Pentium 4):

Cores and clock rate: 128 cores, 575 MHz core clock / 1.35 GHz shader clock (GPU) vs. 1 core, 3.0 GHz (CPU)

FLOPS: 345.6 GFLOPS (GPU) vs. ~12 GFLOPS (CPU)

Memory bandwidth: 86.4 GB/s (900 MHz memory clock, 384-bit interface, 2 issues) (GPU) vs. 6.4 GB/s (800 MHz memory clock, 32-bit interface, 2 issues) (CPU)

Access time of global memory: slow, about 500 memory clock cycles (GPU) vs. fast, about 5 memory clock cycles (CPU)


Abstract comparison of CPU and GPU memory hierarchies (cont.)

CPU (Intel Pentium 4) → GPU (NVIDIA GeForce 8800 GTX)

Register → Register
Cache → Texture cache or constant cache
Main memory → Shared memory
Hard disk → Global memory, texture memory, constant memory


Memory coalescing

Several memory transactions can be coalesced into one transaction when consecutive threads access consecutive memory locations.

Because the access time of global memory is relatively long, achieving coalescing is important, as the sketch below illustrates.
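As a hedged illustration (not code from the paper), the two kernels below contrast a coalesced access pattern with a strided one; the kernel names, array names, and stride parameter are invented for this sketch.

// Illustrative sketch only: coalescedRead, stridedRead, and stride are
// hypothetical names, not taken from the paper.
__global__ void coalescedRead(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // consecutive threads read consecutive addresses,
                         // so the accesses coalesce into few transactions
}

__global__ void stridedRead(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];  // consecutive threads read addresses far
                                  // apart; the accesses cannot coalesce and
                                  // the hardware issues separate transactions
}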


CUDA programming

CUDA stands for Compute Unified Device Architecture.

The CPU code does the sequential part. The highly parallel part is usually implemented in GPU code, called a kernel.

Calling a GPU function from the CPU code is called a kernel launch.
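A minimal sketch of this host/device split, assuming a made-up kernel name (scale) and workload (neither is from the paper):

#include <cuda_runtime.h>

// Hypothetical kernel: each GPU thread scales one array element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    // Kernel launch: the CPU (host) code invokes the GPU (device) function.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}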


Execution of GPU thread

Threads are grouped into thread blocks. Each thread block is assigned to a streaming multiprocessor (SM), which contains multiple scalar processors (SPs), to be executed.

The actual execution of threads on SPs is done in groups of 32 threads, called warps.

SPs execute one warp at a time.


Motivating examples

What to fetch into shared memory?
How to access shared memory?
How to access global memory?


What to fetch into shared memory?

A simple program that does not use shared memory.
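The slide shows this program as a figure that is not reproduced in this transcript. The following is a hedged guess at its shape; the kernel name and computation are invented, and only the in[row*MAX+col] / in[row*MAX+col+1] access pattern follows the next slide.

// Hypothetical reconstruction, not the paper's code: each thread reads its
// neighbor directly from global memory, so overlapping locations are fetched
// repeatedly instead of being staged in shared memory.
__global__ void kernelNoShared(const float *in, float *out, int MAX) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < MAX && col + 1 < MAX)
        out[row * MAX + col] = in[row * MAX + col] + in[row * MAX + col + 1];
}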


What to fetch into shared memory? (cont.)

If we fetch the data at in[row*MAX+col+1] into shared memory…


What to fetch into shared memory? (cont.)

Generally, higher data reuse should imply better performance, but that may not be true here.

This counter-intuitive result is mainly caused by global memory access coalescing.


How to access shared memory?

In Figure 2, data is accessed in a column-wise manner, as shown at Lines 4, 9, 11, and 16.

What if we change to a row-wise manner (i.e., s_in[tIdx.y][tIdx.x]) or skew the access pattern (i.e., __shared__ float s_in[BLKDIM][BLKDIM+1])? The sketch below makes the two options concrete.
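A hedged sketch of the padding idea: s_in and BLKDIM follow the slide's names, but the kernel body (a tile transpose) is invented here to show both access directions; bounds checks are omitted and MAX is assumed to be a multiple of BLKDIM.

#define BLKDIM 16

__global__ void tileTranspose(const float *in, float *out, int MAX) {
    __shared__ float s_in[BLKDIM][BLKDIM + 1];  // +1 column skews the layout
    int col = blockIdx.x * BLKDIM + threadIdx.x;
    int row = blockIdx.y * BLKDIM + threadIdx.y;
    s_in[threadIdx.y][threadIdx.x] = in[row * MAX + col];  // row-wise access
    __syncthreads();
    int tcol = blockIdx.y * BLKDIM + threadIdx.x;
    int trow = blockIdx.x * BLKDIM + threadIdx.y;
    // Column-wise access: without the +1 padding, all threads of a half-warp
    // would hit the same bank here and serialize.
    out[trow * MAX + tcol] = s_in[threadIdx.x][threadIdx.y];
}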


How to access shared memory? (cont.)

Shared memory bank conflicts occur when there are multiple requests to different addresses in the same bank. In this case, the requests are serialized.
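To make this concrete for the hardware generation discussed here (compute capability 1.x, e.g. GTX 280 and Tesla C1060): shared memory has 16 banks, successive 32-bit words map to successive banks, and conflicts are resolved per half-warp, so a float element with word index $i$ lives in bank

$\mathrm{bank}(i) = i \bmod 16.$

For a 16x16 tile s_in[BLKDIM][BLKDIM], element s_in[tx][ty] has word index $16\,tx + ty$, so its bank is $ty$ for every $tx$: a column-wise half-warp access is a 16-way conflict. With the padded declaration s_in[BLKDIM][BLKDIM+1], the word index becomes $17\,tx + ty$ and the bank $(tx + ty) \bmod 16$, which is distinct for each $tx$: conflict-free.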


How to access global memory?

A programmer might have designed the global memory write reference at Line 18 in Figure 2 to be in a column-wise manner, as in out[col*MAX+row].

This causes an unexpected slowdown due to channel skew. Channel skew is the ratio of the number of concurrent accesses to the most-used channel to that of the least-used channel.


Previous works

[20] modeled the amount of parallelism employed in a program and the efficiency of a single kernel execution in a thread, but did not consider memory performance, and its analysis applies only to compute-intensive benchmarks.

[8] includes the effect of parallelism in hiding global memory access latency, but does not take branch divergence into account.

[14][15][16][17][18] automate the optimization of GPGPU applications.

None of the above works provides a comprehensive performance metric to estimate the efficiency of memory access patterns.


CuMAPz overview


Data Reuse Profit Estimation

CuMAPz maintains a counter that counts the number of times each shared memory buffer is accessed. The degree of data reuse is captured in a term, data reuse, as follows:
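The slide's equation is an image and is not reproduced in this transcript. A plausible form, consistent with the counter description above but offered only as an assumption, is

$\mathrm{DataReuse} = \sum_{b \in B} \frac{\mathrm{count}(b)}{|b|},$

where $\mathrm{count}(b)$ is the access counter of shared memory buffer $b$ and $|b|$ is its number of elements, i.e., the average number of times each buffered element is reused.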


Coalesced Access Profit Estimation

Because of coalescing, the actual transfer size that consumes bus bandwidth can differ from the size of the data requested by the threads. CuMAPz calculates the bandwidth utilization as follows:
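The equation itself appears as an image in the original; a natural reconstruction of "bandwidth utilization" from the sentence above (an assumption, not the paper's exact notation) is

$\mathrm{Coalesce} = \frac{\text{requested data size}}{\text{actual transfer size}},$

which equals 1 for perfectly coalesced accesses and falls toward 0 as transactions move unused bytes.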


Channel Skew Cost Estimation

Channel skew refers to the case where concurrent memory accesses are not evenly distributed across all the channels but are concentrated on only a few.

When a kernel is launched, thread blocks are assigned to SMs in sequential order, so adjacent blocks are executed on adjacent SMs. After the first round of scheduling, however, the assignment becomes unpredictable, since the order in which thread blocks finish execution cannot be determined [13].


Channel Skew Cost Estimation (cont.)

The impact of channel skew can be quantified as the skewness of the mapping of accesses to channels, calculated as follows:
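The formula is an image in the original, but the verbal definition given earlier (most-used channel over least-used channel) pins down its shape. With $A_c$ denoting the number of concurrent accesses mapped to channel $c$:

$\mathrm{ChannelSkew} = \frac{\max_c A_c}{\min_c A_c}$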


Bank Conflict Cost Estimation

Similarly to global memory channels, shared memory space is divided into multiple banks. Each bank can serve one address at a time.

The efficiency of shared memory accesses is modeled as follows:
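The original equation is an image. Since conflicting requests are serialized, one plausible stand-in (our assumption, not the paper's formula) is the reciprocal of the worst-case serialization degree per half-warp:

$\mathrm{BankConflict} = \frac{1}{\max_k C_k},$

where $C_k$ is the number of distinct addresses requested simultaneously in bank $k$; the term equals 1 when accesses are conflict-free.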


Branch Divergence Cost Estimation

Branches are introduced when there is an uncovered region that is not buffered into shared memory, as shown at Lines 6 and 13 in Figure 2.

When threads in a warp take different execution paths, all the paths are serialized.

We simply model the impact of branch divergence as follows:
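The equation is again an image; a simple model in the spirit of the sentence above (an assumption) is the fraction of warps whose threads diverge:

$\mathrm{Divergence} = \frac{|W_{div}|}{|W|},$

where $W$ is the set of all warps and $W_{div} \subseteq W$ contains the warps whose threads take different execution paths.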


Overall Memory Performance Estimation

The overall memory performance estimate is computed by combining the terms above, as follows:
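The combined formula is an image in the original. Purely as an illustrative shape, consistent with the profit/cost naming of the preceding terms but not the paper's published formula, the terms could be combined multiplicatively:

$\mathrm{Perf} \propto \frac{\mathrm{DataReuse} \times \mathrm{Coalesce} \times \mathrm{BankConflict}}{\mathrm{ChannelSkew} \times (1 + \mathrm{Divergence})}$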


Experimental results

Environment: C language, CUDA driver version 3.2, on an NVIDIA Tesla C1060.

Benchmarks are from the benchmark suites in [6] and from the CUDA SDK.


Two experiments

Validation: studying the correlation between our memory performance estimate and the measured performance of the benchmarks under different configurations.

Performance optimization: finding the best way to access shared and global memory using CuMAPz and the previous technique [8].


Validation


Performance optimization


Runtime Considerations

The time complexity of the CuMAPz analysis is O(|W|·|R|·|B|), where W, R, and B are the sets of all warps, global memory references, and shared memory buffers, respectively.


Limitations

Compile-time analysis: cannot handle any information that can only be determined at run time.

Assumes adequate occupancy: a measure of how many thread blocks can be scheduled on one SM so that the hardware is kept busy.


Conclusions

GPUs are a new platform for high-performance computing.

This work develops CuMAPz to analyze the memory performance of CUDA programs.

It considers many aspects, such as data reuse, coalescing, bank conflicts, channel skew, and branch divergence.

Experimental results show a very high correlation between actual execution times and the CuMAPz estimates.