NVIDIA Visual Profiler & CUDA-MEMCHECK
Visual Profiler – Overview
• Included in CUDA Toolkit
• Visualize and optimize performance of a CUDA application
• Shows timeline on CPU and GPU
• nvvp (GUI)
• nvprof (Terminal)
• Two session types:
– Executable session
– Imported session (imports data generated by nvprof)
• Can generate a PDF report
Getting started
Timeline View
• CPU activity
• GPU activity
• Shows start & end of
– Threads
– Kernels
– Memcpy
– …
• Zoom, filter, reorder, …
Analysis View
• Guided or unguided
– For unguided analysis, compile with SET(LOCAL_CUDA_NVCC_FLAGS ${LOCAL_CUDA_NVCC_FLAGS} -lineinfo) so that device code carries source line information
• CUDA Application Analysis
– Application's overall GPU utilization
– Kernel performance (orders kernels by optimization importance, based on execution time and achieved occupancy)
• Performance-Critical Kernels
– Detailed analysis of a selected kernel
• Compute, bandwidth, or latency bound
• Instruction and memory latency
– Examine occupancy: the number of warps the kernel has active on the GPU, relative to the maximum number of warps the GPU supports
– Examine stall reasons: can give insight into why latency is still an issue for the kernel
• Compute resources: GPU compute resources can limit the performance of a kernel if they are insufficient or poorly utilized
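The occupancy definition above (active warps relative to the maximum number of warps) can be worked through with plain arithmetic. The following is a minimal host-side sketch, not the profiler's own calculation: the SmLimits struct, the function names, and the limits of 64 warps and 32 blocks per multiprocessor are illustrative assumptions, and register and shared-memory limits are deliberately ignored.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-multiprocessor limits (assumed values; the real ones
// depend on the GPU's compute capability).
struct SmLimits {
    int max_warps_per_sm;   // maximum resident warps per multiprocessor
    int max_blocks_per_sm;  // maximum resident blocks per multiprocessor
    int warp_size;          // threads per warp
};

// Number of warps one block occupies, rounded up to whole warps.
int warps_per_block(int threads_per_block, int warp_size) {
    assert(threads_per_block > 0 && warp_size > 0);
    return (threads_per_block + warp_size - 1) / warp_size;
}

// Theoretical occupancy = active warps / maximum warps.
// Only the warp and block limits are modeled here; registers and
// shared memory can lower the result further on real hardware.
double theoretical_occupancy(int threads_per_block, const SmLimits& sm) {
    const int wpb = warps_per_block(threads_per_block, sm.warp_size);
    const int blocks = std::min(sm.max_blocks_per_sm, sm.max_warps_per_sm / wpb);
    return static_cast<double>(blocks * wpb) / sm.max_warps_per_sm;
}
```

With the assumed limits SmLimits{64, 32, 32}, blocks of 32 threads reach an occupancy of 0.5 because the block limit is hit first, while blocks of 256 threads reach 1.0. The achieved occupancy reported by the profiler is measured at run time and may be lower than this theoretical value.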
CUDA-MEMCHECK
• Detects memory access errors
• Runtime error detection
• Included in CUDA Toolkit
• Getting started:
– cuda-memcheck [memcheck_options] executable [executable_options]
– Best case: no errors are reported (ERROR SUMMARY: 0 errors)
Supported error detection
• Memory access error: errors due to out-of-bounds or misaligned accesses to memory by a global, local, shared, or global atomic access
• Hardware exception: errors reported by the hardware error reporting mechanism
• Malloc/Free errors: errors due to incorrect use of malloc or free
• CUDA API errors: failure of a CUDA API call
• cudaMalloc memory leaks: allocations of device memory that have not been freed
• Device heap memory leaks: allocations of device memory made in device code that have not been freed
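To make the first category concrete, the sketch below models on the host what "out of bounds" and "misaligned" mean for a single access. It also counts how many threads of a rounded-up 1D launch would step past an n-element int array if the kernel had no `if (idx < n)` guard. This is an illustration under stated assumptions, not how CUDA-MEMCHECK is implemented: the names AccessError, classify_access, and unguarded_out_of_bounds_threads are hypothetical, the allocation is assumed to start at an aligned address, and the order of the two checks is arbitrary.

```cpp
#include <cassert>
#include <cstddef>

enum class AccessError { None, Misaligned, OutOfBounds };

// Classify one simulated access of `size` bytes at byte `offset` into an
// allocation of `alloc_bytes` bytes (base address assumed to be aligned).
AccessError classify_access(std::size_t offset, std::size_t size,
                            std::size_t alloc_bytes) {
    assert(size > 0);
    if (offset % size != 0) return AccessError::Misaligned;           // not a multiple of the access size
    if (offset + size > alloc_bytes) return AccessError::OutOfBounds; // reaches past the allocation
    return AccessError::None;
}

// Count the threads of a 1D launch that would access past an n-element
// int array when each thread touches element `idx` without a bounds guard.
int unguarded_out_of_bounds_threads(int n, int threads_per_block) {
    assert(n > 0 && threads_per_block > 0);
    const int blocks = (n + threads_per_block - 1) / threads_per_block;  // grid size rounded up
    const std::size_t alloc_bytes = static_cast<std::size_t>(n) * sizeof(int);
    int count = 0;
    for (int idx = 0; idx < blocks * threads_per_block; ++idx) {
        const std::size_t offset = static_cast<std::size_t>(idx) * sizeof(int);
        if (classify_access(offset, sizeof(int), alloc_bytes) == AccessError::OutOfBounds)
            ++count;
    }
    return count;
}
```

For n = 1000 and 256 threads per block, the launch is rounded up to 4 blocks (1024 threads), so 24 threads would access memory outside the array. Each such access is the kind of error the tool reports as a memory access error.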
Example
• __global__: for device global memory
• __shared__: for per-block shared memory
• __local__: for per-thread local memory
• Information about type of access (read / write)
• Size of access in bytes
• Source file and line number
• Thread indices and block indices
• Memory address being accessed and type of access error