OPTIMIZING APPLICATION PERFORMANCE WITH …...Mayank Jain, 11 May 2017 OPTIMIZING APPLICATION...

Mayank Jain, 11 May 2017

OPTIMIZING APPLICATION PERFORMANCE WITH CUDA PROFILING TOOLS

AGENDA

• CUDA Profiling Tools

• Unified Memory Profiling

• NVLink Profiling

• PC Sampling

• MPI Profiling

• Multi-hop Remote Profiling

• Volta Support

• Other Improvements

CUDA PROFILING TOOLS

• NVIDIA® Visual Profiler

• nvprof *

• NVIDIA® Nsight™ Visual Studio Edition

* Android CUDA APK profiling not supported (yet)

3RD PARTY PROFILING TOOLS

TAU Performance System ® VampirTrace

PAPI CUDA Component HPC Toolkit

UNIFIED MEMORY PROFILING

UNIFIED MEMORY EVENTS

Events

SEGMENT MODE TIMELINE OPTIONS

Option to enable segment mode

Option to specify number of segments

SEGMENT MODE TIMELINE

Segment mode interval Heat map for CPU page faults

SWITCH TO NON-SEGMENT VIEW

Select settings view

Select this tab Load data within a specific time range

Uncheck

NON-SEGMENTED MODE TIMELINE

Time range

CPU PAGE FAULT SOURCE CORRELATION

Selected interval Source location

Source line causing CPU page fault

Unguided Analysis

Option to collect Unified Memory

information

Summary of all CPU page faults

NEW UNIFIED MEMORY EVENTSPage throttling, Memory thrashing, Remote map

Memory thrashing

Page throttling

Remote map

FILTER AND ANALYZE

1 Select unified memory in the unguided analysis section

2 Select required events and click on ‘Filter and Analyze’

Summary of filtered intervals

FILTER AND ANALYZEUnfiltered

FILTER AND ANALYZEFiltered

Filtered intervals

UNOPTIMIZED APPLICATION

Memory Thrashing

Analyze read access page faults

and thrashing

12.2 ms

Read access page faults

OPTIMIZATION

int threadsPerBlock = 256;int numBlocks = (length + threadsPerBlock – 1) / threadsPerBlock;

kernel<<< numBlocks , threadsPerBlock >>>(A, B, C, length);

int threadsPerBlock = 256;int numBlocks = (length + threadsPerBlock – 1) / threadsPerBlock;

cudaMemAdvise(A, size, cudaMemAdviseSetReadMostly, 0);cudaMemAdvise(B, size, cudaMemAdviseSetReadMostly, 0);

kernel<<< numBlocks , threadsPerBlock >>>(A, B, C, length);

OPTIMIZED APPLICATION

No DtoH Migrations and thrashing

2.9 ms

Speedup 4x (2.9 vs 12.2)

NVLINK PROFILING

NVLINK VISUALIZATIONVisual Profiler

Color codes for

NVLink

Topology

Static

properties

Runtime

values

Option to collect

NVLink information

Unguided Analysis

Achieved

throughputSelected

NVLink

DGX-1V NVLINK TOPOLOGY

NVLINK EVENTS ON TIMELINE

MemCpy API

NVLink Events on

Timeline

Color Coding of

NVLink Events

NVLINK ANALYSISStage I: Data Movement Over PCIe

216 milliseconds

NVLINK ANALYSISStage II: Data Movement Over NVLink

65 milliseconds

Under-utilized

NVLink

Minimal-Unused

NVLinks

Speedup 4x (216 vs 65)

NVLINK ANALYSISStage III: Data Movement Over NVLink with Streams

20 milliseconds

Changed Color based

on achieved

bandwidth

Changed color based

on selection

Speedup 3x (65 vs 20)

INSTRUCTION LEVEL PROFILING(PC SAMPLING)

PC SAMPLING

PC sampling feature is available for device with CC >= 5.2

Provides CPU PC sampling parity + additional information for warp states/stalls reasons for GPU kernels

Effective in optimizing large kernels, pinpoints performance bottlenecks at specific lines in source code or assembly instructions

Samples warp states periodically in round robin order over all active warps

No overheads in kernel runtime, CPU overheads to parse the records

PC SAMPLING UI

Pie chart for sample distribution for a CUDA function

Source-Assembly view

MPI PROFILING

$ mpirun -n 4 nvprof --process-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}" --context-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}" -o timeline.%q{OMPI_COMM_WORLD_RANK}.pdm ./simpleMPIRunning on 4 nodes==21977== NVPROF is profiling process 21977, command: ./simpleMPI==21983== NVPROF is profiling process 21983, command: ./simpleMPI==21979== NVPROF is profiling process 21979, command: ./simpleMPI==21982== NVPROF is profiling process 21982, command: ./simpleMPI<program output>==21982== Generated result file: timeline.0.pdm==21977== Generated result file: timeline.3.pdm==21983== Generated result file: timeline.1.pdm==21979== Generated result file: timeline.2.pdm

nvprof

MPI PROFILINGnvprof daemon mode

$ nvprof --profile-all-processes-o out.%h.%p.%q{OMPI_COMM_WORLD_RANK}<nvprof listens in daemon mode>

$ mpirun -n 4 ./simpleMPI

Shell 1 Shell 2

MPI PROFILING

Importing into the Visual Profiler

MPI PROFILINGVisual Profiler

MPI Rank-based naming

NVTX Markers & Ranges

MPI PROFILINGMPI + NVTX

nvtxEventAttributes_t range = {0};range.message.ascii = "MPI_Scatter";nvtxRangePushEx(range);int result = MPI_Scatter(...);nvtxRangePop();

Auto-generate mpi_interception.so

LD_PRELOAD=mpi_interception.so

Run your MPI app with nvprof.

MPI calls will be auto-annotated using NVTX.

Manual mode Interception mode

https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler/

MPI PROFILINGInterception

int res = MPI_Scatter(...);

MPI app Interception library(LD_PRELOAD)

int MPI_Scatter(...) { nvtxRangePushEx(range);int res = PMPI_Scatter(...);nvtxRangePop();return res;

MPI library

int PMPI_Scatter(...)

REMOTE PROFILING

NVVP: MULTI-HOP REMOTE PROFILING

Host Compute NodeLogin Node

Visual Profiler ScriptCUDA

Applicationssh

Login Node

Script

Connect Visual Profiler to the login node

Configure script on the login node

Use the custom script option

One-Time Setup

1 2 Application transparently runs on compute node and profiling data is displayed in the Visual Profiler

Select custom script, then create a remote session as usual

Application Profiling

VOLTA SUPPORT

GV100 Device

Attributes

GPU Trace

OTHER IMPROVEMENTS

Tracing and profiling of Cooperative Kernel launches is supported

The Visual Profiler supports remote profiling to systems supporting ssh algorithms

with a key length of 2048 bits

OpenACC profiling is now supported on systems without CUDA setup

nvprof flushes all profiling data when a SIGINT or SIGKILL signal is encountered

REFERENCES

NVIDIA toolkit documentation: http://docs.nvidia.com

CUDA Profiler User Guide: http://docs.nvidia.com/cuda/profiler-users-guide/index.html

Other GTC 2017 sessions:

S7824 - DEVELOPER TOOLS UPDATE IN CUDA 9

S7519 - DEVELOPER TOOLS FOR AUTOMOTIVE, DRONES AND INTELLIGENT CAMERAS APPLICATIONS

S7445 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING WHOLE APPLICATION PERFORMANCE

S7444 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS

OPTIMIZING APPLICATION PERFORMANCE WITH …...Mayank Jain, 11 May 2017 OPTIMIZING APPLICATION...

Documents

Optimizing Matrix Transpose in CUDA - Nvidiadeveloper.download.nvidia.com/.../transpose/doc/MatrixTranspose.pdf · Optimizing Matrix Transpose in CUDA 4 June 2010 sequence of CUDA

Code GPU with CUDA - Optimizing memory and control flow

Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology

Optimizing CUDA Shared Memory Usage - SC15sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/... · Optimizing CUDA Shared Memory Usage Shuang Gao ... through manual

CUDA Performance and Profiling

GPU PROFILING 기법을통한 DEEP LEARNING 성능 최적화기법소개 · 2019-07-22 · Interactive CUDA API debugging and kernel profiling Fast Data Collection Graphical and multiple

NVIDIA PROFILING TOOLS€¦ · OpenMP profiling Tracing support for CUDA kernels, memcpy and memset nodes launched by a CUDA Graph Support for version 3 NVIDIA Tools Extension API

Optimizing Reflow Profiling in Lead-free SMT Assemblykicthermal.com/.../2012/11/Optimizing-Reflow-Profiling-in-Lead-free... · Optimizing Reflow Profiling in Lead-free SMT Assembly

Optimizing Parallel Reduction in CUDA · 2009. 5. 8. · 2 Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves

SDAccel Environment Profiling and Optimization Guide...Optimizing Computational Parallelism Loop Parallelism and Task Parallelism Optimizing Computational Units Optimizing Memory Architecture

CUDA Flux: A Lightweight Instruction Profiler for CUDA ......Currently Available Tools for Profiling Hardware performance-counter based: nvprof • CUDA API trace • Light to heavy

Optimizing CUDA for GPU Architecture - Macalester College · 3.3 CUDA best practices ... When programming in CUDA C we work with blocks of ... Optimizing CUDA for GPU Architecture,

Profiling and optimizing go programs

Parallel&Programming& Paradigms&liacs.leidenuniv.nl/~rietveldkfd/courses/parco2015/Lecture_11.pdf · Profiling CUDA-GDB debugger NVIDIA Visual Profiler &&&&Development &&&&Environment

CUDA Without Cuda (CUDA Libraries) - Nvidiadeveloper.download.nvidia.com/CUDA/training/ntrotoCUDALibraries.pdf · CUDA Without Cuda (CUDA Libraries) GPU Computing Webinar 7/16/2011

Profiling & Optimizing Mobile Games on Android Devicestwvideo01.ubm-us.net/o1/vault/gdcchina14/presentations/833760_Remi... · Profiling & Optimizing Mobile Games on Android Devices

CUDA Profiling and Debugging - GitHub Pagescis565-fall-2013.github.io/lectures/09-25-CUDA-Profiling... · 2013. 12. 7. · Debugging and Profiling using CUDA Tools Memory Coalescing

Profiler User's Guide · 2019. 4. 29. · Preparing An Application For Profiling Profiler User's Guide Version 2019 | 2 To limit profiling to a region of your application, CUDA provides

Optimizing CUDA – Part II - Nvidia · 2009-08-04 · Optimizing CUDA – Part II. Outline Execution Configuration Optimizations Instruction Optimizations Multi-GPU ... Register

Optimizing and Profiling Unity Games for Mobile Platforms · 2016-12-08 · Optimizing and Profiling Unity Games for Mobile Platforms Angelo Theodorou Senior Software Engineer,