OPTIMIZING APPLICATION PERFORMANCE WITH CUDA PROFILING TOOLS

Mayank Jain, 11 May 2017

AGENDA

• CUDA Profiling Tools

• Unified Memory Profiling

• NVLink Profiling

• PC Sampling

• MPI Profiling

• Multi-hop Remote Profiling

• Volta Support

• Other Improvements

CUDA PROFILING TOOLS

• NVIDIA® Visual Profiler

• nvprof *

• NVIDIA® Nsight™ Visual Studio Edition

* Android CUDA APK profiling not supported (yet)

3RD PARTY PROFILING TOOLS

• TAU Performance System®

• VampirTrace

• PAPI CUDA Component

• HPCToolkit

UNIFIED MEMORY PROFILING

UNIFIED MEMORY EVENTS

Events

SEGMENT MODE TIMELINE OPTIONS

Option to enable segment mode

Option to specify number of segments

SEGMENT MODE TIMELINE

Segment mode interval

Heat map for CPU page faults

SWITCH TO NON-SEGMENT VIEW

Select settings view

Select this tab

Load data within a specific time range

Uncheck

NON-SEGMENTED MODE TIMELINE

Time range

CPU PAGE FAULT SOURCE CORRELATION

Selected interval

Source location

CPU PAGE FAULT SOURCE CORRELATION

Source line causing CPU page fault

CPU PAGE FAULT SOURCE CORRELATION

Unguided Analysis

Option to collect Unified Memory information

Summary of all CPU page faults

NEW UNIFIED MEMORY EVENTS

Page throttling, Memory thrashing, Remote map

Memory thrashing

Page throttling

Remote map

FILTER AND ANALYZE

1. Select Unified Memory in the Unguided Analysis section

2. Select the required events and click ‘Filter and Analyze’

Summary of filtered intervals

FILTER AND ANALYZE

Unfiltered

FILTER AND ANALYZE

Filtered

Filtered intervals

UNOPTIMIZED APPLICATION

Memory Thrashing

Analyze read access page faults and thrashing

12.2 ms

Read access page faults

OPTIMIZATION

OLD

int threadsPerBlock = 256;
int numBlocks = (length + threadsPerBlock - 1) / threadsPerBlock;

kernel<<<numBlocks, threadsPerBlock>>>(A, B, C, length);

NEW

int threadsPerBlock = 256;
int numBlocks = (length + threadsPerBlock - 1) / threadsPerBlock;

// Mark the inputs as read-mostly so that processors reading them can keep
// read-only copies instead of migrating the pages back and forth.
cudaMemAdvise(A, size, cudaMemAdviseSetReadMostly, 0);
cudaMemAdvise(B, size, cudaMemAdviseSetReadMostly, 0);

kernel<<<numBlocks, threadsPerBlock>>>(A, B, C, length);
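The `numBlocks` expression in both versions is an integer ceiling division. The host-only C++ sketch below isolates it so the rounding behaviour can be checked without a GPU. The helper names `ceilDiv` and `idleThreads` are hypothetical and do not appear in the slides.

```cpp
#include <cassert>

// Hypothetical helper (not in the slides): the smallest number of blocks
// whose threads cover `length` elements, i.e. ceil(length / threadsPerBlock).
constexpr int ceilDiv(int length, int threadsPerBlock) {
    return (length + threadsPerBlock - 1) / threadsPerBlock;
}

// Number of surplus threads in the last block. A kernel launched with this
// grid size needs a bounds check such as `if (i < length)` to skip them.
constexpr int idleThreads(int length, int threadsPerBlock) {
    return ceilDiv(length, threadsPerBlock) * threadsPerBlock - length;
}

static_assert(ceilDiv(256, 256) == 1, "an exact multiple needs no extra block");
static_assert(ceilDiv(257, 256) == 2, "one extra element needs one extra block");
```

With 256 threads per block, 1000 elements need `ceilDiv(1000, 256)` blocks, and `idleThreads(1000, 256)` threads in the last block have no element to process. This is presumably why the kernel takes `length` as an argument.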

OPTIMIZED APPLICATION

No DtoH migrations or thrashing

2.9 ms

Speedup about 4x (12.2 ms vs 2.9 ms)

NVLINK PROFILING

NVLINK VISUALIZATION

Visual Profiler

Color codes for NVLink

Topology

Static properties

Runtime values

Option to collect NVLink information

Unguided Analysis

Achieved throughput

Selected NVLink

DGX-1V NVLINK TOPOLOGY

NVLINK EVENTS ON TIMELINE

MemCpy API

NVLink Events on Timeline

Color Coding of NVLink Events

NVLINK ANALYSIS

Stage I: Data Movement Over PCIe

216 milliseconds

NVLINK ANALYSIS

Stage II: Data Movement Over NVLink

65 milliseconds

Under-utilized NVLink

Minimal-Unused NVLinks

Speedup about 3.3x (216 ms vs 65 ms)

NVLINK ANALYSIS

Stage III: Data Movement Over NVLink with Streams

20 milliseconds

Changed color based on achieved bandwidth

Changed color based on selection

Speedup about 3x (65 ms vs 20 ms)
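The per-stage ratios can be recomputed directly from the three timings quoted on the NVLink analysis slides (216 ms, 65 ms, 20 ms). The C++ sketch below does that arithmetic. The constant and function names are illustrative only, and the cumulative figure is derived here rather than stated in the slides.

```cpp
#include <cassert>

// Speedup of an optimized run relative to a baseline, both in milliseconds.
constexpr double speedup(double baselineMs, double optimizedMs) {
    return baselineMs / optimizedMs;
}

// Timings quoted on the three NVLink analysis slides.
constexpr double kPcieMs = 216.0;          // Stage I: data movement over PCIe
constexpr double kNvlinkMs = 65.0;         // Stage II: data movement over NVLink
constexpr double kNvlinkStreamsMs = 20.0;  // Stage III: NVLink with streams

constexpr double kStage2 = speedup(kPcieMs, kNvlinkMs);           // about 3.3
constexpr double kStage3 = speedup(kNvlinkMs, kNvlinkStreamsMs);  // 3.25
constexpr double kOverall = speedup(kPcieMs, kNvlinkStreamsMs);   // 10.8
```

Taken together, the two optimizations reduce the data movement time by roughly a factor of 10.8 relative to the PCIe baseline.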

INSTRUCTION LEVEL PROFILING (PC SAMPLING)

PC SAMPLING

PC sampling is available on devices with compute capability (CC) 5.2 or higher

Provides parity with CPU PC sampling, plus warp state and stall reason information for GPU kernels

Effective for optimizing large kernels: pinpoints performance bottlenecks at specific source lines or assembly instructions

Samples warp states periodically, in round-robin order over all active warps

No overhead in kernel runtime; CPU overhead to parse the records
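The round-robin sampling described above can be pictured with a toy host-side model. This C++ sketch is an illustration, not the hardware or profiler implementation. The `WarpState` values, the one-warp-per-tick schedule, and the `stateAt` callback are all simplifying assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical, simplified warp states; the real profiler reports more stall reasons.
enum class WarpState { Issued, MemoryStall, SyncStall, Count };

using Histogram = std::array<std::size_t, static_cast<std::size_t>(WarpState::Count)>;

// Toy sampler: on every tick it inspects exactly one active warp, visiting
// the warps in round-robin order, and tallies the state it observes.
// `stateAt(warp, tick)` is a stand-in for reading the hardware warp state.
template <typename StateFn>
Histogram sampleRoundRobin(std::size_t numWarps, std::size_t numTicks, StateFn stateAt) {
    assert(numWarps > 0);
    Histogram histogram{};  // all counters start at zero
    for (std::size_t tick = 0; tick < numTicks; ++tick) {
        std::size_t warp = tick % numWarps;  // round-robin selection
        WarpState state = stateAt(warp, tick);
        ++histogram[static_cast<std::size_t>(state)];
    }
    return histogram;
}

// Example state function: even-numbered warps are stalled on memory,
// odd-numbered warps are issuing instructions.
inline WarpState exampleState(std::size_t warp, std::size_t /*tick*/) {
    return (warp % 2 == 0) ? WarpState::MemoryStall : WarpState::Issued;
}
```

Calling `sampleRoundRobin(4, 8, exampleState)` visits each of the four warps twice, so the histogram splits evenly between `MemoryStall` and `Issued`. A per-kernel sample distribution like this is the kind of data a pie chart of stall reasons would summarize.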

PC SAMPLING UI

Pie chart for sample distribution for a CUDA function

Source-Assembly view

MPI PROFILING

nvprof

$ mpirun -n 4 nvprof \
    --process-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}" \
    --context-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}" \
    -o timeline.%q{OMPI_COMM_WORLD_RANK}.pdm ./simpleMPI
Running on 4 nodes
==21977== NVPROF is profiling process 21977, command: ./simpleMPI
==21983== NVPROF is profiling process 21983, command: ./simpleMPI
==21979== NVPROF is profiling process 21979, command: ./simpleMPI
==21982== NVPROF is profiling process 21982, command: ./simpleMPI
<program output>
==21982== Generated result file: timeline.0.pdm
==21977== Generated result file: timeline.3.pdm
==21983== Generated result file: timeline.1.pdm
==21979== Generated result file: timeline.2.pdm
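In the command above, each `%q{OMPI_COMM_WORLD_RANK}` is replaced by the value of that environment variable in the profiled process, which is how every rank gets its own name and output file. The C++ sketch below re-implements that substitution purely for illustration. nvprof does this internally, and the names `expandEnvPattern` and `fromEnvironment`, as well as the handling of unset variables and unterminated placeholders, are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <string>

// Hypothetical re-implementation for illustration only. Replaces every
// "%q{NAME}" in `pattern` with lookup(NAME). An unterminated "%q{" and
// everything after it is copied verbatim.
std::string expandEnvPattern(const std::string& pattern,
                             const std::function<std::string(const std::string&)>& lookup) {
    const std::string open = "%q{";
    std::string result;
    std::size_t pos = 0;
    while (pos < pattern.size()) {
        std::size_t start = pattern.find(open, pos);
        if (start == std::string::npos) {
            result.append(pattern, pos, std::string::npos);  // no more placeholders
            break;
        }
        std::size_t end = pattern.find('}', start + open.size());
        if (end == std::string::npos) {
            result.append(pattern, pos, std::string::npos);  // unterminated placeholder
            break;
        }
        result.append(pattern, pos, start - pos);  // literal text before the placeholder
        result += lookup(pattern.substr(start + open.size(), end - start - open.size()));
        pos = end + 1;
    }
    return result;
}

// Lookup backed by the process environment; in this sketch, unset
// variables expand to the empty string.
std::string fromEnvironment(const std::string& name) {
    const char* value = std::getenv(name.c_str());
    return value ? std::string(value) : std::string();
}
```

For a process whose `OMPI_COMM_WORLD_RANK` is `2`, `expandEnvPattern("timeline.%q{OMPI_COMM_WORLD_RANK}.pdm", fromEnvironment)` yields `timeline.2.pdm`. Passing the lookup as a parameter keeps the function testable without touching the real environment.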

MPI PROFILING

nvprof daemon mode

1. Shell 1: start nvprof in daemon mode

$ nvprof --profile-all-processes -o out.%h.%p.%q{OMPI_COMM_WORLD_RANK}
<nvprof listens in daemon mode>

2. Shell 2: run the MPI application

$ mpirun -n 4 ./simpleMPI

3. Shell 1: nvprof writes the results

<profiling data is generated>

MPI PROFILING

Importing into the Visual Profiler

MPI PROFILING

Visual Profiler

MPI Rank-based naming

NVTX Markers & Ranges

MPI PROFILING

MPI + NVTX

Manual mode

nvtxEventAttributes_t range = {0};
range.version = NVTX_VERSION;
range.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
range.messageType = NVTX_MESSAGE_TYPE_ASCII;
range.message.ascii = "MPI_Scatter";
nvtxRangePushEx(&range);  // takes a pointer to the attributes
int result = MPI_Scatter(...);
nvtxRangePop();

Interception mode

1. Auto-generate mpi_interception.so

2. Set LD_PRELOAD=mpi_interception.so

3. Run your MPI app with nvprof

MPI calls will be auto-annotated using NVTX.

https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler/

MPI PROFILING

Interception

MPI app

int res = MPI_Scatter(...);

Interception library (LD_PRELOAD)

int MPI_Scatter(...) {
    nvtxRangePushEx(&range);  // open an NVTX range named after the MPI call
    int res = PMPI_Scatter(...);  // forward to the real implementation
    nvtxRangePop();
    return res;
}

MPI library

int PMPI_Scatter(...)
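The wrapper above pairs every push with a pop by hand. One possible variation, not taken from the slides, is to let a C++ RAII guard do the pairing so the range is closed on every exit path. The sketch below uses stand-in functions (`rangePush`, `rangePop`, `realScatter`, `scatter`) instead of the real NVTX and MPI calls so that it compiles on its own. All of these names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-ins for illustration only: none of these are real NVTX or MPI calls.
// `trace` records the order of events a profiler would observe.
static std::vector<std::string> trace;

void rangePush(const std::string& name) { trace.push_back("push " + name); }
void rangePop() { trace.push_back("pop"); }

// Plays the role of PMPI_Scatter, the real entry point inside the MPI library.
int realScatter(int status) {
    trace.push_back("scatter");
    return status;
}

// RAII guard: pushes a range on construction and pops it on destruction,
// so the pop also happens on early returns or exceptions.
class ScopedRange {
public:
    explicit ScopedRange(const std::string& name) { rangePush(name); }
    ~ScopedRange() { rangePop(); }
    ScopedRange(const ScopedRange&) = delete;
    ScopedRange& operator=(const ScopedRange&) = delete;
};

// Plays the role of the interposed MPI_Scatter wrapper.
int scatter(int status) {
    ScopedRange range("MPI_Scatter");
    return realScatter(status);
}
```

A call to `scatter` records a push, then the forwarded call, then a pop, which matches the ordering of the hand-written wrapper. In a real interception library the guard would wrap `nvtxRangePushEx` and `nvtxRangePop`, and the wrapper would forward to `PMPI_Scatter`.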

REMOTE PROFILING

NVVP: MULTI-HOP REMOTE PROFILING

Host: Visual Profiler

Login Node: Script

Compute Node: CUDA Application

Links between nodes: ssh, ssh, scp

NVVP: MULTI-HOP REMOTE PROFILING

One-Time Setup

1. Connect Visual Profiler to the login node

2. Configure script on the login node

3. Use the custom script option

NVVP: MULTI-HOP REMOTE PROFILING

Application Profiling

1. Select custom script, then create a remote session as usual

2. Application transparently runs on the compute node and profiling data is displayed in the Visual Profiler

VOLTA SUPPORT

GV100 Device Attributes

GPU Trace

OTHER IMPROVEMENTS

Tracing and profiling of Cooperative Kernel launches is supported

The Visual Profiler supports remote profiling to systems supporting ssh algorithms with a key length of 2048 bits

OpenACC profiling is now supported on systems without CUDA setup

nvprof flushes all profiling data when a SIGINT or SIGKILL signal is encountered

REFERENCES

NVIDIA toolkit documentation: http://docs.nvidia.com

CUDA Profiler User Guide: http://docs.nvidia.com/cuda/profiler-users-guide/index.html

Other GTC 2017 sessions:

S7824 - DEVELOPER TOOLS UPDATE IN CUDA 9

S7519 - DEVELOPER TOOLS FOR AUTOMOTIVE, DRONES AND INTELLIGENT CAMERAS APPLICATIONS

S7445 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING WHOLE APPLICATION PERFORMANCE

S7444 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS
