GPU Computing with Matlab® @ CBI Laboratory


Overview
• GPU History & Hardware
  – GPU History
  – CPU vs. GPU Hardware
  – Parallelism Design Points
• GPU Software Infrastructure (CUDA)
• Matlab Parallel Computing Toolbox, GPU Computing
• GPU nodes @ CBI Lab
• Examples
• Additional Features


GPU History

3D object model, e.g.:
• A circle of radius R, centered at (x, y, z)
• Color = Blue
• Light source at (x, y, z)

2-dimensional screen
Goal: answer the question, for pixel (X, Y) on the screen, what is my (R, G, B) value?


2-dimensional screen: much parallelism is available, and the screen refresh rate << processor clock rate.


GPU model: assembly line concept. High latency BUT high throughput.
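The assembly line idea can be made concrete with a small timing model. The sketch below is illustrative only: it is written in Python rather than Matlab, it assumes an idealized pipeline in which every stage takes the same time, and the function names and numbers are hypothetical rather than taken from the slides.

```python
def pipeline_time(num_items, num_stages, stage_time):
    """Total time for num_items to pass through an ideal assembly line.

    Assumption: every stage takes stage_time and a new item can enter
    the line as soon as the first stage is free.
    """
    if num_items == 0:
        return 0.0
    # The first item must visit every stage (the latency); each later
    # item then leaves the line one stage_time after the previous one.
    return (num_stages + num_items - 1) * stage_time


def serial_time(num_items, num_stages, stage_time):
    """Total time when each item is finished before the next one starts."""
    return num_items * num_stages * stage_time


# Hypothetical numbers: 5 stages of 1 time unit each, 1000 items.
latency_one_item = pipeline_time(1, 5, 1.0)   # a single item still waits 5 units
total_pipelined = pipeline_time(1000, 5, 1.0)
total_serial = serial_time(1000, 5, 1.0)
items_per_unit_time = 1000 / total_pipelined  # approaches 1 item per stage_time
```

A single item is no faster than before (high latency), but with many items in flight the line finishes close to one item per stage time (high throughput).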


GPU History

Graphics pipeline (figure): each triangle enters as 3 vertices (x1,y1,z1), (x2,y2,z2), (x3,y3,z3) and passes through a chain of matrix multiplications on its way to the screen:
• 3D to 3D: matrix multiplication, e.g. rotation
• 3D to 3D: matrix multiplication, e.g. translation, rotation, scaling
• 3D to 2D: matrix multiplication, e.g. 3-D to 2-D projection (perspective projection)

Many independent computations: streams of triangles & vertices.

The more calculators, the more points we can move around in the same amount of time.
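The chain of matrix multiplications described on this slide can be sketched for a single triangle as follows. This is a minimal Python illustration, not code from the presentation: it assumes homogeneous 4x4 matrices, a simple pinhole projection with a hypothetical focal length, and helper names (`mat_vec`, `rotation_z`, `translation`, `perspective`, `to_screen`) invented for the example.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rotation_z(angle_rad):
    """Homogeneous 4x4 rotation about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]


def translation(dx, dy, dz):
    """Homogeneous 4x4 translation."""
    return [[1, 0, 0, dx], [0, 1, 0, dy], [0, 0, 1, dz], [0, 0, 0, 1]]


def perspective(focal_length):
    """Maps (x, y, z, 1) to (f*x, f*y, z, z); dividing by the last
    component then gives the 2D point (f*x/z, f*y/z)."""
    f = focal_length
    return [[f, 0, 0, 0], [0, f, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0]]


def to_screen(vertex, stages):
    """Push one 3D vertex through every matrix stage, then divide by w."""
    point = [vertex[0], vertex[1], vertex[2], 1.0]
    for matrix in stages:
        point = mat_vec(matrix, point)
    w = point[3]
    return (point[0] / w, point[1] / w)


# Hypothetical scene: rotate by 90 degrees, move 4 units away, project.
stages = [rotation_z(math.pi / 2), translation(0.0, 0.0, 4.0), perspective(2.0)]
triangle = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# Each vertex is independent of the others, which is the parallelism
# that graphics hardware exploits.
screen_points = [to_screen(v, stages) for v in triangle]
```

Every vertex goes through the same sequence of multiplications and none depends on another, so the loop over vertices could be spread across as many "calculators" as are available.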


Why must we be limited to performing a single type of function? The answer involves the start of General Purpose GPU Computing: allow the programmer to create custom functions (a.k.a. kernels) that run in parallel.


GPU vs. CPU
Different goals: a fast food restaurant vs. anywhere there are long lines of people waiting

Higher latency, exceptionally high throughput:
• An individual may need to wait a long time in line, but many more people go through the system during the course of a day.
• Workers are always kept busy, even if the current person, say, forgets a document and needs to wait for someone to deliver it, since there are many people waiting in line.
• More workers, with smaller desks per worker.
• Use as much of the building space as possible to add workers.

Lower latency, good throughput:
• An individual waits as little as possible in line.
• Workers are always kept busy by having large local caches of supplies, both at the store and at the work counters.
• Subdivide 1 task into smaller tasks and increase the speed of each smaller task (ILP & pipelining).
• Try to find parallelism within 1 task (out-of-order execution).
• Try to predict what people may order to get a head start (branch prediction).
• Trying to optimize for minimum wait time for a single user uses up resources (workers + space where you could have put more workers).

Which column maps to CPU and which to GPU?


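Answer: the higher-latency, exceptionally-high-throughput column describes the GPU; the lower-latency, good-throughput column describes the CPU.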


Parallelism Design Points
• Key: focus on dependency analysis
• How much of your program is independent determines potential parallelism (Amdahl's Law) … for a fixed amount of work in the parallel section…
• Gustafson's Law: do more work within parallel sections…
• Data transfer vs. compute (arithmetic intensity)
  – Cost of moving the data from CPU to GPU needs to be taken into account.
  – GPU may provide large benefit when (compute >> data I/O)
    • Going to the store to get 100 items with 10 workers: you ideally only want to make 1 trip for all 100 items.
    • Even if all 10 workers go to get their items in parallel, there is not much benefit if you make 10 round trips.
• Resource contention
  – Data transfer bandwidth
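The two laws named above can be compared numerically. The Python sketch below uses the standard textbook forms of Amdahl's and Gustafson's laws; the function names and the example fraction of 0.9 are illustrative assumptions, not figures from the slides.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's Law: speedup for a FIXED amount of total work.

    parallel_fraction is the share of the original (serial) run time
    that can be spread over the workers.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def gustafson_speedup(parallel_fraction, workers):
    """Gustafson's Law: scaled speedup when the parallel work GROWS
    with the number of workers.

    Here parallel_fraction is the share of the run time spent in
    parallel sections as measured on the parallel machine.
    """
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * workers


# Hypothetical program that is 90% parallel.
fixed_work_10 = amdahl_speedup(0.9, 10)      # about 5.3x
fixed_work_448 = amdahl_speedup(0.9, 448)    # still below 10x
scaled_work_10 = gustafson_speedup(0.9, 10)  # about 9.1x
```

With a fixed problem size the serial 10% caps the speedup below 10x no matter how many cores are added, which is why doing more work inside the parallel sections matters on hardware with hundreds of cores.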


Parallelism Design Points
• Resource limits (memory, disk)
• Hardware limits
  – Memory cache line sizes, memory alignment issues, disk block sizes, cache sizes, # queues, etc.
• Physical data organization (e.g. row major vs. column major)
• Conditional (if-else) minimization
  – Ideally you would hope to have 0 if statements in your functions…. Not always feasible for algorithm correctness.
• Synchronization
  – Algorithm correctness often requires some type of synchronization.
• Many more variables affect function-level, program-level, and system-level parallelism.
  – A function may be highly parallelizable, but overall system parallelism may involve looking at different levels of parallelism to achieve a good solution.
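The "physical data organization" point can be illustrated by showing where the same matrix element lands in memory under each layout. This is a small Python sketch with zero-based indices and hypothetical helper names; note that Matlab itself stores arrays in column-major order and uses one-based indices.

```python
def row_major_offset(row, col, num_rows, num_cols):
    """Position of element (row, col) when rows are stored one after another."""
    return row * num_cols + col


def column_major_offset(row, col, num_rows, num_cols):
    """Position of element (row, col) when columns are stored one after another."""
    return col * num_rows + row


def flatten(matrix, offset_fn):
    """Lay a matrix (list of rows) out in linear memory using offset_fn."""
    num_rows, num_cols = len(matrix), len(matrix[0])
    memory = [None] * (num_rows * num_cols)
    for r in range(num_rows):
        for c in range(num_cols):
            memory[offset_fn(r, c, num_rows, num_cols)] = matrix[r][c]
    return memory


matrix = [[1, 2, 3],
          [4, 5, 6]]

row_major = flatten(matrix, row_major_offset)        # rows are contiguous
column_major = flatten(matrix, column_major_offset)  # columns are contiguous
```

Walking along the contiguous direction touches neighboring memory locations, which is what cache lines and aligned memory accesses reward; walking across it jumps by a full row or column each step.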


GPU Hardware: Fermi Architecture [16]

Many resources are available at www.nvidia.com


GPU Software Infrastructure

CUDA: Compute Unified Device Architecture

Components shown in the software stack figure:
• GPU card(s) & system board with CPU, buses (PCIe), ...
• Operating system (Linux, Windows, etc.)
• CUDA Driver
• CUDA Runtime API
• CUDA Libraries, CUDA C/C++
• NVCC compiler + utilities (nvprof, Visual Profiler)
• PTX: Parallel Thread eXecution assembly language (virtual machine)
• CUBIN (CUDA binary)
• Applications (e.g. Matlab)


GPU Software Infrastructure
CUDA: Compute Unified Device Architecture
Software model: an abstraction of the hardware

Software to hardware mapping:
• Streams → GPU1, GPU2, …: compute & data transfer queues (order guaranteed within a single stream)
• Grids → GPU1, GPU2, …: run the same kernel (a.k.a. function)
• Blocks → SM (Streaming Multiprocessor): group of cooperating threads
  – 32 compute cores per SM in the Fermi architecture.
  – Blocks should be viewed as self-contained work units.
• Warps → SM (Streaming Multiprocessor): groups of 32 threads
  – The basic unit of execution: 32 threads running the same instruction in the same amount of time.
• Threads → compute core: execution context (keeps track of a core's state)
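The grid, block, warp, and thread hierarchy above implies some simple index arithmetic. The Python sketch below mimics the usual one-dimensional CUDA indexing convention (block index times block size plus thread index); the function names and the choice of 256 threads per block are hypothetical, and only the warp size of 32 comes from the slide.

```python
WARP_SIZE = 32  # threads per warp, as stated on the slide


def launch_config(num_elements, threads_per_block):
    """Return (blocks in the grid, warps per block) for a 1D problem."""
    # Ceiling division: enough blocks to give every element a thread.
    blocks = -(-num_elements // threads_per_block)
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    return blocks, warps_per_block


def global_thread_index(block_index, threads_per_block, thread_index):
    """Unique element index handled by one thread of one block."""
    return block_index * threads_per_block + thread_index


# Hypothetical launch: 1000 elements, 256 threads per block.
blocks, warps_per_block = launch_config(1000, 256)

# Threads whose global index falls past the data must do nothing,
# so the last block is only partly used.
active_in_last_block = sum(
    1
    for t in range(256)
    if global_thread_index(blocks - 1, 256, t) < 1000
)
```

Choosing a block size that is a multiple of 32 keeps every warp full, since a warp is the unit that actually executes an instruction.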


Matlab Parallel Computing Toolbox, GPU Computing

• gpuDevice(#)
• gpuDeviceCount()
• reset(gpuDevice(#))
• wait()
• bsxfun()
• gpuArray()
• gather()
• arrayfun()
• existsOnGPU()
• parallel.gpu.CUDAKernel()
• feval
• setConstantMemory
• Many GPU-enabled built-in functions, e.g. fft, …. Check with:
  – methods('gpuArray')

Matlab Parallel Computing Toolbox: with each release, more and more functions are enabled for transparent GPU support.


Matlab Parallel Computing Toolbox, GPU Computing

• Many GPU-enabled built-in functions, e.g. fft, fft2, …
  – Try running >> methods('gpuArray') to see the list of supported functions.


GPU Nodes @ CBI Lab

• 2 modes: Interactive & Batch
• Interactive: use for development
    $ ssh -Y [email protected]
    $ qlogin -q gpu.q -l gpuonly
    $ matlab &
• Batch mode: for production runs
  – Job script:
    #!/bin/bash
    #$ -q gpu.q
    #$ -l gpuonly
    [Source: http://www.cbi.utsa.edu/faq/sge/gpu]

Nvidia M2070: Fermi architecture, 448 CUDA cores, 14 multiprocessors @ 32 CUDA cores per multiprocessor

PuTTY + Xming can be used to access the Matlab GUI from a Windows system: http://cbi.utsa.edu/faq/xforwarding


GPU Nodes @ CBI Lab

$ qlogin -q gpu.q -l gpuonly

Matlab GUI access is also available from Windows, using PuTTY + X11 forwarding with Xming.


GPU Nodes @ CBI Lab

Commands shown on this slide:
    $ matlab &
    $ nvidia-smi
    $ top
    >> gpuDevice(#)


Built-in function support for GPU

•  4x +  y - 2z = 0
•  2x - 3y + 3z = 9
• -6x - 2y +  z = 0

• A*x = b
• A = [4 1 -2; 2 -3 3; -6 -2 1];
• b = [0; 9; 0];
• What is x?
  – x = A\b; x = [0.75; -2; 0.5];
  – Check each equation against b; all three should match if x is the correct solution of the system:
    •  4*0.75 + (-2) - (2*0.5) = 0
    •  2*0.75 + (-3*-2) + (3*0.5) = 9
    • -6*0.75 + (-2*-2) + 0.5 = 0

Quickly solving sets of linear equations has applications throughout science & engineering.

The \ operator is one of many functions that work on gpuArray data types.
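The solution quoted on this slide can be checked independently of Matlab. The Python sketch below solves the same 3x3 system with plain Gaussian elimination in exact rational arithmetic; it is a CPU-only stand-in written for illustration and makes no claim about how Matlab's \ operator is implemented on the CPU or the GPU. The function names are hypothetical.

```python
from fractions import Fraction


def solve_linear_system(A, b):
    """Solve A*x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b] in exact rational arithmetic.
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)]
           for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick the row with the largest entry in this column as pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every row below the pivot.
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    # Back substitution, from the last row up.
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def residual(A, x, b):
    """A*x - b, which is all zeros for an exact solution."""
    return [sum(a * xi for a, xi in zip(row, x)) - rhs
            for row, rhs in zip(A, b)]


A = [[4, 1, -2], [2, -3, 3], [-6, -2, 1]]
b = [0, 9, 0]
x = solve_linear_system(A, b)
check = residual(A, x, b)
```

Here `x` holds the fractions 3/4, -2 and 1/2, matching the slide's [0.75; -2; 0.5], and `check` is all zeros.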


Many Additional Features
• Using Matlab with GPU in batch mode via job script
• Calling .cu, .ptx code directly from Matlab
• Using the GPU from C/C++ code directly with the MEX interface
  – Allows incorporating custom GPU code into Matlab, as well as using Nvidia Nsight and Nvidia Visual Profiler for custom GPU algorithm development.


Demo

An example Matlab code running on a GPU system.


Appendix

Many applications are being enabled for GPU acceleration, e.g. NAMD for molecular dynamics using GPU:
http://www.nvidia.com/object/gpu-applications.html
http://www.nvidia.com/content/tesla/pdf/gpu-accelerated-applications-for-hpc.pdf

C/C++/Fortran library: AccelerEyes ArrayFire
https://developer.nvidia.com/accelereyes-arrayfire
http://www.accelereyes.com/examples/case_studies


Appendix

CUDA Internals: Valgrind + KCachegrind: libcudart.so visualization


References
[1] http://www.mathworks.com/help/distcomp/release-notes.html
[2] http://www.mathworks.com/help/distcomp/examples/benchmarking-a-b-on-the-gpu.html
[3] http://www.mathworks.com/help/distcomp/examples/illustrating-three-approaches-to-gpu-computing-the-mandelbrot-set.html
[4] http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html
[5] http://www.nvidia.com/docs/IO/105880/DS-Tesla-M-Class-Aug11.pdf
[6] http://en.wikipedia.org/wiki/Nvidia_Tesla#cite_note-11
[7] http://en.wikipedia.org/wiki/Rasterisation
[8] http://en.wikipedia.org/wiki/Perspective_projection#Perspective_projection
[9] http://en.wikipedia.org/wiki/GPGPU
[10] http://www.cbi.utsa.edu/faq/sge/gpu
[11] http://medim.sth.kth.se/6l2872/F/F11c.pdf (FFT registration)
[12] http://medim.sth.kth.se/6l2872/F/F11c.pdf
[13] http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf
[14] http://www.nvidia.com/docs/IO/122874/K20-and-K20X-application-performance-technical-brief.pdf
[15] http://en.wikipedia.org/wiki/Nvidia_Tesla
[16] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[17] http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
[18] https://www.udacity.com/wiki/cs344/Lesson_1_-_The_GPU_Programming_Model#latency-vs-bandwidth
[19] https://www.udacity.com/wiki/cs344
[20] http://www.computingbook.org/FullText.pdf
[21] http://en.wikipedia.org/wiki/Dynamic_random-access_memory
[22] http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2009/lec08-cache.html
[23] http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/computer-architecture-2012/lec03-fastest.html
[24] http://en.wikipedia.org/wiki/Gustafson%27s_law
[25] http://archive.hpcwire.com/hpc/705814.html
[26] http://www.johngustafson.net/pubs/pub13/amdahl.pdf
[27] http://spartan.cis.temple.edu/shi/public_html/docs/amdahl/amdahl.html
[28] http://software.intel.com/en-us/articles/amdahls-law-gustafsons-trend-and-the-performance-limits-of-parallel-applications


Acknowledgements

• This project received computational, research & development, software design/development support from the Computational System Biology Core/Computational Biology Initiative, funded by the National Institute on Minority Health and Health Disparities (G12MD007591) from the National Institutes of Health. URL: http://www.cbi.utsa.edu


Contact Us

http://cbi.utsa.edu
