C++ AMP: Accelerated Massive Parallelism in Visual C++ 11 Kate Gregory Gregory Consulting @gregcons

C++ AMP: Accelerated Massive Parallelism in Visual C++ 11Kate GregoryGregory Consultingwww.gregcons.com/kateblog, @gregcons

C++ is the language for performance

If you need speed at runtime, you use C++Frameworks and libraries can make your code faster

Eg PPL: use all the CPU coresWith little or no change to your logic and code

Experienced C++ developers are productive in C++Don’t want to go back to C or C-like languageEnjoy the tool support in Visual Studio

Many C++ developers value portabilityWrite standard C++, compile with anythingUse portable libraries, run anywhereEven in all-Microsoft universe, simple deployment is important

Cartoonizer

What is C++ AMP?Accelerated Massive Parallelism

Run your calculations on one or more acceleratorsToday, GPU is the accelerator you useEventually: other kinds of accelerators

Write your whole application in C++Not a “C-like” language or a separate resource you link inUse Visual Studio and familiar toolsSpeed up 20x, 50x, or more

Basically a libraryComes with Visual Studio 2012, included in vcredistSpec is open – other platforms/compilers can implement it too

Agenda

Hardware ReviewC++ AMP FundamentalsTilingDebugging and VisualizingCall to Action

CPUs vs GPUs today

Low memory bandwidthHigher power consumptionMedium level of parallelismDeep execution pipelinesRandom accessesSupports general codeMainstream programming

High memory bandwidthLower power consumptionHigh level of parallelismShallow execution pipelinesSequential accessesSupports data-parallel codeNiche programming

images source: AMD

C++ AMP is fundamentally a library

Comes with Visual C++ 2012#include <amp.h>Namespace: concurrencyNew classes:

array, array_viewextent, indexaccelerator, accelerator_view

New function(s): parallel_for_each()New (use of) keyword: restrict

Asks compiler to check your code is ok for GPU (DirectX)

parallel_for_each

Entry point to the libraryTakes number (and shape) of threads neededTakes function or lambda to be done by each thread

Must be restrict(amp) Sends the work to the accelerator

Scheduling etc handled thereReturns – no blocking/waiting

void AddArrays(int n, int * pA, int * pB, int * pSum){

for (int i=0; i<n; i++)

{ pSum[i] = pA[i] + pB[i]; }

#include <amp.h>using namespace concurrency;

void AddArrays(int n, int * pA, int * pB, int * pSum){ array_view<int,1> a(n, pA); array_view<int,1> b(n, pB); array_view<int,1> sum(n, pSum); parallel_for_each( sum.extent, [=](index<1> i) restrict(amp) { sum[i] = a[i] + b[i]; } );}

Hello World: Array Addition

void AddArrays(int n, int * pA, int * pB, int * pSum){

for (int i=0; i<n; i++)

{ pSum[i] = pA[i] + pB[i]; }

Basic Elements of C++ AMP codingvoid AddArrays(int n, int * pA, int * pB, int * pSum){ array_view<int,1> a(n, pA); array_view<int,1> b(n, pB); array_view<int,1> sum(n, pSum); parallel_for_each(

sum.extent, [=](index<1> i) restrict(amp) { sum[i] = a[i] + b[i];

array_view variables captured and associated data copied to accelerator (on demand)

restrict(amp): tells the compiler to check that this code conforms to C++ AMP language restrictions

parallel_for_each: execute the lambda on the accelerator once per thread

extent: the number and shape of threads to execute the lambda

index: the thread ID that is running the lambda, used to index into data

array_view: wraps the data to operate on the accelerator

extent<N> - size of an N-dim space

index<N> - an N-dimensional point

View on existing data on the CPU or GPUDense in least significant dimensionOf element T and rank NRequires extentRectangularAccess anywhere (implicit sync)

vector<int> v(10);

extent<2> e(2,5); array_view<int,2> a(e, v);

array_view<T,N>

//above two lines can also be written//array_view<int,2> a(2,5,v);

index<2> i(1,3);

int o = a[i]; // or a[i] = 16;//or int o = a(1, 3);

Matrix Multiplication

C00 = A00 * B00 + A01 * B10 + A02 * B20 + A03 * B30

restrict(amp) restrictions

Can only call other restrict(amp) functionsAll functions must be inlinableOnly amp-supported types

int, unsigned int, float, double, boolstructs & arrays of these types

Pointers and ReferencesLambdas cannot capture by reference, nor capture pointersReferences and single-indirection pointers supported only as local variables and function arguments

restrict(amp) restrictions

No recursion'volatile'virtual functionspointers to functionspointers to member functionspointers in structspointers to pointersbitfields

No goto or labeled statementsthrow, try, catchglobals or staticsdynamic_cast or typeidasm declarationsvarargsunsupported types

e.g. char, short, long double

vector<int> v(8 * 12);extent<2> e(8,12);accelerator acc = …array<int,2> a(e,acc.default_view);copy_async(v.begin(), v.end(), a);

array<T,N>

Multi-dimensional array of rank N with element TContainer whose storage lives on a specific acceleratorCapture by reference [&] in the lambdaExplicit copyNearly identical interface to array_view<T,N>

parallel_for_each(e, [&](index<2> idx) restrict(amp) { a[idx] += 1;});copy(a, v.begin());

Tiling

Rearrange algorithm to do the calculation in tilesEach thread in a tile shares a programmable cache

tile_static memoryAccess 100x as fast as global memoryExcellent for algorithms that use each piece of information again and again

Overload of parallel_for_each that takes a tiled extent

Race Conditions in the Cache

Because a tile of threads shares the programmable cache, you must prevent race conditions

Tile barrier can ensure a waitTypical pattern:

Each thread does a share of the work to fill the cacheThen waits until all threads have done that workThen uses the cache to calculate a share of the answer

Visual Studio 2012

DebuggingEverything you had before, plus: GPU ThreadsParallel StacksParallel Watch

Visualizing

Debugging

Can hit either CPU breakpoints or GPU breakpointsNew UI in VS 2012 to control that

GPU breakpoints: Only on Windows 8 today

Debugging Properties

Values, Call Stacks, etc

GPU Threads Window

Shows progress through the calculation

Parallel Watch

Shows values across multiple threads

And more!

Race Condition Detection Parallel StacksFlagging, Filtering, and GroupingFreezing and ThawingRun Tile to Cursor

Concurrency Visualizer

Shows activity on CPU and GPUCan highlight relative times for specific parts of a calculationOr copy times to/from the accelerator

C++ AMP is…

C++The language you knowExcellent productivity The language you choose when performance matters

Implemented as (mostly) a libraryVariety of application types

Well supported by Visual Studio 2012DebuggerConcurrency VisualizerEverything else you already use

Can be supported by other compilers and platformsOpen spec

Learn C++ AMP

book http://www.gregcons.com/cppamp/

training http://www.acceleware.com/cpp-amp-training

videos http://channel9.msdn.com/Tags/c++-accelerated-massive-parallelism

articles http://blogs.msdn.com/b/nativeconcurrency/archive/2012/04/05/c-amp-articles-in-msdn-magazine-april-issue.aspx

samples http://blogs.msdn.com/b/nativeconcurrency/archive/2012/01/30/c-amp-sample-projects-for-download.aspx

guides http://blogs.msdn.com/b/nativeconcurrency/archive/2012/04/11/c-amp-for-the-cuda-programmer.aspx

spec http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/03/c-amp-open-spec-published.aspx

forum http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads

http://blogs.msdn.com/nativeconcurrency/

Call to Action

Get Visual Studio 2012Download some samplesPlay with debugger and other toolsTry writing a C++ AMP application of your own

Console (command prompt)WindowsMetro style for Windows 8

Measure your performance and see the difference

C++ AMP: Accelerated Massive Parallelism in Visual C++ 11 Kate Gregory Gregory Consulting @gregcons

Documents

Georgina Gregory Guitar Manuscript.edimburg c.1830 40

C/C++ Programming Abstractions for Parallelism and Concurrency

Hardware Parallelism vs. Software Parallelism · 2019-02-25 · Hardware shared library ... Hardware Parallelism vs. Software Parallelism USENIX Workshop on Hot Topics in Parallelism

Optimistic Intra-Transaction Parallelism on Chip-Multiprocessors Chris Colohan 1, Anastassia Ailamaki 1, J. Gregory Steffan 2 and Todd C. Mowry 1,3 1 Carnegie

Parallelism: Avoiding Faulty Parallelism

Parallelism in C++ - Louisiana State Universitystellar.cct.lsu.edu/pubs/parallelism_in_cpp.pdf · Parallelism in C++ •C++11 introduced lower level abstractions std::thread, std::mutex,

Concurrency and Parallelism with C++17 and C++20/23 · Concurrency and Parallelism with C++17 and C++20/23 Rainer Grimm Training, Coaching and, Technology Consulting

PARALLELISM PARALLELISM PARALLELISM

Strict Fork-Join Parallelism · N3409=12-0099: Strict Fork-Join Parallelism Page 2 of 18 1 Vision Parallel C++ programming should be a seamless extension of serial C++ programming

Parallelism: Memory Consistency Models and Scheduling Todd C. Mowry & Dave Eckhardt

Parallelism and Orders of Signification (Parallelism ...journal.oraltradition.org/files/articles/31ii/11_31.2.pdf · Parallelism and Orders of Signification (Parallelism Dynamics

Parallator Transformation towards C++ parallelism using … · Parallator Transformation towards C++ parallelism using TBB ... chitecture from uniprocessor systems to multiprocessing

Parallelism in C++pdiehl/teaching/2019/4977/Parallelism in C… · Resume consumer thread Locality 2 Execute Future: Producer thread Future object Result is being returned Enables

Shared Memory Parallelism - OpenMP Sathish Vadhiyar Credits/Sources: OpenMP C/C++ standard (openmp.org) OpenMP tutorial (

Lecture 8: OpenMP. Parallel Programming Models Parallel Programming Models: Data parallelism / Task parallelism Explicit parallelism / Implicit parallelism

1. Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism

Modernizing Legacy C++ Code KATE GREGORY GREGORY CONSULTING LIMITED @GREGCONS JAMES MCNELLIS MICROSOFT VISUAL C++ @JAMESMCNELLIS

Parallelism in the Standard C++: What to Expect in C++ 17

Determinis)c+Parallel+Algorithms+ and+Programming+blelloch/papers/ec2-11.pdf · parallel Deterministic parallelism General parallelism EC2, 2011 2 ! Parallelism: using multiple processors/cores

Exposing and exploiting internal parallelism Gregory R ...reports-archive.adm.cs.cmu.edu/anon/home/ftp/2003/CMU-CS-03-12… · performance features of using this device parallelism