C++ AMP: Accelerated Massive Parallelism in Visual C++ 11 Kate Gregory Gregory Consulting @gregcons

Preview:

Citation preview

C++ AMP: Accelerated Massive Parallelism in Visual C++ 11Kate GregoryGregory Consultingwww.gregcons.com/kateblog, @gregcons

C++ is the language for performance

If you need speed at runtime, you use C++Frameworks and libraries can make your code faster

Eg PPL: use all the CPU coresWith little or no change to your logic and code

Experienced C++ developers are productive in C++Don’t want to go back to C or C-like languageEnjoy the tool support in Visual Studio

Many C++ developers value portabilityWrite standard C++, compile with anythingUse portable libraries, run anywhereEven in all-Microsoft universe, simple deployment is important

demo

Cartoonizer

What is C++ AMP?Accelerated Massive Parallelism

Run your calculations on one or more acceleratorsToday, GPU is the accelerator you useEventually: other kinds of accelerators

Write your whole application in C++Not a “C-like” language or a separate resource you link inUse Visual Studio and familiar toolsSpeed up 20x, 50x, or more

Basically a libraryComes with Visual Studio 2012, included in vcredistSpec is open – other platforms/compilers can implement it too

Agenda

Hardware ReviewC++ AMP FundamentalsTilingDebugging and VisualizingCall to Action

CPUs vs GPUs today

CPU

Low memory bandwidthHigher power consumptionMedium level of parallelismDeep execution pipelinesRandom accessesSupports general codeMainstream programming

GPU

High memory bandwidthLower power consumptionHigh level of parallelismShallow execution pipelinesSequential accessesSupports data-parallel codeNiche programming

images source: AMD

C++ AMP is fundamentally a library

Comes with Visual C++ 2012#include <amp.h>Namespace: concurrencyNew classes:

array, array_viewextent, indexaccelerator, accelerator_view

New function(s): parallel_for_each()New (use of) keyword: restrict

Asks compiler to check your code is ok for GPU (DirectX)

parallel_for_each

Entry point to the libraryTakes number (and shape) of threads neededTakes function or lambda to be done by each thread

Must be restrict(amp) Sends the work to the accelerator

Scheduling etc handled thereReturns – no blocking/waiting

void AddArrays(int n, int * pA, int * pB, int * pSum){

for (int i=0; i<n; i++)

{ pSum[i] = pA[i] + pB[i]; }

}

#include <amp.h>using namespace concurrency;

void AddArrays(int n, int * pA, int * pB, int * pSum){ array_view<int,1> a(n, pA); array_view<int,1> b(n, pB); array_view<int,1> sum(n, pSum); parallel_for_each( sum.extent, [=](index<1> i) restrict(amp) { sum[i] = a[i] + b[i]; } );}

Hello World: Array Addition

void AddArrays(int n, int * pA, int * pB, int * pSum){

for (int i=0; i<n; i++)

{ pSum[i] = pA[i] + pB[i]; }

}

Basic Elements of C++ AMP codingvoid AddArrays(int n, int * pA, int * pB, int * pSum){ array_view<int,1> a(n, pA); array_view<int,1> b(n, pB); array_view<int,1> sum(n, pSum); parallel_for_each(

sum.extent, [=](index<1> i) restrict(amp) { sum[i] = a[i] + b[i];

} );}

array_view variables captured and associated data copied to accelerator (on demand)

restrict(amp): tells the compiler to check that this code conforms to C++ AMP language restrictions

parallel_for_each: execute the lambda on the accelerator once per thread

extent: the number and shape of threads to execute the lambda

index: the thread ID that is running the lambda, used to index into data

array_view: wraps the data to operate on the accelerator

extent<N> - size of an N-dim space

index<N> - an N-dimensional point

View on existing data on the CPU or GPUDense in least significant dimensionOf element T and rank NRequires extentRectangularAccess anywhere (implicit sync)

vector<int> v(10);

extent<2> e(2,5); array_view<int,2> a(e, v);

array_view<T,N>

//above two lines can also be written//array_view<int,2> a(2,5,v);

index<2> i(1,3);

int o = a[i]; // or a[i] = 16;//or int o = a(1, 3);

demo

Matrix Multiplication

Matrix Multiplication

C00 = A00 * B00 + A01 * B10 + A02 * B20 + A03 * B30

restrict(amp) restrictions

Can only call other restrict(amp) functionsAll functions must be inlinableOnly amp-supported types

int, unsigned int, float, double, boolstructs & arrays of these types

Pointers and ReferencesLambdas cannot capture by reference, nor capture pointersReferences and single-indirection pointers supported only as local variables and function arguments

restrict(amp) restrictions

No recursion'volatile'virtual functionspointers to functionspointers to member functionspointers in structspointers to pointersbitfields

No goto or labeled statementsthrow, try, catchglobals or staticsdynamic_cast or typeidasm declarationsvarargsunsupported types

e.g. char, short, long double

vector<int> v(8 * 12);extent<2> e(8,12);accelerator acc = …array<int,2> a(e,acc.default_view);copy_async(v.begin(), v.end(), a);

array<T,N>

Multi-dimensional array of rank N with element TContainer whose storage lives on a specific acceleratorCapture by reference [&] in the lambdaExplicit copyNearly identical interface to array_view<T,N>

parallel_for_each(e, [&](index<2> idx) restrict(amp) { a[idx] += 1;});copy(a, v.begin());

Tiling

Rearrange algorithm to do the calculation in tilesEach thread in a tile shares a programmable cache

tile_static memoryAccess 100x as fast as global memoryExcellent for algorithms that use each piece of information again and again

Overload of parallel_for_each that takes a tiled extent

Race Conditions in the Cache

Because a tile of threads shares the programmable cache, you must prevent race conditions

Tile barrier can ensure a waitTypical pattern:

Each thread does a share of the work to fill the cacheThen waits until all threads have done that workThen uses the cache to calculate a share of the answer

Visual Studio 2012

DebuggingEverything you had before, plus: GPU ThreadsParallel StacksParallel Watch

Visualizing

Debugging

Can hit either CPU breakpoints or GPU breakpointsNew UI in VS 2012 to control that

GPU breakpoints: Only on Windows 8 today

Debugging Properties

Values, Call Stacks, etc

GPU Threads Window

Shows progress through the calculation

Parallel Watch

Shows values across multiple threads

And more!

Race Condition Detection Parallel StacksFlagging, Filtering, and GroupingFreezing and ThawingRun Tile to Cursor

Concurrency Visualizer

Shows activity on CPU and GPUCan highlight relative times for specific parts of a calculationOr copy times to/from the accelerator

C++ AMP is…

C++The language you knowExcellent productivity The language you choose when performance matters

Implemented as (mostly) a libraryVariety of application types

Well supported by Visual Studio 2012DebuggerConcurrency VisualizerEverything else you already use

Can be supported by other compilers and platformsOpen spec

Learn C++ AMP

book http://www.gregcons.com/cppamp/

training http://www.acceleware.com/cpp-amp-training

videos http://channel9.msdn.com/Tags/c++-accelerated-massive-parallelism

articles http://blogs.msdn.com/b/nativeconcurrency/archive/2012/04/05/c-amp-articles-in-msdn-magazine-april-issue.aspx

samples http://blogs.msdn.com/b/nativeconcurrency/archive/2012/01/30/c-amp-sample-projects-for-download.aspx

guides http://blogs.msdn.com/b/nativeconcurrency/archive/2012/04/11/c-amp-for-the-cuda-programmer.aspx

spec http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/03/c-amp-open-spec-published.aspx

forum http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads

http://blogs.msdn.com/nativeconcurrency/

Call to Action

Get Visual Studio 2012Download some samplesPlay with debugger and other toolsTry writing a C++ AMP application of your own

Console (command prompt)WindowsMetro style for Windows 8

Measure your performance and see the difference

Related Content

Breakout Sessions (session codes and titles)

Hands-on Labs (session codes and titles)

Product Demo Stations (demo station title and location)

Related Certification Exam

Find Me Later At…

Required Slide*delete this box when your slide is finalized

Speakers, please list the Breakout Sessions, Labs, Demo Stations and Certification Exams that relate to your session. Also indicate when they can find you staffing in the TLC.

Track Resources

Resource 1

Resource 2

Resource 3

Resource 4

Required Slide *delete this box when your slide is finalized

Track PMs will supply the content for this slide, which will be inserted during the final scrub.

Resources

Connect. Share. Discuss.

http://northamerica.msteched.com

Learning

Microsoft Certification & Training Resources

www.microsoft.com/learning

TechNet

Resources for IT Professionals

http://microsoft.com/technet

Resources for Developers

http://microsoft.com/msdn

Complete an evaluation on CommNet and enter to win!

MS Tag

Scan the Tagto evaluate thissession now onmyTechEd Mobile

Required Slide *delete this box when your slide is finalized

Your MS Tag will be inserted here during the final scrub.

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to

be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS

PRESENTATION.

Recommended