Dmitri Nesteruk
http://activemesa.com http://spbalt.net http://devtalk.net

Unmanaged Parallelization via P/Invoke

Page 2: Unmanaged Parallelization via P/Invoke

“Premature optimization is the root of all evil.”

Donald Knuth, “Structured Programming with go to Statements”, ACM Computing Surveys, Vol. 6, No. 4, Dec. 1974, p. 268.

“In practice, it is often necessary to keep performance goals in mind when first designing software, but the programmer balances the goals of design and optimization.”

Wikipedia, http://en.wikipedia.org/wiki/Program_optimization

Page 3: Unmanaged Parallelization via P/Invoke

Brief intro
  Why unmanaged code?
  Why parallelize?
P/Invoke
SIMD
OpenMP
Intel stack: TBB, MKL, IPP
GPGPU: CUDA, Accelerator
Miscellanea

Page 4: Unmanaged Parallelization via P/Invoke

Today
  Threads & ThreadPool
  Sync structures
    Monitor.(Try)Enter/Exit
    ReaderWriterLock(Slim)
    Mutex
    Semaphore
  Wait handles
    Manual/AutoResetEvent
  Pulse & wait
  Async delegates
  Async simplifications
    F# async workflow
    AsyncEnumerator (PowerThreading)

Tomorrow
  Tasks and TaskManager
    Task
    Future
  Data-level parallelism
    Parallel.For/ForEach
  Parallel LINQ
    AsParallel()

Page 5: Unmanaged Parallelization via P/Invoke

Performance
Low-level (fine-tuning) framework
Instruction-level parallelism
GPU SIMD
General vectorization
Simple cross-machine framework
Managed interfaces for SIMD/MPI-optimized libraries
Threading tools
Debugging
Profiling
Inferencing
Cross-machine debugging
Task management UI

Page 6: Unmanaged Parallelization via P/Invoke

WTF?!? Isn’t C# 5% faster than C?
  It depends.
Why is there a difference?
  More safety (e.g., CLR array bounds checking)
  JIT: No auto-parallelization
  JIT: No SIMD
  Lack of fine control
IL can be every bit as fast as C/C++
  But this is only true for simple problems
  The code is only as good as the JITter

Page 7: Unmanaged Parallelization via P/Invoke

Libraries (MKL, IPP)

OpenMP

Intel TBB, Microsoft PPL

SIMD (CPU & GPGPU)

Page 8: Unmanaged Parallelization via P/Invoke

Part I

Page 9: Unmanaged Parallelization via P/Invoke

A way of calling unmanaged C++ from .NET
  Not the same as C++/CLI
For interaction with ‘legacy’ systems
Can pass data between managed and unmanaged code
  Literals (int, string)
  Pointers (e.g., pointer to array)
  Structures
Marshalling is taken care of by the runtime

Page 10: Unmanaged Parallelization via P/Invoke

Make a Win32 C++ DLL
    MYLIB_API int Add(int first, int second)
    {
        return first + second;
    }
Specify a post-build step to copy the DLL next to the .NET assembly
  Important: default DLL location is solution root
Build the DLL
Make a .NET application
    [DllImport("MyLib.dll")]
    public static extern int Add(int first, int second);
Call the method
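
The native side of the steps above might look like the following minimal sketch. The slides use MYLIB_API without showing its definition, so the macro below is an assumption modelled on what a Visual Studio DLL project usually generates; the non-Windows branch exists only so the sketch builds elsewhere.

```cpp
// mylib.cpp: a minimal sketch of the native DLL from the steps above.
// ASSUMPTION: MYLIB_API is not defined in the slides; this definition is a
// guess modelled on a typical Visual Studio Win32 DLL project.
#if defined(_WIN32)
  #define MYLIB_API __declspec(dllexport)
#else
  #define MYLIB_API  // no-op so the sketch also compiles on other platforms
#endif

// Exported with C++ linkage, so the export table holds a decorated name
// (the slides later show "?Add@@YAHHH@Z" for this signature).
MYLIB_API int Add(int first, int second)
{
    return first + second;
}
```

Running dumpbin /exports on the built DLL shows the exact name that the C# DllImport declaration has to resolve.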

Page 11: Unmanaged Parallelization via P/Invoke

Basic C# ↔ C++ Interop

Page 12: Unmanaged Parallelization via P/Invoke

DLL not found
  Make sure post-build step copies DLL to target folder
  Or that DLL is in PATH
An attempt was made to load DLL with incorrect format
  DLL relies on other DLLs which are not found
    Open Visual Studio command prompt
    Use dumpbin /dependents mylib.dll to find out
    Copy files to target dir
    This is common in Debug mode
  32-bit/64-bit mismatch
Entry point not found
  Make sure method names and signatures are equivalent
  Make sure calling convention matches
    [DllImport(…, CallingConvention = …)]
  On 64-bit systems, specify entry name explicitly
    Use dumpbin /exports
    [DllImport(…, EntryPoint = "?Add@@YAHHH@Z")]
    No, extern "C" does not help
It all works
  Congratulations!

Page 13: Unmanaged Parallelization via P/Invoke

Special cases
  String handling
    Unicode vs. ANSI
    LP(C)WSTR
  Arrays
    fixed
  Memory allocation
  Calling convention
  “Bitness” issues
  … and lots more!
Handling them
  Marshal
  MarshalAsAttribute
  [In] and [Out]
  StructLayout
  IntPtr
  … and lots more
Handle on a case-by-case basis
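
To make two of these special cases concrete, here is a hedged sketch of what the native side could look like for an array and for a Unicode string. SumArray and CountChars are hypothetical names, and the MYLIB_API definition is again an assumption. On the C# side a double[] argument normally arrives as a pointer to its first element, and a string declared with CharSet.Unicode arrives as a null-terminated wide string; the wide-string part assumes Windows, where wchar_t is 16 bits like a .NET char.

```cpp
#include <cwchar>

// ASSUMPTION: same export macro as in the basic interop example.
#if defined(_WIN32)
  #define MYLIB_API __declspec(dllexport)
#else
  #define MYLIB_API
#endif

// Hypothetical export: a raw pointer carries no size information,
// so the element count is passed explicitly alongside it.
MYLIB_API double SumArray(const double* data, int length)
{
    double total = 0.0;
    for (int i = 0; i < length; ++i)
        total += data[i];
    return total;
}

// Hypothetical export: receives a null-terminated wide string
// (LPCWSTR on Windows) and returns its length in characters.
MYLIB_API int CountChars(const wchar_t* text)
{
    return text ? static_cast<int>(std::wcslen(text)) : 0;
}
```

Passing the length separately is a design choice, not a requirement: the alternative is a sentinel-terminated buffer, which is what the string version relies on.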

Page 14: Unmanaged Parallelization via P/Invoke

Make sure signatures match
  Including return types!
To debug
  If your OS is 64-bit, make sure .NET assemblies compile in 32-bit mode
  Make sure unmanaged code debugging is turned on
  In 64-bit
    Launch target DLL with the .NET assembly as target
Good luck! :)
Visit the P/Invoke wiki @ http://pinvoke.net

Page 15: Unmanaged Parallelization via P/Invoke

Part II

Page 16: Unmanaged Parallelization via P/Invoke

An API for multi-platform shared-memory parallel programming in C/C++ and Fortran
Uses #pragma statements to decorate code
Easy!!!
  Syntax can be learned very quickly
  Can be turned off and on in project settings

Page 17: Unmanaged Parallelization via P/Invoke

Enable it (disabled by default)
Use it!
  No further action necessary
To use configuration API
  #include <omp.h>
  Call methods, e.g., omp_get_num_procs()
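
As a small illustration of the configuration API, the sketch below wraps omp_get_num_procs() so that the code still builds when OpenMP is switched off. AvailableProcessors is a hypothetical helper name; the _OPENMP macro is defined by the compiler only when OpenMP is enabled (/openmp in Visual C++, -fopenmp in GCC).

```cpp
#ifdef _OPENMP
  #include <omp.h>  // pulled in only when OpenMP is enabled
#endif

// Hypothetical helper: number of processors OpenMP can see, or 1 when the
// code is compiled without OpenMP support.
int AvailableProcessors()
{
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1;
#endif
}
```

Guarding on _OPENMP keeps a single source file usable in both configurations, which matches the "can be turned off and on" point above.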

Page 18: Unmanaged Parallelization via P/Invoke

void MultiplyMatricesDoubleOMP(int size, double* m1[], double* m2[], double* result[])
{
    int i, j, k;
    #pragma omp parallel for shared(size,m1,m2,result) private(i,j,k)
    for (i = 0; i < size; i++)
    {
        for (j = 0; j < size; j++)
        {
            result[i][j] = 0;
            for (k = 0; k < size; k++)
            {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }
}

#pragma omp parallel for
  Tells the compiler to divide the iterations of the following loop among threads
shared(size,m1,m2,result)
  Variables shared between all threads
private(i,j,k)
  Variables of which each thread gets its own copy, so values differ between threads
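
Calling this routine requires arrays of row pointers, which is easy to get wrong. The sketch below shows one way to build them on top of a contiguous row-major buffer. RowPointers and MultiplySquare are hypothetical helpers that are not part of the slides; the multiplication routine is repeated only so the example is self-contained.

```cpp
#include <cstddef>
#include <vector>

// Same routine as on the slide, repeated so the sketch is self-contained.
void MultiplyMatricesDoubleOMP(int size, double* m1[], double* m2[], double* result[])
{
    int i, j, k;
    #pragma omp parallel for shared(size, m1, m2, result) private(i, j, k)
    for (i = 0; i < size; i++)
        for (j = 0; j < size; j++)
        {
            result[i][j] = 0;
            for (k = 0; k < size; k++)
                result[i][j] += m1[i][k] * m2[k][j];
        }
}

// Hypothetical helper: builds the array of row pointers the routine expects
// on top of one contiguous, row-major buffer.
std::vector<double*> RowPointers(std::vector<double>& storage, int size)
{
    std::vector<double*> rows(size);
    for (int r = 0; r < size; ++r)
        rows[r] = storage.data() + static_cast<std::size_t>(r) * size;
    return rows;
}

// Hypothetical wrapper: multiplies two size x size row-major matrices.
std::vector<double> MultiplySquare(std::vector<double> a, std::vector<double> b, int size)
{
    std::vector<double> c(static_cast<std::size_t>(size) * size);
    std::vector<double*> ra = RowPointers(a, size);
    std::vector<double*> rb = RowPointers(b, size);
    std::vector<double*> rc = RowPointers(c, size);
    MultiplyMatricesDoubleOMP(size, ra.data(), rb.data(), rc.data());
    return c;
}
```

Without OpenMP enabled the pragma is ignored and the loop runs serially with the same result, since each iteration of the outer loop writes only its own row.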

Page 19: Unmanaged Parallelization via P/Invoke

Using OpenMP in your C++ app

Page 20: Unmanaged Parallelization via P/Invoke

Homepage: http://openmp.org
cOMPunity (community of OMP users): http://www.compunity.org/
OpenMP debug/optimization article (Russian): http://bit.ly/BJbPU
VivaMP (static analyzer for OpenMP code): http://viva64.com/vivamp-tool

Page 21: Unmanaged Parallelization via P/Invoke

Part III

Page 22: Unmanaged Parallelization via P/Invoke

Libraries save you from reinventing the wheel
  Tested
  Optimized (e.g., for multi-core, SIMD)
These typically have C++ and Fortran interfaces
  Some also have MPI support
Of course, there are .NET libraries too :)
The ‘trick’ is to use these libraries from C#
  Fortran-compatible API is tricky!
  Data structure passing can be quite arcane!

Page 23: Unmanaged Parallelization via P/Invoke

Intel makes multi-core processors
  Multi-core know-how
Intel Parallel Studio
  Parallel Composer
    C++ Compiler (autoparallelization, OpenMP 3.0)
    Libraries (Math Kernel Library, Integrated Performance Primitives, Threading Building Blocks)
    Parallel debugger extension
  Parallel Inspector (memory/threading errors)
  Parallel Amplifier (hotspots, concurrency, locks and waits)
  Parallel Advisor Lite

Page 24: Unmanaged Parallelization via P/Invoke

Low-level parallelization framework from Intel
Lets you fine-tune code for multi-core
Is a library
  Uses a set of primitives
Has OSS license

Page 25: Unmanaged Parallelization via P/Invoke

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
using namespace tbb;

// Functor
struct Average {
    float* input;
    float* output;
    void operator()( const blocked_range<int>& range ) const {
        for( int i=range.begin(); i!=range.end( ); ++i )
            output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f);
    }
};

// Note: The input must be padded such that input[-1] and input[n]
// can be used to calculate the first and last output values.
void ParallelAverage( float* output, float* input, size_t n ) {
    Average avg;
    avg.input = input;
    avg.output = output;
    parallel_for(blocked_range<int>( 0, n, 1000 ), avg);  // Library call
}

Page 26: Unmanaged Parallelization via P/Invoke

Integrated Performance Primitives
  High-performance libraries for
    Signal processing
    Image processing
    Computer vision
    Speech recognition
    Data compression
    Cryptography
    String manipulation
    Audio processing
    Video coding
    Realistic rendering
  Also support codec construction

Math Kernel Library
  Optimized, multithreaded library for math
  Support for
    BLAS
    LAPACK
    ScaLAPACK
    Sparse Solvers
    Fast Fourier Transforms
    Vector Math
    … and lots more

Page 27: Unmanaged Parallelization via P/Invoke

Part IV

Page 28: Unmanaged Parallelization via P/Invoke

CPU support for performing operations on large registers
  Normal-size data is loaded into 128-bit registers
  Operation on multiple elements with a single instruction
    E.g., add 4 numbers at once
Requires special CPU instructions
  Less portable
Supported in C++ via ‘intrinsics’
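
The "add 4 numbers at once" case can be sketched with SSE intrinsics as follows. Add4 is a hypothetical helper, and the scalar fallback branch is an addition of this sketch so that it also builds on CPUs without SSE, which illustrates the portability cost mentioned above.

```cpp
// Detect SSE at compile time (GCC/Clang define __SSE__, MSVC defines
// _M_X64 or _M_IX86_FP); SKETCH_HAS_SSE is a macro local to this sketch.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>  // SSE intrinsics
  #define SKETCH_HAS_SSE 1
#endif

// Hypothetical helper: adds four floats from a and b into out.
void Add4(const float* a, const float* b, float* out)
{
#ifdef SKETCH_HAS_SSE
    __m128 va = _mm_loadu_ps(a);             // load 4 floats, no alignment required
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  // one instruction adds all 4 lanes
#else
    for (int i = 0; i < 4; ++i)              // scalar fallback on non-SSE targets
        out[i] = a[i] + b[i];
#endif
}
```

The unaligned load and store variants are used here for safety; aligned data allows _mm_load_ps and _mm_store_ps instead.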

Page 29: Unmanaged Parallelization via P/Invoke

SSE is an instruction set
  Preceded by MMX (MMX intrinsics are n/a in 64-bit Visual C++ builds)
  Now SSE and SSE2
Compiler intrinsics
  C++ functions that map to one or more SSE assembly instructions
Determining support
  Use cpuid
  Non-issue if you
    Are a systems integrator
    Run your own servers (e.g., ASP.NET)

Page 30: Unmanaged Parallelization via P/Invoke

128-bit data types
  __m128
  __m128i (integer intrinsics, SSE2)
  __m128d (double intrinsics, SSE2)
Operations for load and set
  __m128 a = _mm_set_ps(1,2,3,4);
To get at data, dereference and choose type
  E.g., myValue.m128_f32[0] gets first float
Perform operations (add, multiply, etc.)
  E.g., _mm_mul_ps(first, second) multiplies two values yielding a third

Page 31: Unmanaged Parallelization via P/Invoke

Make or get data
  Either create with initialized values
    static __m128 factor = _mm_set_ps(1.0f, 0.3f, 0.59f, 0.11f);
  Or load it into a SIMD-sized memory location
    __m128 pixel;
    pixel.m128_f32[0] = s->m128i_u8[(p<<2)];
  Or convert an existing pointer (the address must be 16-byte aligned)
    __m128* pixel = (__m128*)(my_array + p);
Perform a SIMD operation
    pixel = _mm_mul_ps(pixel, factor);
Get the data
    const BYTE sum = (BYTE)(pixel.m128_f32[0]);
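
The three steps can be put together in a small, hedged sketch that weights one BGRA pixel with the factor shown above. Luminance is a hypothetical helper. Instead of the Visual C++ specific m128_f32 member, the sketch swaps in _mm_loadu_ps and _mm_storeu_ps to move data in and out, and it adds a scalar fallback for targets without SSE. Note that _mm_set_ps lists lanes from highest to lowest, so 0.11f ends up in lane 0.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>  // SSE intrinsics
  #define SKETCH_HAS_SSE 1
#endif

// Hypothetical helper: weighted brightness of one pixel.
// ASSUMPTION: the pixel is 4 bytes in B, G, R, A order.
unsigned char Luminance(const unsigned char* bgra)
{
    // Step 1: make the data (widen the four bytes to floats).
    float lanes[4] = { float(bgra[0]), float(bgra[1]), float(bgra[2]), float(bgra[3]) };
#ifdef SKETCH_HAS_SSE
    // Lane order is reversed relative to the argument list:
    // lane 0 = 0.11f (blue), lane 1 = 0.59f (green), lane 2 = 0.3f (red).
    const __m128 factor = _mm_set_ps(1.0f, 0.3f, 0.59f, 0.11f);
    __m128 pixel = _mm_loadu_ps(lanes);
    pixel = _mm_mul_ps(pixel, factor);  // Step 2: one multiply for all 4 lanes
    _mm_storeu_ps(lanes, pixel);        // Step 3: get the data back out
#else
    const float factor[4] = { 0.11f, 0.59f, 0.3f, 1.0f };
    for (int i = 0; i < 4; ++i)
        lanes[i] *= factor[i];
#endif
    // Sum the three colour lanes; the cast truncates towards zero.
    return static_cast<unsigned char>(lanes[0] + lanes[1] + lanes[2]);
}
```

Because the cast truncates and the weights are not exactly representable as floats, a pure white pixel may come out as 254 rather than 255; rounding before the cast would avoid that.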

Page 32: Unmanaged Parallelization via P/Invoke

Image processing with SIMD

Page 33: Unmanaged Parallelization via P/Invoke

Part V

Page 34: Unmanaged Parallelization via P/Invoke

Graphics cards have GPUs
  These are highly parallelized
    Pipelining
  Useful for graphics
GPUs are programmable
  We can do math ops on vectors
  Mainly float, with double support emerging

Page 35: Unmanaged Parallelization via P/Invoke

GPUs have programmable parts
  Vertex shader (vertex position)
  Pixel shader (pixel colour)
Treat data as texture (render target)
  Load inputs as texture
  Use pixel shader
  Get data from result texture
Special languages used to program them
  HLSL (DirectX)
  GLSL (OpenGL)
High-level wrappers (CUDA, Accelerator)

Page 36: Unmanaged Parallelization via P/Invoke

A Microsoft Research project
  Not for commercial use
Uses a managed API
Employs data-parallel arrays
  Int
  Float
  Bool
  Bitmap-aware
Requires PS 2.0

Page 37: Unmanaged Parallelization via P/Invoke

Sorry! No demo.
Accelerator does not work on 64-bit :(

Page 39: Unmanaged Parallelization via P/Invoke

If a library already exists, use it
If C# is fast enough, use it
To speed things up, try
  TPL/PLINQ
  Manual parallelization
  unsafe (can be combined with TPL)
If you are unhappy, then
  Write in C++
  Speculatively add OpenMP directives
  Fine-tune with TBB as needed

Page 40: Unmanaged Parallelization via P/Invoke

System.Drawing.Bitmap is slow
  Has very slow GetPixel()/SetPixel() methods
Can fix bitmap in memory and manipulate it in unmanaged code
What we need to pass in
  Pointer to bitmap image data
  Bitmap width and height
  Bitmap stride (bytes per row in memory, including any padding)
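
A native routine taking exactly these parameters could look like the sketch below. ToGrayscale is a hypothetical export, the MYLIB_API definition is an assumption, and the pixel layout (4 bytes per pixel in B, G, R, A order with a positive stride) is assumed rather than taken from the slides. On the C# side the values would typically come from Bitmap.LockBits, which returns a BitmapData with Scan0, Width, Height and Stride.

```cpp
#include <cstddef>

// ASSUMPTION: export macro as in the basic interop example.
#if defined(_WIN32)
  #define MYLIB_API __declspec(dllexport)
#else
  #define MYLIB_API
#endif

// Hypothetical export: converts a bitmap to grayscale in place.
// ASSUMPTIONS: 4 bytes per pixel (B, G, R, A); stride is the positive number
// of bytes per row and may be larger than width * 4 because of padding.
MYLIB_API void ToGrayscale(unsigned char* data, int width, int height, int stride)
{
    // Rows are independent, so the outer loop can be split across threads.
    #pragma omp parallel for
    for (int y = 0; y < height; ++y)
    {
        // Step by stride, not by width * 4, to skip any row padding.
        unsigned char* row = data + static_cast<std::ptrdiff_t>(y) * stride;
        for (int x = 0; x < width; ++x)
        {
            unsigned char* px = row + x * 4;
            // Integer approximation of 0.11 B + 0.59 G + 0.30 R.
            const int gray = (11 * px[0] + 59 * px[1] + 30 * px[2]) / 100;
            px[0] = px[1] = px[2] = static_cast<unsigned char>(gray);
        }
    }
}
```

Padding bytes and the alpha channel are left untouched, which is why the stride has to be passed separately from the width.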

Page 41: Unmanaged Parallelization via P/Invoke

Image-rendered headings with subpixel postprocessing (http://bit.ly/10x0G8)
  WPF FlowDocument for initial generation
  C++/OpenMP for postprocessing
  ASP.NET for serving the result
Freeform rendered text with OpenType features (http://bit.ly/1cCP50)
  Bitmap rendering in Direct2D (C++/lightweight COM API)
  OpenType markup language

Page 42: Unmanaged Parallelization via P/Invoke

The End