
Bolt C++ Standard Template Library for HSA by Ben Sander, AMD


DESCRIPTION

Overview of the Bolt C++ Standard Template Library for HSA Programming


BOLT: A C++ TEMPLATE LIBRARY FOR HSA
Ben Sander, AMD Senior Fellow


3 | BOLT | June 2012

MOTIVATION

§ Improve developer productivity
  –  Optimized library routines for common GPU operations
  –  Works with open standards (OpenCL™ and C++ AMP)
  –  Distributed as open source

§ Make GPU programming as easy as CPU programming
  –  Resembles the familiar C++ Standard Template Library
  –  Customizable via C++ template parameters
  –  Leverages high-performance shared virtual memory

§ Optimize for HSA
  –  Single source base for GPU and CPU
  –  Platform Load Balancing

C++ Template Library For HSA


AGENDA

§ Introduction and Motivation
§ Bolt Code Examples for C++ AMP and OpenCL™
§ ISV Proof Point
§ Single source code base for CPU and GPU
§ Platform Load Balancing
§ Summary


SIMPLE BOLT EXAMPLE

    #include <bolt/sort.h>
    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    int main() {
        // generate random data (on host)
        std::vector<int> a(1000000);
        std::generate(a.begin(), a.end(), rand);

        // sort, run on best device
        bolt::sort(a.begin(), a.end());
    }

§ Interface similar to the familiar C++ Standard Template Library

§ No explicit mention of C++ AMP or OpenCL™ (or GPU!)
  –  More advanced use cases allow the programmer to supply a kernel in C++ AMP or OpenCL™

§ Direct use of host data structures (i.e. std::vector)

§ bolt::sort implicitly runs on the platform
  –  Runtime automatically selects CPU or GPU (or both)


BOLT FOR C++ AMP : USER-SPECIFIED FUNCTOR

    #include <bolt/transform.h>
    #include <vector>

    struct SaxpyFunctor {
        float _a;
        SaxpyFunctor(float a) : _a(a) {}

        // restrict(cpu,amp): the operator may be compiled for both host and accelerator
        float operator()(const float &xx, const float &yy) restrict(cpu,amp) {
            return _a * xx + yy;
        }
    };

    int main() {
        SaxpyFunctor s(100);
        std::vector<float> x(1000000); // initialization not shown
        std::vector<float> y(1000000); // initialization not shown
        std::vector<float> z(1000000);

        bolt::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
    }


BOLT FOR C++ AMP : LEVERAGING C++11 LAMBDA

    #include <bolt/transform.h>
    #include <vector>

    int main() {
        const float a = 100;
        std::vector<float> x(1000000); // initialization not shown
        std::vector<float> y(1000000); // initialization not shown
        std::vector<float> z(1000000);

        // saxpy with C++ lambda
        bolt::transform(x.begin(), x.end(), y.begin(), z.begin(),
            [=] (float xx, float yy) restrict(cpu, amp) {
                return a * xx + yy;
            });
    }

§ Functor (“a * xx + yy”) now specified inline
§ Can capture variables from the surrounding scope (“a”), which eliminates the boilerplate class
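For comparison, the same SAXPY can be written against the standard library's std::transform, the CPU interface that the Bolt call resembles. This is a minimal host-only sketch in plain C++11; the helper name `saxpy` is an assumption made for illustration and is not part of Bolt.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Bolt): computes z[i] = a * x[i] + y[i] on the host.
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> z(x.size());
    // Same call shape as the Bolt example: two input ranges, one output
    // range, and a binary functor supplied as a capturing lambda.
    std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                   [=](float xx, float yy) { return a * xx + yy; });
    return z;
}
```

Apart from the namespace and the restrict(cpu, amp) qualifier, the call site has the same structure as the Bolt version, which is the resemblance the slides point to.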


BOLT FOR OPENCL™

    #include <clbolt/sort.h>
    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    int main() {
        // generate random data (on host)
        std::vector<int> a(1000000);
        std::generate(a.begin(), a.end(), rand);

        // sort, run on best device
        clbolt::sort(a.begin(), a.end());
    }

§ Interface similar to the familiar C++ Standard Template Library

§ clbolt uses OpenCL™ below the API level
  –  Host data copied or mapped to the GPU
  –  First call to clbolt::sort will generate and compile a kernel

§ More advanced use cases allow the programmer to supply a kernel in OpenCL™


BOLT FOR OPENCL™ : USER-SPECIFIED FUNCTOR

    #include <clbolt/transform.h>
    #include <vector>

    // BOLT_FUNCTOR defines the class for the host and also keeps its source as a string
    BOLT_FUNCTOR(SaxpyFunctor,
    struct SaxpyFunctor {
        float _a;
        SaxpyFunctor(float a) : _a(a) {}

        float operator()(const float &xx, const float &yy) {
            return _a * xx + yy;
        }
    };
    );

    void main2() {
        SaxpyFunctor s(100);
        std::vector<float> x(1000000); // initialization not shown
        std::vector<float> y(1000000); // initialization not shown
        std::vector<float> z(1000000);

        clbolt::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
    }

§ Challenge: OpenCL™ split-source model
  –  Host code in C or C++
  –  OpenCL™ code specified in strings

§ Solution:
  –  BOLT_FUNCTOR macro creates both host-side and string versions of the “SaxpyFunctor” class definition
     §  Class name (“SaxpyFunctor”) stored in TypeName trait
     §  OpenCL™ kernel code (SaxpyFunctor class def) stored in ClCode trait
  –  clbolt function implementation
     §  Can retrieve traits from class name
     §  Uses TypeName and ClCode to construct a customized transform kernel
     §  First call to clbolt::transform compiles the kernel
  –  Advanced users can directly create ClCode trait
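The dual host/string definition described above can be approximated with preprocessor stringification. The sketch below is an illustration only: `SKETCH_FUNCTOR` and `kernel_source_for` are hypothetical names, the trait layout is assumed, and none of this is Bolt's actual macro or kernel generator.

```cpp
#include <cassert>
#include <string>

// Primary templates for the two traits; the macro specializes them per functor.
template <typename T> struct TypeName { static std::string get() { return ""; } };
template <typename T> struct ClCode   { static std::string get() { return ""; } };

// Hypothetical macro in the spirit of BOLT_FUNCTOR (not Bolt's real definition).
// It emits the class definition for the host compiler and stores the same
// text as a string via the preprocessor's # operator. The variadic parameter
// keeps any top-level commas in the class body intact.
#define SKETCH_FUNCTOR(NAME, ...)                                   \
    __VA_ARGS__                                                     \
    template <> struct TypeName<NAME> {                             \
        static std::string get() { return #NAME; }                  \
    };                                                              \
    template <> struct ClCode<NAME> {                               \
        static std::string get() { return #__VA_ARGS__; }           \
    };

SKETCH_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor {
    float _a;
    SaxpyFunctor(float a) : _a(a) {}
    float operator()(const float& xx, const float& yy) const {
        return _a * xx + yy;
    }
};
)

// Placeholder for the step where a library routine splices both traits into
// kernel source; the real kernel text is not shown in the slides.
template <typename F>
std::string kernel_source_for() {
    assert(!TypeName<F>::get().empty());
    return ClCode<F>::get() + "\n// transform kernel instantiated for " +
           TypeName<F>::get();
}
```

Because the class text is captured at the point of definition, the library can recover both the name and the source from the functor's type alone, which is what lets a call such as clbolt::transform build its kernel on first use.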


BOLT: C++ AMP VS. OPENCL™

BOLT for C++ AMP

§ C++ template library for HSA
  –  Developer can customize data types and operations
  –  Provides library of optimized routines for AMD GPUs
§ C++ Host Language
§ Kernels marked with “restrict(cpu, amp)”
§ Kernels written in C++ AMP kernel language
  –  Restricted set of C++
§ Kernels compiled at compile-time
§ C++ Lambda Syntax Supported
§ Functors may contain array_view
§ Parameters can use host data structures (i.e. std::vector)
§ Parameters can be array or array_view types
§ Use “bolt” namespace

BOLT for OpenCL™

§ C++ template library for HSA
  –  Developer can customize data types and operations
  –  Provides library of optimized routines for AMD GPUs
§ C++ Host Language
§ Kernels marked with “BOLT_FUNCTOR” macro
§ Kernels written in OpenCL™ kernel language
  –  Subset of C99, with extensions (e.g. vectors, builtins)
§ Kernels compiled at runtime, on first call
  –  Some compile errors shown on first call
§ C++11 Lambda Syntax NOT supported
§ Functors may not contain pointers
§ Parameters can use host data structures (i.e. std::vector)
§ Parameters can be cl::Buffer or cl_buffer types
§ Use “clbolt” namespace


BOLT : WHAT’S NEW?

§ Optimized template library routines for common GPU functions
  –  For OpenCL™ and C++ AMP, across multiple platforms

§ Direct interfaces to host memory structures (i.e. std::vector)
  –  Leverage HSA unified address space and zero-copy memory
  –  C++ AMP array and cl::Buffer also supported if memory already on device

§ Bolt submits to the entire platform rather than a specific device
  –  Runtime automatically selects the device
  –  Provides opportunities for load-balancing
  –  Provides optimal CPU path if no GPU is available
  –  Override to specify a specific accelerator is supported
  –  Enables developers to fearlessly move to the GPU

§ Bolt will contain new APIs optimized for HSA Devices
  –  Multi-device bolt::pipeline, bolt::parallel_filter


EXEMPLARY ISV PROOF-POINT

§ “Hessian” kernel from “MotionDSP Ikena”
  –  Commercially available video enhancement software
  –  Optimized for CPU and GPU

§ Basic Hessian Algorithm
  –  Two input images I and W
  –  Transform, followed by reduce (“transform_reduce”)
     § For each pixel in image, compute 14 float coefficients
     § Sum the coefficients for all the pixels; the final result is 14 floats
  –  Complex, computationally intense, real-world algorithm

§ Developed multiple implementations of Hessian kernel
  –  CPU, GPU, Bolt

Hessian Algorithm Pseudo Code:

    // x,y are coordinates of pixel to transform
    // Pixel difference:
    It = W(y, x) - I(y, x);
    // central difference of left/right pixels:
    Ix = 0.5f * ( W(y, x+1) - W(y, x-1) );
    // central difference of top/bottom pixels:
    Iy = 0.5f * ( W(y+1, x) - W(y-1, x) );
    X = x dist of this pixel from center
    Y = y dist of this pixel from center
    …
    // Compute for each pixel:
    H[ 0] = (Ix*X+Iy*Y) * (Ix*X+Iy*Y)
    H[ 1] = (Ix*X-Iy*Y) * (Ix*X+Iy*Y)
    H[ 2] = (Ix*X-Iy*Y) * (Ix*X-Iy*Y)
    H[ 3] = (Ix       ) * (Ix*X+Iy*Y)
    H[ 4] = (Ix       ) * (Ix*X-Iy*Y)
    H[ 5] = (Ix       ) * (Ix       )
    H[ 6] = (Iy       ) * (Ix*X+Iy*Y)
    H[ 7] = (Iy       ) * (Ix*X-Iy*Y)
    H[ 8] = (Iy       ) * (Ix       )
    H[ 9] = (Iy       ) * (Iy       )
    H[10] = (It       ) * (Ix*X+Iy*Y)
    H[11] = (It       ) * (Ix*X-Iy*Y)
    H[12] = (It       ) * (Ix       )
    H[13] = (It       ) * (Iy       )
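The transform-then-reduce structure of the pseudo-code can be sketched as a serial host reference. Several details here are assumptions rather than anything the slides state: the `Image` type, the row-major layout, skipping the one-pixel border, and measuring X and Y as signed offsets from the image center. The steps elided by “…” in the pseudo-code are not reproduced.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Coeffs = std::array<float, 14>;

// Hypothetical row-major image container used only for this sketch.
struct Image {
    std::size_t width, height;
    std::vector<float> data;
    float operator()(std::size_t y, std::size_t x) const { return data[y * width + x]; }
};

// Transform step: the 14 per-pixel coefficients listed in the pseudo-code.
Coeffs hessian_pixel(const Image& I, const Image& W, std::size_t y, std::size_t x) {
    const float It = W(y, x) - I(y, x);
    const float Ix = 0.5f * (W(y, x + 1) - W(y, x - 1));
    const float Iy = 0.5f * (W(y + 1, x) - W(y - 1, x));
    // Assumption: signed distance of the pixel from the image center.
    const float X = static_cast<float>(x) - 0.5f * static_cast<float>(I.width - 1);
    const float Y = static_cast<float>(y) - 0.5f * static_cast<float>(I.height - 1);
    const float p = Ix * X + Iy * Y;  // (Ix*X + Iy*Y)
    const float m = Ix * X - Iy * Y;  // (Ix*X - Iy*Y)
    return Coeffs{{p * p, m * p, m * m, Ix * p, Ix * m, Ix * Ix, Iy * p,
                   Iy * m, Iy * Ix, Iy * Iy, It * p, It * m, It * Ix, It * Iy}};
}

// Reduce step: sum the coefficients over all interior pixels.
Coeffs hessian(const Image& I, const Image& W) {
    assert(I.width == W.width && I.height == W.height);
    assert(I.data.size() == I.width * I.height && W.data.size() == I.data.size());
    Coeffs sum{};  // zero-initialized accumulator
    for (std::size_t y = 1; y + 1 < I.height; ++y) {
        for (std::size_t x = 1; x + 1 < I.width; ++x) {
            const Coeffs h = hessian_pixel(I, W, y, x);
            for (std::size_t k = 0; k < sum.size(); ++k) sum[k] += h[k];
        }
    }
    return sum;
}
```

The per-pixel function is the "transform" and the accumulation loop is the "reduce"; a transform_reduce routine would take the two as separate arguments and own the iteration itself.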


13 | The Programmer’s Guide to a Universe of Possibility | June 12, 2012

LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
(Exemplary ISV “Hessian” Kernel)

[Chart: lines of code (LOC axis, 0 to 350) broken down into Init, Compile, Copy, Launch, Algorithm and Copy-back, together with relative performance (axis 0 to 35.00), for Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP and HSA Bolt. The plotted values are not recoverable from the extracted text.]


PERFORMANCE PORTABILITY - INTRODUCTION

§ For many algorithms, the core operation is the same between CPU and GPU
  –  See sort, saxpy, hessian examples
  –  Same core operation
  –  Differences in how data is routed to the core operation

§ Bolt hides the device-specific routing details inside the library function implementation
  –  GPU implementations:
     § GPU-friendly data strides
     § Launch enough threads to hide memory latency
     § Group Memory and work-group communication
  –  CPU implementations:
     § CPU-friendly data strides
     § Launch enough threads to use all cores
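The split between a shared core operation and device-specific routing can be sketched as follows. Both routines run serially on the host and only model the access patterns; `Square`, `reduce_blocked` and `reduce_strided` are hypothetical names, and this is an assumption about what "data strides" means in practice, not Bolt's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Core operation, identical for both paths.
struct Square {
    long long operator()(int v) const { return static_cast<long long>(v) * v; }
};

// CPU-style routing: each worker walks one contiguous block, which keeps
// a core's accesses local to its own cache lines.
template <typename Op>
long long reduce_blocked(const std::vector<int>& in, std::size_t workers, Op op) {
    assert(workers > 0);
    long long total = 0;
    const std::size_t block = (in.size() + workers - 1) / workers;
    for (std::size_t w = 0; w < workers; ++w) {  // one iteration models one worker
        const std::size_t begin = w * block;
        const std::size_t end = std::min(in.size(), begin + block);
        for (std::size_t i = begin; i < end; ++i) total += op(in[i]);
    }
    return total;
}

// GPU-style routing: work-item `id` starts at element `id` and strides by the
// number of work-items, so neighboring work-items read neighboring elements.
template <typename Op>
long long reduce_strided(const std::vector<int>& in, std::size_t work_items, Op op) {
    assert(work_items > 0);
    long long total = 0;
    for (std::size_t id = 0; id < work_items; ++id) {  // one iteration models one work-item
        for (std::size_t i = id; i < in.size(); i += work_items) total += op(in[i]);
    }
    return total;
}
```

The functor passed as `op` never changes; only the loop structure around it does, which is the part a library can hide from the caller.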


PERFORMANCE PORTABILITY – RESULTS

[Chart: CPU Performance vs Programming Model (Exemplary ISV "Hessian" Kernel). Relative performance (axis 0.00 to 4.50) for Serial CPU, TBB CPU, OpenCL (CPU) and HSA Bolt (CPU). The plotted values are not recoverable from the extracted text.]


PERFORMANCE PORTABILITY – WHAT’S NEW?

§ New GPU programming models are close to CPU programming models
  –  C++ AMP: single-source, (restricted) C++11 kernel language, high-quality debugger/profiler, etc.

§ Shared Virtual Memory in HSA
  –  Removes tedious copies between address spaces
  –  Will allow use of complex pointer-containing data structures

§ Fewer performance cliffs in modern GPU architectures (e.g. AMD GCN)
  –  Reduces the need for GPU-specific optimizations in the core operation
  –  Example: 14:7:1 bandwidth ratio for Group:Cache:Global memory

§ Autovectorization
  –  Modern compilers include auto-vectorization support
  –  Restrictions of GPU programming models facilitate vectorization

§ Finally, Bolt functors can provide device-specific implementations if needed


HSA LOAD BALANCING : KEY FEATURES AND OBSERVATIONS

§ High-performance shared virtual memory
  –  Developers no longer have to worry about data location (i.e. device vs host)

§ HSA platforms have tightly integrated CPU and GPU
  –  GPU better at wide vector parallelism, extracting memory bandwidth, latency hiding
  –  CPU better at fine-grained vector parallelism, cache-sensitive code, control-flow

§ Bolt Abstractions
  –  Provide insight into the characteristics of the algorithm
     § Reduce vs Transform vs parallel_filter
  –  Abstraction above the details of a “kernel launch”
     § Don’t need to specify device, workgroup shape, work-items, number of kernels, etc.
     § Runtime may optimize these for the platform

§ Bolt has access to both optimized CPU and GPU implementations, at the same time
  –  Let’s use both!


EXAMPLES OF HSA LOAD-BALANCING

Example | Description | Exemplary Use Cases
Data Size | Run large data sizes on GPU, small on CPU. | Same call-site used for varying data sizes.
Parallel_filter | GPU scans all candidates and rejects early mismatches; CPU more deeply evaluates the survivors. | Haar detector, word search, audio search.
Heterogeneous Pipeline | Run a pipelined series of user-defined stages. Stages can be CPU-only, GPU-only, or CPU or GPU. | Video processing pipeline.
Platform Super-Device | Distribute workgroups to available processing units on the entire platform. | Kernel has similar performance/energy on CPU and GPU.
Border/Edge Optimization | Run wide center regions on GPU, run border regions on CPU. | Image processing.
Reduction | Run initial reduction phases on GPU, run final stages on CPU. | Any reduction operation.
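The Reduction entry can be illustrated with a two-phase sum. This is a host-only model under stated assumptions: `partial_sums` stands in for the wide first phase that would run on the GPU, `two_phase_sum` finishes on the CPU, and both names and the `group_size` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Phase 1 (GPU in the table's scheme): one partial sum per group-sized chunk.
// The chunks are independent, so they could all be processed in parallel.
std::vector<long long> partial_sums(const std::vector<int>& in, std::size_t group_size) {
    assert(group_size > 0);
    std::vector<long long> partials;
    for (std::size_t begin = 0; begin < in.size(); begin += group_size) {
        const std::size_t end = std::min(in.size(), begin + group_size);
        partials.push_back(std::accumulate(in.begin() + begin, in.begin() + end, 0LL));
    }
    return partials;
}

// Phase 2 (CPU in the table's scheme): the few remaining values are summed
// serially, where launching another wide kernel would not pay off.
long long two_phase_sum(const std::vector<int>& in, std::size_t group_size) {
    const std::vector<long long> partials = partial_sums(in, group_size);
    return std::accumulate(partials.begin(), partials.end(), 0LL);
}
```

With shared virtual memory the partial results would not need to be copied between the phases, which is what makes handing the tail to the CPU cheap on an HSA platform.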


HETEROGENEOUS PIPELINE

§ Mimics a traditional manufacturing assembly line
  –  Developer supplies a series of pipeline stages
  –  Each stage processes its input token, passes an output token to the next stage
  –  Stages can be either CPU-only, GPU-only, or CPU/GPU

§ CPU/GPU tasks are dynamically scheduled
  –  Use queue depth and estimated execution time to drive scheduling decision
  –  Adapt to variation in target hardware or system utilization
  –  Data location not an issue in HSA
  –  Leverage single source code

§ GPU kernels scheduled asynchronously
  –  Completion invokes next stage of the pipeline

§ Simple Video Pipeline Example:
  Video Decode (CPU-only) -> Video Processing (CPU/GPU) -> Video Render (GPU-only)
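The scheduling decision for a CPU/GPU stage, driven by queue depth and estimated execution time, might look like the sketch below. Everything in it is an assumption made for illustration: the type names, the cost model (every queued task is assumed to cost the same as the new one), and the tie-break toward the GPU are not taken from Bolt.

```cpp
#include <cassert>
#include <cstddef>

enum class Device { CPU, GPU };
enum class Affinity { CpuOnly, GpuOnly, Either };

// Hypothetical per-device state consulted by the scheduler.
struct DeviceState {
    std::size_t queue_depth;  // tasks already waiting on this device
    double est_task_ms;       // estimated execution time of this stage on this device
};

// Estimated time until a newly queued task would finish on the device.
// Simplification: queued tasks are assumed to cost est_task_ms each.
double estimated_finish_ms(const DeviceState& d) {
    assert(d.est_task_ms >= 0.0);
    return static_cast<double>(d.queue_depth + 1) * d.est_task_ms;
}

// Fixed-affinity stages go where they must; flexible stages go to whichever
// device is expected to finish first (ties go to the GPU in this sketch).
Device choose_device(Affinity a, const DeviceState& cpu, const DeviceState& gpu) {
    if (a == Affinity::CpuOnly) return Device::CPU;
    if (a == Affinity::GpuOnly) return Device::GPU;
    return estimated_finish_ms(gpu) <= estimated_finish_ms(cpu) ? Device::GPU
                                                                : Device::CPU;
}
```

Because the inputs are measured at run time, the same pipeline can shift its flexible stage between devices as queues fill up or as the hardware changes, which is the adaptation the slide describes.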


CASCADE DEPTH ANALYSIS

[Chart: cascade depth (scale 0 to 25) with legend bins 0-5, 5-10, 10-15, 15-20 and 20-25. The plotted data is not recoverable from the extracted text.]


PARALLEL_FILTER

§ Target applications with a “Filter” pattern
  –  Filter out a small number of results from a large initial pool of candidates
  –  Initial phases best run on GPU:
     § Large data sets (too big for caches), wide vector, high-bandwidth
  –  Tail phases best run on CPU:
     § Smaller data sets (may fit in cache), divergent control flow, fine-grained vector width
  –  Examples: Haar detector, word search, acoustic search

§ Developer specifies:
  –  Execution Grid
  –  Iteration state type and initial value
  –  Filter function
     § Accepts a point to process and the current iteration state
     § Returns True to continue processing or False to exit

§ BOLT / HSA Runtime
  –  Automatically hands off work between CPU and GPU
  –  Balances work by adjusting the split point between GPU and CPU
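The filter pattern and its split point can be modeled on the host as below. This is not the bolt::parallel_filter interface, which the slides do not show: `filter_sketch`, the explicit `total_stages` bound and the fixed `split_stage` are assumptions for illustration, and both phases run serially here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only model of the pattern. `filter(point, state)` advances
// `state` by one stage and returns true to keep the candidate or false to
// reject it.
template <typename State, typename Filter>
std::vector<std::size_t> filter_sketch(std::size_t grid_size, State initial,
                                       std::size_t total_stages,
                                       std::size_t split_stage, Filter filter) {
    assert(split_stage <= total_stages);

    // Early phase (GPU in the slides): every grid point runs the first stages.
    std::vector<std::size_t> survivors;
    std::vector<State> states;
    for (std::size_t p = 0; p < grid_size; ++p) {
        State s = initial;
        bool alive = true;
        for (std::size_t stage = 0; alive && stage < split_stage; ++stage)
            alive = filter(p, s);
        if (alive) {
            survivors.push_back(p);
            states.push_back(s);  // state travels with the candidate at hand-off
        }
    }

    // Tail phase (CPU in the slides): only survivors run the remaining stages.
    std::vector<std::size_t> accepted;
    for (std::size_t i = 0; i < survivors.size(); ++i) {
        bool alive = true;
        for (std::size_t stage = split_stage; alive && stage < total_stages; ++stage)
            alive = filter(survivors[i], states[i]);
        if (alive) accepted.push_back(survivors[i]);
    }
    return accepted;
}
```

Moving `split_stage` changes how much work each phase does but not the accepted set, which is why a runtime is free to adjust the split point for load balance.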


SUMMARY

§ Bolt: C++ Template Library
  –  Optimized GPU and HSA Library routines
  –  Customizable via templates
  –  For both OpenCL™ and C++ AMP

§ Enjoy the unique advantages of the HSA Platform
  –  High-performance shared virtual memory
  –  Tightly integrated CPU and GPU

§ Enable advanced HSA features
  –  A single source base for CPU and GPU
  –  Platform load balancing across CPU and GPU

C++ Template Library For HSA


BACKUP


BENCHMARK CONFIGURATION INFORMATION

§ Slides 13, 15
  –  AMD A10-5800K APU with Radeon™ HD Graphics
     § CPU: 4 cores, 3800 MHz (4200 MHz Turbo)
     § GPU: AMD Radeon™ HD 7660D, 6 compute units, 800 MHz
     § 4 GB RAM
  –  Software:
     § Windows 7 Professional SP1 (64-bit OS)
     § AMD OpenCL™ 1.2 AMD-APP (937.2)
     § Microsoft Visual Studio 11 Beta


Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. [For AMD-speakers only] © 2012 Advanced Micro Devices, Inc. [For non-AMD speakers only] The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.