12/04/23 Copyright © 2009 Petapath Limited. All rights reserved.
Petapath
Dairsie Latimer and Michal Harasimiuk
Programming for High Performance Accelerated Systems
Petapath
Hardware
Software
Tools
Consulting
ClearSpeed
NVIDIA
AMD
Intel
Joint Petapath/HP PRACE WP8 Prototype system at SARA/NCF
6U
10 TFLOPS
7 kW
• 20 racks, 1.125 PFLOPS, end of 2009
• 500 kW
• Alternative systems – 15x the size, 8x the power
Programming for High Performance Accelerated Systems
• Overview of the development environment at SARA/NCF
• Options for programming heterogeneous systems
• Moving software development flows from multi-core to heterogeneous systems
• Developing with OpenCL going forward
Petapath/HP PRACE Prototype system at SARA/NCF
ClearSpeed Software Development environment at SARA/NCF
• ClearSpeed SDK Version 3.1
  • Binary compatible across all ClearSpeed based products
• Cn Optimising Compiler
  • C with poly extensions for SIMD data types
• Debugger – a port of gdb
  • Runs on hardware
• Profiler – csprof
  • Allows system-wide visualization of an accelerated application’s performance while running on both a multi-core host and ClearSpeed accelerators
• Libraries (BLAS, RNG and FFT) & High level APIs (CSPX)
ClearSpeed graphical debug interface for the heterogeneous systems
• Standard Eclipse graphical debug interface for CSX processors
• CSX processors provide full hardware debugging of running application code
• Provides seamless view of many processor cores in parallel with their associated state
• Allows full symbolic debug of the Cn language
• Enhanced views for CSX specific information
Images used with permission of ClearSpeed Technology Plc
ClearSpeed profiler for heterogeneous and multi-processor systems
[Figure: block diagram labelled Host CPU(s), Advance™ Accelerator Board, CSX 600, Pipeline]
HOST/BOARD INTERACTION
View host/board interactions. Provides performance information for data transfer operations. Trace cluster node/board interaction. See overlap of host compute and board compute.

CSX PIPELINE
View detailed instruction issue information. Visualize overlap of executing instructions. Optimize code at the instruction level. View instruction level performance bottlenecks. Get accurate instruction timing.

CSX SYSTEM
View system level trace. Visually inspect the overlap of compute and I/O. Visualize cache utilization. View branch trace of code executing. Find and analyse performance bottlenecks. Get accurate event timing.

HOST CODE PROFILING
Visually inspect host code executing. Supports multiple threads and processes. Time specific code sections. See overlap of host threads executing. Platform and processor agnostic trace collection.
Programming for High Performance Accelerated Systems
Introduction
• Heterogeneous systems are now increasingly common
• They are being adopted at the top (Top500) and the bottom (technical workstation) of the HPC market
• Acceleration can deliver significant performance and cost savings over traditional COTS HPC systems
• However, there are real barriers to adoption:
  • Software support and programming models
  • Host system requirements
• In order to take advantage of this new technology trend, what are the realistic options?
• Some important things to consider:
• Single or multi-use system?
• Where do the majority of the cycles go?
• ISV codes or Open Source/Custom Codes?
• Sufficient development resources?
• Starting with application source, what is the best way to target heterogeneous computing today?
• Proprietary development environments and hardware:
  • Advance™/Cn (ClearSpeed)
  • Tesla™/CUDA (NVIDIA)
  • Stream™/Brook+ (AMD)
  • FPGA based solutions
• Or via Third Party/Middleware:
  • RapidMind™ Platform
  • CAPS HMPP™
  • PGI’s x64+GPU Accelerate Model
  • e.g. Mitrion Development Platform for FPGAs
• These options can loosely be categorised into:
• Language
  • Cn, CUDA, Brook+, Mitrion C, OpenCL
• Directive based or hybrid approaches
  • PGI x64+GPU, CAPS HMPP, RapidMind
• Allow re-targetable support
  • Can potentially support multiple vendor development environments
• Library
• All the languages have a library component
  • Manages hardware resources and runtime interaction
• Can also provide higher level abstractions such as standard library support, e.g. BLAS or LAPACK
• Some libraries are available from third parties that are designed to transparently interface ISV applications to accelerator hardware
• Often the best implementations are available from the vendors themselves
What comes next?
• Industry will inevitably move towards available open standards
• We believe that the Khronos Group’s OpenCL™ (Open Computing Language) will be a key enabler in the wider adoption of heterogeneous systems
• Petapath are members of the Khronos Group and participants on the OpenCL working group
OpenCL
• OpenCL is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems
• OpenCL provides a uniform programming environment for software developers
• Developers can write efficient, portable code for a range of high-performance systems and a diverse mix of multi-core and parallel processors
• OpenCL consists of:
  • An API for coordinating parallel computation
  • A programming language for describing those computations
• Specifically, the OpenCL standard defines:
  • Subset of the C99 language with extensions for parallelism
  • API for coordinating data and task-based parallel computations
  • Numerical requirements based on the IEEE 754 standard
  • Interoperability with other Khronos standards such as OpenGL
  • An abstraction layer for a diverse range of computational resources
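The split between a kernel language and a coordinating API can be sketched without an OpenCL installation. What follows is a pure Python emulation, not the OpenCL API: `vector_add_kernel` and `enqueue_nd_range` are hypothetical names standing in for a kernel body and for a host-side launch over a one-dimensional index space.

```python
def vector_add_kernel(global_id, a, b, out):
    """Kernel body: one work-item handles exactly one element."""
    out[global_id] = a[global_id] + b[global_id]


def enqueue_nd_range(kernel, global_size, *args):
    """Host side: run the kernel once per work-item in a 1-D index space.

    A real runtime would execute the work-items in parallel on the device;
    this emulation (an assumption for illustration) runs them sequentially.
    """
    for global_id in range(global_size):
        kernel(global_id, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(a)
enqueue_nd_range(vector_add_kernel, len(a), a, b, out)
print(out)
```

The point of the sketch is the division of labour: the kernel describes the work for a single index, while the host decides how many indices exist and where the data lives.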
• OpenCL also specifies:
  • A rich set of built-in functions
  • Online or offline compilation and build of compute kernel executables
• Platform Layer API
  • Query, select and initialize compute devices
  • Create compute contexts and work-queues
• Runtime API
  • Execute compute kernels
  • Manage scheduling, compute and memory resources
• Is OpenCL a silver bullet?
  • Possibly not, but it’s an excellent place to start
• It’s a well supported open standard
  • OpenCL has complete cross vendor support
  • Most vendors are motivated to increase their market share in the HPC and technical computing market
• Write once, work on many platforms is attractive for ISVs
  • The lack of an open standard has certainly slowed adoption of support for heterogeneous systems outside of the academic community for compute intensive applications
• When will it be available?
  • Khronos™ Group ratified the OpenCL™ 1.0 specification at Siggraph Asia, December 9th 2008
  • Conformant vendor implementations available in Q3 2009
  • One vendor already has a public beta program
  • Others will not be far behind
• What are the principal reasons that make OpenCL attractive?
  • No reliance on proprietary programming languages
  • Cross vendor compatibility and interoperability
  • Cross platform support (Linux, Windows and OS X)
Observations
• The incentive to support heterogeneous systems has to be a clear business win, so companies who differentiate on innovation are more likely to adopt early
• Many large ISVs have long development cycles, and if their licensing model is core or socket based they will have to revise their charging structures
• Heterogeneous computing won’t really hit the mainstream, multi-application HPC market without ISV support
Software development flows on multi-core and heterogeneous systems
Host Software Development Practice (Single Core)
• Typical host development flow (Rinse, Profile, Repeat)
  • Use a naïve implementation (e.g. the infamous triple loop)
  • Compile (compiler choice can often be important)
  • Profile/Benchmark (use % of peak GFLOPS as a guide)
  • Throw some compiler switches
  • Repeat
• Some developers don’t get very far into this optimisation process
• Time vs Reward (Does it run fast enough yet?)
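As an illustration of the first two steps, here is a minimal sketch of the triple loop for matrix multiplication together with a "% of peak" calculation. The function names are hypothetical, and `PEAK_GFLOPS` is an assumed placeholder rather than a figure for any real processor. Python is used for brevity; the same loop nest in a compiled language is the usual starting point.

```python
import time


def matmul_naive(a, b):
    """Naive triple-loop product of two matrices given as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def percent_of_peak(flops, seconds, peak_gflops):
    """Achieved GFLOPS expressed as a percentage of the quoted peak."""
    achieved_gflops = flops / seconds / 1e9
    return 100.0 * achieved_gflops / peak_gflops


n = 64
a = [[1.0] * n for _ in range(n)]
b = [[2.0] * n for _ in range(n)]
start = time.perf_counter()
c = matmul_naive(a, b)
elapsed = time.perf_counter() - start

# An n x n product performs 2 * n**3 floating point operations.
PEAK_GFLOPS = 10.0  # assumed value; substitute the peak of your own processor
print(f"{percent_of_peak(2 * n**3, elapsed, PEAK_GFLOPS):.4f}% of assumed peak")
```

Tracking this single percentage across compiler choices and switches gives a simple answer to the "does it run fast enough yet?" question.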
Host Software Development Practice (Multi-core)
• Look for more scalable implementations
  • In the multi-core era look for algorithmically scalable solutions
  • This usually means looking to leverage architectural features
  • e.g. Make sure you are cache friendly and take advantage of Vector/SIMD support
• Compile, Profile/Benchmark (use % of peak GFLOPS as a guide)
• Throw compiler switches but also use compiler directives e.g. OpenMP which can require some changes to code
• The parameter space for these optimisations can be large
• Challenging even for the experienced
Host Software Development Practice (Pitfalls)
• The ‘memory wall’ is probably the biggest hurdle
• With more cores sharing an already scarce resource in main memory bandwidth, cache hygiene is very important!
• Once you fall out of cache, adding more cores can actually slow down your application
• Effective programming is about optimising bandwidth
  • Tools such as Acumem’s SlowSpotter™ are particularly useful!
• Deliberately skipping multi-node development as it’s a whole other subject and deserves its own track
Heterogeneous Systems Software Development Practice
• An implementation tuned for multi-core is a good starting point for porting to an accelerated system
• This is because available concurrency (via multi-threading and asynchronous operations) and data parallel operations will likely have been explicitly exposed
• In all but the most compute bound applications, effective implementations of data parallel problems are usually tuned to maximise cache bandwidth
• And to allow effective loop blocking and strip-mining transformations
• This set of optimisations provides a good template for developing an algorithm on an accelerated system
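Loop blocking can be sketched on the matrix multiplication example. The version below tiles all three loops so that each tile of the operands is reused while it is still resident in cache; on an accelerator the same tiles map naturally onto its smaller local memory. The function name and the default `block` size are assumptions for illustration, and the cache benefit only shows up in a compiled implementation.

```python
def matmul_blocked(a, b, block=32):
    """Square-matrix product with all three loops tiled by `block`."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Work on one tile at a time so its operands can stay in cache.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += a_ik * b[k][j]
    return c


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
m = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(matmul_blocked(m, identity, block=2))
```

The `min(...)` bounds handle matrix sizes that are not a multiple of the block size, which is the strip-mining remainder case.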
• First ascertain what the limiting factors on the host are:
• Bandwidth
  • Is your application bandwidth limited on the host?
  • In its most cache/memory friendly implementation does it scale and exhibit good cache behaviour?
  • GPU based accelerators have several times the BW to their local memories of even the latest servers
  • However, accelerators typically have less local memory than the host, so large working sets will have to be streamed from the host
  • Any significant and repeated data movement to and from the accelerator can often be a gating factor for overall application acceleration
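A back-of-the-envelope model shows how data movement gates the overall gain. This is a sketch under stated assumptions: transfers and compute do not overlap (the worst case), and the function name and all numbers in the usage line are hypothetical.

```python
def offload_speedup(host_seconds, kernel_speedup, bytes_moved, link_bytes_per_second):
    """Overall speedup when a host kernel is moved to an accelerator.

    Assumes host<->accelerator transfers are not overlapped with compute.
    """
    transfer_seconds = bytes_moved / link_bytes_per_second
    accelerated_seconds = host_seconds / kernel_speedup + transfer_seconds
    return host_seconds / accelerated_seconds


# Hypothetical case: 10 s on the host, a 20x faster kernel,
# and 8 GB moved over a 4 GB/s host interface.
print(offload_speedup(10.0, 20.0, 8e9, 4e9))
```

In this hypothetical case the 2 s of transfer dominates the 0.5 s of accelerated compute, so the 20x kernel yields only a 4x application speedup.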
• Compute
  • Is your application compute limited?
  • Is it single precision?
    • Single precision is still the clear advantage for GPU based accelerators
  • Does your application require double precision?
    • GPU based accelerators have less of a delta over x86 hosts in terms of pure DP performance
    • Lower GFLOP/$ and GFLOP/W vs Implementation Complexity
    • ClearSpeed has significant advantages in terms of GFLOP/W for applications needing double precision
General comments on using accelerators
• As for optimal multi-core development
• Make sure you are making the most of the architectural features
  • Occupancy vs. Latency hiding
  • Shared or local memory accesses
  • Consider using other memories (constant, texture etc.)
• Make sure you are maximising external memory bandwidth
  • Correct alignment and granularity vital
  • Must use coalesced memory accesses
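Why alignment and coalescing matter can be seen by counting how many aligned memory segments a group of threads touches in a single access. The model below is a simplification: the group size of 32 threads, 4-byte elements and 128-byte segments are illustrative assumptions, not the specification of any particular device, and the function name is hypothetical.

```python
def segments_touched(base_byte, stride_elems, threads=32, elem_bytes=4, segment_bytes=128):
    """Count the distinct aligned memory segments a group of threads reads.

    Thread t reads the element at base_byte + t * stride_elems * elem_bytes.
    Assumes base_byte is a multiple of elem_bytes, so no element straddles
    a segment boundary.
    """
    segments = set()
    for t in range(threads):
        address = base_byte + t * stride_elems * elem_bytes
        segments.add(address // segment_bytes)
    return len(segments)


for base, stride in [(0, 1), (4, 1), (0, 2), (0, 32)]:
    print(f"base={base:3d} stride={stride:2d} -> {segments_touched(base, stride)} segment(s)")
```

Under these assumptions a unit-stride, aligned access needs one segment, a misaligned start needs two, and a stride of 32 elements needs one segment per thread, which is the pattern coalescing rules are designed to avoid.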
Accelerator Software Development Pitfalls
• Pay attention to Amdahl’s Law
• Simply put it describes the limit of potential acceleration of an application due to parallelisation
• Applies equally to many multi-core implementations
• As you process the data parallel kernels faster, the data movement and other serial portions of the application start to dominate the actual runtime
• At this point the host interface to the accelerator can now be a bottleneck
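Amdahl's Law can be written as speedup = 1 / ((1 - p) + p / s), where p is the fraction of runtime that is accelerated and s is the speedup of that fraction. A minimal sketch follows; the function name and the example fraction of 0.9 are assumptions for illustration.

```python
def amdahl_speedup(parallel_fraction, kernel_speedup):
    """Overall speedup when only `parallel_fraction` of the runtime is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / kernel_speedup)


# With 90% of the runtime in data parallel kernels, the overall speedup
# approaches but never reaches 1 / (1 - 0.9) = 10, however fast the kernels get.
for s in (2, 10, 100, 1000):
    print(f"kernel speedup {s:5d}x -> application speedup {amdahl_speedup(0.9, s):.2f}x")
```

Any time spent moving data across the host interface belongs in the serial fraction, which is why it becomes the bottleneck as the kernels get faster.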
The future - Developing with OpenCL
OpenCL in use
• The Khronos Group’s conformance requirements for OpenCL will endeavour to ensure correctness of implementation between vendors
• A real challenge for those using OpenCL could well be managing varying performance characteristics of different OpenCL capable platforms
• Even different products by the same vendor may vary
• What works well on a multi-core CPU and efficiently on a massively parallel accelerator will likely vary
Will development methods and tools converge?
• How similar is the heterogeneous development environment to traditional host development?
• What tools are there to help the development process?
  • Do they all support a similar debug interface?
  • Do they all have similar profiling capabilities?
• Debug
  • Hardware gdb support?
    • ClearSpeed supports source level debug of Cn
    • NVIDIA in CUDA 2.1, CELL
    • Debug for Brook+ & pre-CUDA 2.1 was via host versions of kernels
• Profiling
  • gprof (supported by ClearSpeed in Cn)
    • Host API only support for gprof with NVIDIA
  • Hardware profiling?
    • ClearSpeed has a very sophisticated profiling and debugging environment
    • Other profilers currently report a more limited set of information for kernels running on HW
What will OpenCL have initially?
• Will these debug and profile tools support OpenCL out of the gate?
• With an open development environment now available, it makes sense to develop cross-platform tools that support OpenCL natively and, more importantly, across multiple vendors and operating systems
• Not having to use vendor specific tools will increase the likelihood that developers will not spend too much time tuning for each platform
Architectures targeted by OpenCL are similar, but different …
ClearSpeed CSX700
All Image Rights reserved by original copyright holders
NVIDIA GT200
Image Rights reserved by original copyright holders
AMD RV770
Image Rights reserved by original copyright holders
INTEL LARRABEE
Image Rights reserved by original copyright holders
Can we look forward to …
• Additional utilities and development tools available to the host based developer:
  • Intel® Compilers, MKL, IPP, VTune, Threading Building Blocks, Thread Checker (and soon Parallel Studio)
  • AMD Partner Compilers, CodeAnalyst, ACML
  • Acumem SlowSpotter
  • Allinea Tools
  • And a myriad of other third party tools …
Petapath Questions?