Deeper Look Into HSAIL And Its Runtime

Technical overview of the HSAIL and HSAIL runtime from AFDS by Norm Rubin.



HSAIL

Norm Rubin, Fellow

An introduction to the HSA Intermediate Language


2 | hsail AFDS | June 11, 2012

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

OpenCL is a trademark of Apple Inc. used with permission by Khronos. DirectX is a registered trademark of Microsoft Corporation.

© 2012 Advanced Micro Devices, Inc. All rights reserved.


3 | hsail AFDS | June 11, 2012

WHAT IS SPLIT COMPILATION?

An application starts as a source program

1) A high level compiler (HLC) generates HSAIL

2) The HSAIL is shipped to the target machine

3) A second compiler (a finalizer) turns HSAIL into ISA

Unlike traditional compilers, where optimization is contained in one part or done twice,

HSAIL allows optimization to be split into two parts

The heavy lifting goes to the HLC; the quick finish goes to the finalizer

HSAIL provides ways for an HLC and a finalizer to cooperate. For instance:

HSAIL provides a fixed number of registers.

HSA implementations might support a different number

When the HLC spills registers, it can use special operations that let the finalizer know where to use extra registers (a sketch follows below).
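
To make that concrete, here is a minimal hand-written sketch of what such a spill might look like. It follows the 2012 draft syntax as I read it; the kernel name, the %spill0 variable, its declaration form, and the st_spill/ld_spill spellings are illustrative assumptions, not text from the slides.

// Sketch only: an HLC that ran out of registers parks a value in the
// spill segment; a finalizer with spare registers can delete the pair.
version 1:0:$small;

kernel &spill_sketch(kernarg_u32 %arg_p)
{
  spill_u32 %spill0;                // one 32-bit spill slot (assumed declaration form)
  ld_kernarg_u32 $s0, [%arg_p];
  ld_global_u32 $s1, [$s0];         // value the HLC wants to keep live
  st_spill_u32 $s1, [%spill0];      // spill: the intent is visible to the finalizer
  // ... long stretch of code that needs the whole register file ...
  ld_spill_u32 $s1, [%spill0];      // reload; on a target with more registers the
                                    // finalizer can turn this back into a register
  st_global_u32 $s1, [$s0];
  ret;
};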


4 | hsail AFDS | June 11, 2012

SPLIT COMPILATION

(MEANS THERE HAVE TO BE WAYS TO PASS INFORMATION FROM HLC TO FINALIZER)

HLC – High level compiler

Lots of time

Info from source

Lots of aggressive optimizations

But limited (or no) knowledge of target

Finalizer

Very little time (we estimate that it will take close to linear time)

No info not in HSAIL (no back doors, almost)

Cannot be updated regularly (must be close to bug free)

Simple optimizations only

But knows the target

Exactly how to split some optimizations is still an open problem


5 | hsail AFDS | June 11, 2012

WHY A VIRTUAL ISA - WHY NOT JUST TARGET THE REAL ISA?

Targeting the real ISA gains performance (it can use every hardware trick), but a real ISA means one vendor / one chip family

A virtual ISA loses some performance (it cannot use every hardware trick), but in exchange:

Better time to market (because hardware is finished faster)

No legacy boat anchor

Can fix hardware bugs in software

Old and new code just works on old and new machines

Allows hardware innovation under the table

Features not in HSAIL are not exposed, and are hard to access


6 | hsail AFDS | June 11, 2012

Development tools at HSAIL level

Today the need for a complete tool chain for each core, each with its own technology, switches etc., is a significant maintenance problem.

Debuggability, reproducibility.

Because the same application needs to run on different pieces of hardware, current source code contains many conditional preprocessing directives. Programmers rely on compiler intrinsics and ad-hoc command-line arguments to drive the optimization. This severely impacts code readability and productivity, and the application binary tested and debugged on a workstation is different from the one that eventually runs on the system.

Platform openness.

Independent software vendors rarely have access to the tool chains needed to program the most powerful parts of the system, namely the DSPs and hardware accelerators. Virtualization can make the whole platform programmable, opening opportunities to third-party high-performance applications.

Performance through time to market.

Because of the finalizer, last-minute fixes can happen after the chip is finished. This means that the time to release a new part goes down. Less time per generation translates to better performance.


7 | hsail AFDS | June 11, 2012

GOALS OF HSAIL

1. Can support all of C++ (open up the GPU to mass programming, not only for specialists)

2. Avoid constant change (do not change the spec every chip)

3. Support accurate IEEE floating point math

4. Target lots of different machines

5. Allow for packed operations, SSE and friends, bytes/shorts/ints/doubles etc

6. Allow packed forms to save power

7. Make the model understandable

8. Make the finalizer fast (around linear time)

9. Make the finalizer simple (do not need monthly updates)

10. Less ambiguity in the spec (little undefined behavior)

11. Get good performance (little need to write in ISA)

12. Support all of OpenCL™ and C++ AMP™

13. Can ship linkable libraries in HSAIL

14. Clean up all nits in AMDIL

15. Allow the use of chip specific acceleration when it is a good idea


8 | hsail AFDS | June 11, 2012

HSAIL – LOTS OF NEW FEATURES

Lots of features not in OpenCL and C++ AMP

Enough to implement C++

Exceptions/ heterogeneous compute

Flat address space (work items on the GPU and agents on the CPU)

Because of hand written HSAIL, these features can be exposed early

Fine-grain barriers that work inside control flow, so you can implement producer-consumer models

Lots of cross wave operations – so you can quickly move data between lanes without loads and stores

Spec is available on the web site

The memory model shows how the CPU and GPU can cooperate

Support for image operations


9 | hsail AFDS | June 11, 2012

PARALLELISM MODEL


10 | hsail AFDS | June 11, 2012

WAVEFRONTS

Most developers will not care about wavefronts

Similar to cache line sizes

Experts can get good performance if they code to the cache line size

Compiler has to avoid breaking the developer's model

HSAIL formalizes the notion of wavefronts

you can tell which work item goes into which wavefront

you can write producer consumer parallelism between work groups


11 | hsail AFDS | June 11, 2012

AN EXAMPLE (IN OPENCL™)

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c,
                      const unsigned int n)
{
    // Get our global thread ID
    int id = get_global_id(0);

    // Make sure we do not go out of bounds
    if (id < n) {
        c[id] = a[id] + b[id];
    }
}


12 | hsail AFDS | June 11, 2012

VECTOR ADD A[0:N-1] = B[0:N-1] + C[0:N-1]

version 1:0:$small;

kernel &__OpenCL_vec_add_kernel(

kernarg_u32 %arg_a,

kernarg_u32 %arg_b,

kernarg_u32 %arg_c,

kernarg_u32 %arg_n)

{ @__OpenCL_vec_add_kernel_entry:

// BB#0: // %entry

ld_kernarg_u32 $s0, [%arg_n];

workitemaid $s1, 0;

cmp_lt_b1_u32 $c0, $s1, $s0;

ld_kernarg_u32 $s0, [%arg_c];

ld_kernarg_u32 $s2, [%arg_b];

ld_kernarg_u32 $s3, [%arg_a];

cbr $c0, @BB0_2;

brn @BB0_1;

@BB0_1: // %if.end

ret;

@BB0_2: // %if.then

shl_u32 $s1, $s1, 2;

add_u32 $s2, $s2, $s1;

ld_global_f32 $s2, [$s2];

add_u32 $s3, $s3, $s1;

ld_global_f32 $s3, [$s3];

add_f32 $s2, $s3, $s2;

add_u32 $s0, $s0, $s1;

st_global_f32 $s2, [$s0];

brn @BB0_1;

};


13 | hsail AFDS | June 11, 2012

MEMORY SEGMENTS

Memory is split into 7 segments

kernarg, global, arg, readonly, private, group, and spill

There is a single flat address space that contains everything, but it is often advantageous to tell the finalizer which segment to use (see the sketch below)

Load/store machine with registers

Some segments are used for intent –

– Spill indicates that the slot was used by the HLC for register spilling
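
As a rough, hand-written illustration of how segments appear in the code (a sketch only, not from the slides: the variable names are assumptions, %lds_buf and %tmp are assumed to be declared as group and private variables in the enclosing kernel, and the exact opcode spellings follow the pattern of the vec_add listing above):

// The segment is part of the opcode name; a plain ld/st goes through the flat address space.
ld_kernarg_u32 $s0, [%arg_ptr];     // read a kernel argument
ld_global_f32 $s1, [$s0];           // load from global memory
st_group_f32 $s1, [%lds_buf];       // store into group (work-group local) memory
barrier;                            // make the group store visible to the work-group
ld_group_f32 $s2, [%lds_buf];       // any work-item in the group can now read it
st_private_f32 $s2, [%tmp];         // per-work-item scratch in the private segment
ld_f32 $s3, [$s0];                  // no segment given: a flat address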


14 | hsail AFDS | June 11, 2012

SEGMENTS

[Figure: segment containment. Private and Group sit within the flat address space; arg and spill locations (and private read-write data) sit within Private; kernarg and ReadOnly sit within Global. Private is per work-item, Group is per work-group, and Global spans the NDRange/agent.]


15 | hsail AFDS | June 11, 2012

HSAIL FEATURES: REGISTERS AND TYPES

Four classes of registers:

c – 1 bit

s – 32 bits

d – 64 bits

q – 128 bits

Both Binary (BRIG) and text format

The binary format is fully specified

120 opcodes (Java byte code has 200)

Types

Scalar: Brigs8, Brigs16, Brigs32, Brigs64, Brigu8, Brigu16, Brigu32, Brigu64, Brigf16, Brigf32, Brigf64, Brigb1, Brigb8, Brigb16, Brigb32, Brigb64, Brigb128

Image and sampler: BrigROImg, BrigRWImg, BrigSamp

Packed: Brigu8x4, Brigs8x4, Brigu8x8, Brigs8x8, Brigu8x16, Brigs8x16, Brigu16x2, Brigs16x2, Brigf16x2, Brigu16x4, Brigs16x4, Brigf16x4, Brigu16x8, Brigs16x8, Brigf16x8, Brigu32x2, Brigs32x2, Brigf32x2, Brigu32x4, Brigs32x4, Brigf32x4, Brigu64x2, Brigs64x2, Brigf64x2


16 | hsail AFDS | June 11, 2012

WHY DOES HSAIL LOOK THIS WAY?

A SIMT model (single instruction, multiple threads) claims that every work-item has a program counter

So branch instructions look pretty natural

A vector machine model looks like SSE: one program counter and vector registers. This is like real AMD GPU hardware

SIMT or Vector?


17 | hsail AFDS | June 11, 2012

PROS FOR SIMT

We want HSAIL to outlast one hardware generation (so at the very least the vector length and the real types/number of registers should not get exposed). Even with a vector model the finalizer will still have to map to the real vector length; we expected this to mean that a vector finalizer would not have a much simpler time.

We want to support lots of machines, including ones not built by AMD.

We can add cross-lane operations (like count) to the SIMT model, so the line between SIMT and vector is blurry.

We want to open up to 3rd-party compilers and tools, all of which can support SIMT but few of which can support vector.

Work groups are a much more developer-friendly model than wavefronts.

Natural path for OpenCL™ / CUDA™ / C++ AMP™.

Graphics is SIMT, so the pressure to make future hardware work well for SIMT is immense.


18 | hsail AFDS | June 11, 2012

PROS FOR VECTOR

Might get more performance; we estimated <10% even in good cases.

Simpler for expert programmers to reason out what is going on. This was a big one for us: the exact rules on wavefront re-convergence are hidden in the SIMT model but clear in the vector one.

In the vector model you can prove some results about code, which cannot be done when the finalizer reorders things.

On the other hand, constructs like C++ virtual functions become very confusing on a vector machine, where the original program was SIMT.

We think the performance deficits are a reasonable trade for broader adoption, and in many cases they can be closed by well-written libraries for the cases that really matter.


19 | hsail AFDS | June 11, 2012

HSAIL AND FUNCTIONS

{

arg_u32 %input1;

arg_u32 %input2;

// …

call &fnWithTwoArgs ()(%input1, %input2); // call of a function

// all work-items call the same function

}

// ...

HSAIL supports

Virtual functions,

Signatures

Jumps via a register

Load address of code
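
For context, a definition matching the call above might look like the following hand-written sketch (not from the slides; the ld_arg spelling, the function body, and the reading of the empty first parentheses as the output-argument list are assumptions based on the draft syntax):

// Hypothetical definition matching the call above; the first () is the
// (empty) output-argument list, the second holds the input arguments.
function &fnWithTwoArgs()(arg_u32 %input1, arg_u32 %input2)
{
  ld_arg_u32 $s0, [%input1];   // read the incoming arguments
  ld_arg_u32 $s1, [%input2];
  // ... use $s0 and $s1 ...
  ret;
};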


20 | hsail AFDS | June 11, 2012

HSAIL PROVIDES A SERIES OF OPTIMIZATION CONTROLS

Sometimes you know if an operation is uniform over a range

ld_f32_width(8) $s1, address

Work items in groups of 8 will read the same value

call_width(64) $s1

Even though this is a call through a register, work items in groups of 64 will call the same function

ld_equiv(3)_u32 $s1, address

The load is from equivalence class 3, a block of memory that cannot alias with other blocks


21 | hsail AFDS | June 11, 2012

HSAIL COMPARED TO LLVM-IR

HSAIL is low level

assumes finalizer does not do as much optimization

no phi nodes,

finite register count

No SSA input

Parallelism is built into HSAIL

No need to hack the meaning of a barrier

No structures or other high level features


22 | hsail AFDS | June 11, 2012

HSAIL COMPARED TO JAVA BYTE CODE

HSAIL is more focused on performance,

HSAIL has registers, not a stack

HSAIL has parallelism built in

HSAIL is not as focused on security (does not require a formal validator)

Not quite write once

HSAIL is less concerned about code compression


23 | hsail AFDS | June 11, 2012

HSAIL COMPARED TO AMDIL

HSAIL supports lots of complex control flow

AMDIL provides structured control flow only

irreducible flow needed exponential compile time

No (or limited) graphics features

just enough for C++ AMP™ and OpenCL™

Four sizes of registers (1/32/64/128 bit) vs. 4x32 vector registers (no more .x, .y, .z, .w fields)

HSAIL is extendable (per vendor/per chip extensions)

Different cost model


24 | hsail AFDS | June 11, 2012

HSAIL COMPARED TO PTX

More formal model of execution

possible to write valid programs that pass data between work groups

More formal model of memory - acq/rel semantics

Less semantics defined by the device

Support for libraries and complex calls

Interaction between agents and HSAIL code: shared memory, support for the GPU to call CPU services

Per vendor extension mechanism

Clean separation of core features and per device operations

Support for linking/ libraries/ separate compilation

Removal of hard to finalize features

no predication


25 | hsail AFDS | June 11, 2012

MEMORY MODEL

A memory model defines how writes by one work-item or agent become visible to other work-items and agents.

For many implementations, better performance will result if either the hardware or the finalizer is allowed to reorder code. For example, the finalizer might find it more efficient if a write is moved later in the program; so long as the program semantics do not change, the finalizer is free to do so. Once a store is deferred, other work-items and agents will not see it until the store actually happens. Hardware might provide a cache that also defers writes.

The HSAIL memory model is based on acquire/release semantics.

An ld_acq creates a “downward fence.” This means that normal loads and stores can be moved (by the implementation) down past the ld_acq but no memory operation (load, store, or atomic) can be moved up above the ld_acq.

A st_rel creates an “upward fence.” That means that normal loads and stores can be moved (by the implementation) above the st_rel but no memory operation (load, store, or atomic) can be moved down after the st_rel.
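
As a hedged illustration, in the same simplified notation the later examples use (hand-written; &flag, &data, the typed _u32 spellings, and the cmp/cbr spin loop are assumptions, and forward progress of the spin depends on how work-items are scheduled), a flag-based producer/consumer looks like this:

Work-item 0 (producer):
  st_u32 42, [&data];            // write the payload
  st_rel_u32 1, [&flag];         // release: the payload store cannot move below this

Work-item 1 (consumer):
  @spin:
  ld_acq_u32 $s0, [&flag];       // acquire: later operations cannot move above this load
  cmp_eq_b1_u32 $c0, $s0, 0;     // flag still clear?
  cbr $c0, @spin;                // keep waiting
  ld_u32 $s1, [&data];           // once the flag is seen as 1, this must see 42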


26 | hsail AFDS | June 11, 2012

Original Axiomatic Definition [Lamport 1979]

A single processor (core) is sequentially consistent if "the result of an execution is the same as if the operations had been executed in the order specified by the program."

A multiprocessor is sequentially consistent if "the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program."


27 | hsail AFDS | June 11, 2012

SEQUENTIAL CONSISTENCY (SC) OPERATIONAL DEFINITION

System

1 memory

P simple processors

Operation: pick one ready row, do it, and repeat until done

[Figure: P processor boxes (Processor 0 … Processor P-1), each ready to load/store memory, all connected to one MEMORY]


28 | hsail AFDS | June 11, 2012

SEQUENTIAL CONSISTENCY

Any SC implementation must only permit executions allowed by SC operational model (SC executions).

The SC operational model is NOT a performance model.

SC implementation performance != counting operational model steps

The operational model hides most implementation techniques

pipelining, out-of-order, speculation, caches, cache coherence, …

HW must functionally behave "as if" it were the operational model

HW designers & verifiers are often most comfortable with the operational model

Each processor is eventually selected


29 | hsail AFDS | June 11, 2012

HSAIL OPERATIONAL DEFINITION

System

1 (host) memory

P simple processors

Reorder buffer: writes can get held in it, and reads can be satisfied from it

Operation: Pick one ready row, do it, & repeat until done

write values may stay in the reorder buffer; reads may come out of the reorder buffer

Rules to move between the reorder buffer and memory: rel = release the values from the buffer, acq = acquire new values

[Figure: P processor boxes (Processor 0 … Processor P-1), each ready to load/store memory, connected to one MEMORY through the reorder buffers]


30 | hsail AFDS | June 11, 2012

WITHIN ONE WORK ITEM: SEQUENCED BEFORE

This is the order operations appear in the source

What you see looking at the code

single work item - “as-if-serial” view

- each operation appears to happen in the order it appears in the source

X sb Y

- X and Y in same work item,

- X sequenced before Y

multiple work items and agents make this more complex


31 | hsail AFDS | June 11, 2012

BETWEEN WORK ITEMS

X >> Y

What the memory system sees

memory system must see X before Y

global visibility order

this is transitive

if X >> Y and Y >> Z, then X >> Z


32 | hsail AFDS | June 11, 2012

RULES, SOMETIMES

X sb Y => X >> Y holds only sometimes:

• X sb Y, same address: then X >> Y

• Different address:

– If there is a barrier or sync between X and Y, then X >> Y

• If X is an acquire:

– ld_acq, atomic_acq, atomicNoRet_acq, atomic_ar, atomicNoRet_ar

– Then X >> Y

– This is one-sided (Y cannot move before X)

The general rule is use acquire and release when you want to force order

Acquire and Release may take extra time, but they give you sequential consistency

Compilers can trade performance for simple cross work-item communication


33 | hsail AFDS | June 11, 2012

• If Y is a release:

– st_rel, atomic_ar, or atomicNoRet_ar: then X >> Y

– st_rel is another one-way fence

• Consider a critical region (you can use acquire and release to form critical sections):

• ld_acq x

• Assorted memory operations

• st_rel y

• No operations can move out, but operations can move in


34 | hsail AFDS | June 11, 2012

AN EXAMPLE: SB ORDER DOES NOT FORCE MEMORY ORDER

Work-item 0 Work-item 1

------------------- ------------------------------------

@h0: st_u32 1, [&a] @k0: st_u32 1, [&b]

@h1: ld_u32 $s0, [&b] @k1: ld_u32 $s1, [&a]

Initially, &a = 0 and &b = 0. The outcome $s0 = 0 and $s1 = 0 is allowed.

Constraints are added because readers have to follow writers: k1 (the reader) has to happen before h0 changes the value. (Constraints can also be caused by synchronization.) One global visibility order consistent with this outcome is h1 >> k1 >> h0 >> k0.

Even though h0 appears first (in sequenced-before order) before h1, there is no requirement that the operations appear in text order (sequenced-before order) to the memory system.


35 | hsail AFDS | June 11, 2012

EXAMPLE 2: REGISTER DEPENDENCE DOES NOT FORCE MEMORY ORDER

Work-item 0 Work-item 1

----------------------- ---------------------

@h0: ld $s0, [&a] @j0: st 20, [100]

@h1: ld $s1, [$s0] @j1: st_rel 100, [&a]

Initially, &a and contents of location 100 = 0.

$s1 == 0 and $s0 == 100 is allowed

If $s1 == 0, then h1 >> j0. If $s0 == 100, then j1 >> h0.

Because this seems to violate dependence order, it is useful to consider how it can come about.

Work-item 0 is allowed to prefetch load h1. One reason it might do this is that code before these operations reads address 96, and the implementation reads in large cache lines. Later, work-item 0 reads the new value of &a, which is 100. Then it reads the value of location 100, but because there is no synchronization, it can use the previously prefetched value of 0.


36 | hsail AFDS | June 11, 2012

EXAMPLE 3

Work-item 0 Work-item 1

@h0: ld_acq $s0, [&a] @j0: st 20, [100]

@h1: ld $s1, [$s0] @j1: st_rel 100, [&a]

Initially, &a and the contents of location 100 = 0.

HSAIL does not allow $s1 == 0 and $s0 == 100: the st_rel orders j0 before j1, and the ld_acq prevents h1 from being satisfied ahead of h0, so once h0 observes 100, h1 must observe 20.


37 | hsail AFDS | June 11, 2012

QUESTIONS?