SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

OCELOT: SUPPORTED DEVICES

Source: comparch.gatech.edu/hparch/tutorial_slides/hpca2012/hpca... · 2012-02-23


Overview

• Multicore-Backend
• NVIDIA GPU Backend
• AMD GPU Backend
• Ocelot PTX Emulator


Multicore CPU Backend: Introduction

• Target: Efficient execution of PTX kernels on CPUs
  • ISA translation from PTX to LLVM
  • Execution-model translation from PTX thread hierarchy to serialized PTX threads
• Light-weight thread scheduler
• LLVM just-in-time compilation to x86
  • LLVM transformations applied before code generation


Some Interesting Features

Serialization Transforms

JIT for Parallel Code

Utilize all resources


Translation to CPUs: Thread Fusion

[Figure labels: Thread Blocks; Thread serialization; Multicore Host Threads; Execution Manager (thread scheduling, context management); one worker pthread per CPU core; execute a kernel]

• Execution model translation
• Thread scheduling
• Dealing with specialized operations, e.g., custom hardware
• Control flow restructuring
• Resource management (multiple cores)
• Multiple address spaces

J. Stratton, S. Stone, and W.-m. Hwu, “MCUDA: An efficient implementation of CUDA kernels on multi-cores,” University of Illinois at Urbana-Champaign, Tech. Rep. IMPACT-08-01, March 2008.

G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems,” PACT, October 2010.
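The execution-model translation on this slide can be pictured with a small sketch. It is not Ocelot's implementation: `ThreadContext`, `runKernel`, and `vectorAdd` are hypothetical names chosen for illustration. The sketch only shows the idea of running the threads of a block as a loop and spreading blocks over a fixed number of worker threads.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-thread context: the indices a PTX kernel would read
// from ctaid.x, tid.x, and ntid.x.
struct ThreadContext {
    unsigned ctaid;  // thread block (CTA) index
    unsigned tid;    // thread index inside the block
    unsigned ntid;   // threads per block
};

using Kernel = std::function<void(const ThreadContext&)>;

// Run a 1-D grid on the CPU. Blocks are distributed over worker threads,
// and the threads of each block are serialized into a loop.
void runKernel(const Kernel& kernel, unsigned blocks, unsigned threadsPerBlock,
               unsigned workers) {
    workers = std::max(1u, workers);
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&kernel, w, workers, blocks, threadsPerBlock] {
            for (unsigned cta = w; cta < blocks; cta += workers) {       // blocks owned by this worker
                for (unsigned tid = 0; tid < threadsPerBlock; ++tid) {   // serialized threads
                    kernel(ThreadContext{cta, tid, threadsPerBlock});
                }
            }
        });
    }
    for (std::thread& t : pool) t.join();
}

// Usage example: element-wise vector add expressed as a "kernel".
std::vector<int> vectorAdd(const std::vector<int>& a, const std::vector<int>& b,
                           unsigned threadsPerBlock, unsigned workers) {
    const unsigned n = static_cast<unsigned>(a.size());
    std::vector<int> c(n);
    const unsigned blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    runKernel([&](const ThreadContext& t) {
        const unsigned i = t.ctaid * t.ntid + t.tid;  // global index
        if (i < n) c[i] = a[i] + b[i];                // guard the partial last block
    }, blocks, threadsPerBlock, workers);
    return c;
}
```

Each worker writes disjoint elements of `c`, so the sketch needs no locking. A kernel containing barriers would need the splitting described on the Thread Serialization slide.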


Ocelot Source Code: Multicore CPU Backend

• ocelot/
  • executive/
    • interface/MulticoreCPUDevice.h
    • interface/LLVMContext.h
    • interface/LLVMExecutableKernel.h
    • interface/LLVMCooperativeThreadArray.h
    • interface/LLVMModuleManager.h
    • interface/TextureOperations.h
  • ir/
    • interface/LLVMInstruction.h
  • translator/
    • interface/PTXToLLVMTranslator.h
  • transforms/
    • interface/SubkernelFormationPass.h
    • interface/RemoveBarrierPass.h


Multicore CPU: ISA Translation

• Translate PTX IR to LLVM intermediate representation
• Arithmetic instructions have one-to-few mapping
• Load-store architectures
• Special instructions and registers handled by LLVM intrinsics (e.g. cos, clock64, bar.sync)
• Texture sampling calls Ocelot’s texture library
• LLVMContext contains pointers to address spaces, next entry ID, thread ID
• Custom LLVM IR implementation insulates Ocelot from LLVM changes
• LLVM requires SSA form -> Ocelot converts PTX to SSA
• Remove predication
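One common way to remove predication from a side-effect-free instruction is to compute the result unconditionally and then select between the new and old value. The sketch below assumes that technique; the slides do not say which rewrite Ocelot applies. `Instr` and `removePredication` are hypothetical and far simpler than `ir::PTXInstruction`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified instruction record.
struct Instr {
    std::string guard;   // predicate register, empty if unpredicated
    std::string opcode;
    std::string dst;
    std::string a;
    std::string b;
    std::string c;       // third source operand (used by selp)
};

// Rewrite "@p op d, a, b" as
//     op   tmp, a, b
//     selp d, tmp, d, p     // d = p ? tmp : d
// Only valid for instructions without side effects; stores and branches
// need a different treatment.
std::vector<Instr> removePredication(const Instr& in, const std::string& tmp) {
    if (in.guard.empty()) return {in};  // nothing to do
    Instr compute{"", in.opcode, tmp, in.a, in.b, ""};
    Instr select{"", "selp", in.dst, tmp, in.dst, in.guard};
    return {compute, select};
}
```

For example, `@p1 add.s32 r3, r1, r2` with temporary `t0` becomes an unguarded `add.s32 t0, r1, r2` followed by a `selp` that keeps the old `r3` when `p1` is false.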


PTX to LLVM ISA Translation

    //
    // ocelot/translator/implementation/PTXToLLVMTranslator.cpp
    //
    void PTXToLLVMTranslator::_translateAdd( const ir::PTXInstruction& i )
    {
        if( ir::PTXOperand::isFloat( i.type ) )
        {
            // A floating-point PTX add becomes a single LLVM fadd.
            ir::LLVMFadd add;
            ir::LLVMInstruction::Operand result = _destination( i );

            add.a = _translate( i.a );
            add.b = _translate( i.b );
            add.d = result;

            _llvmKernel->_statements.push_back( ir::LLVMStatement( add ) );
        }
        else
        {
            ...
        }
    }

• Translate each PTX instruction to LLVM IR instruction sequence
• Special PTX registers and instructions mapped to LLVM intrinsics:
  • llvm.readcyclecounter()
  • llvm.sqrt.f32()
• Result is LLVM function implementing PTX kernel
• Should be invertible if coupled to LLVM->PTX code generator (not implemented)
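To make the "one-to-few" mapping concrete, here is a toy translator that emits LLVM IR as text. It is a sketch under stated assumptions, not Ocelot code: the `translate` function, the three opcodes it handles, and the register naming are all chosen for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy translator: maps one textual PTX opcode to a short sequence of LLVM IR
// lines. d is the destination register name; a, b, c are source registers.
// Unknown opcodes yield an empty sequence.
std::vector<std::string> translate(const std::string& op, const std::string& d,
                                   const std::string& a, const std::string& b,
                                   const std::string& c) {
    if (op == "add.f32") {
        // One-to-one: PTX add.f32 -> LLVM fadd.
        return {"%" + d + " = fadd float %" + a + ", %" + b};
    }
    if (op == "mad.lo.s32") {
        // One-to-few: multiply-add becomes mul followed by add.
        return {"%" + d + ".mul = mul i32 %" + a + ", %" + b,
                "%" + d + " = add i32 %" + d + ".mul, %" + c};
    }
    if (op == "sqrt.rn.f32") {
        // Special instruction: mapped to an LLVM intrinsic call.
        return {"%" + d + " = call float @llvm.sqrt.f32(float %" + a + ")"};
    }
    return {};
}
```

Calling `translate("mad.lo.s32", "r4", "r1", "r2", "r3")` yields two IR lines, while `add.f32` yields one.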


Thread Serialization

• Thread loops
  • Enter next executable region via scheduler block
• Barriers:
  • Store live values into thread-local memory, return to thread scheduler
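A hand-written sketch of what the two bullets describe, for a kernel with a single barrier. All names (`ThreadState`, `region0`, `region1`, `runBlock`) are hypothetical, and the real transformation operates on PTX and generates this structure automatically.

```cpp
#include <cassert>
#include <vector>

// Per-thread state kept across barriers.
struct ThreadState {
    int nextRegion = 0;  // region the scheduler block dispatches to next
    int spilledX = 0;    // live value saved in "thread-local memory"
};

// Region 0: code before the barrier. Returns the region to resume at.
int region0(unsigned tid, std::vector<int>& shared, ThreadState& st) {
    int x = static_cast<int>(tid) * 2;
    shared[tid] = x;
    st.spilledX = x;  // spill the live value, then yield to the scheduler
    return 1;
}

// Region 1: code after the barrier. Returns -1 when the thread is done.
int region1(unsigned tid, unsigned n, const std::vector<int>& shared,
            const ThreadState& st, std::vector<int>& out) {
    int x = st.spilledX;  // restore the live value
    out[tid] = x + shared[(tid + 1) % n];
    return -1;
}

// Thread loop: every thread runs up to its next barrier before any thread
// continues past it, which preserves barrier semantics on one CPU thread.
std::vector<int> runBlock(unsigned n) {
    std::vector<int> shared(n), out(n);
    std::vector<ThreadState> st(n);
    bool progress = true;
    while (progress) {
        progress = false;
        for (unsigned tid = 0; tid < n; ++tid) {
            switch (st[tid].nextRegion) {  // scheduler block
                case 0: st[tid].nextRegion = region0(tid, shared, st[tid]); progress = true; break;
                case 1: st[tid].nextRegion = region1(tid, n, shared, st[tid], out); progress = true; break;
                default: break;            // thread has exited
            }
        }
    }
    return out;
}
```

With four threads, every thread first fills `shared` with `{0, 2, 4, 6}`, and only then does any thread read its neighbour's slot.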


Execution Management

• Translation takes place over (sub)kernels
• Code cache for translated kernels
• Must synthesize thread scheduling (serialization) code

[Figure: Thread Serialization]


Spilling Live Values

    //
    // ocelot/analysis/implementation/RemoveBarrierPass.cpp
    //
    void RemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block,
        const DataflowGraph::Block::RegisterSet& alive )
    {
        unsigned int bytes = 0;

        // Load the address of the spill stack into a new register.
        ir::PTXInstruction move( ir::PTXInstruction::Mov );

        move.type = ir::PTXOperand::u64;
        move.a.identifier = "__ocelot_remove_barrier_pass_stack";
        move.a.addressMode = ir::PTXOperand::Address;
        move.a.type = ir::PTXOperand::u64;
        move.d.reg = _kernel->dfg()->newRegister();
        move.d.addressMode = ir::PTXOperand::Register;
        move.d.type = ir::PTXOperand::u64;

        _kernel->dfg()->insert( block, move, block->instructions().size() - 1 );
        ...


Spilling Live Values (continued)

        ...
        // Store each live register to local memory at a running byte offset.
        for( DataflowGraph::Block::RegisterSet::const_iterator
            reg = alive.begin(); reg != alive.end(); ++reg )
        {
            ir::PTXInstruction save( ir::PTXInstruction::St );

            save.type = reg->type;
            save.addressSpace = ir::PTXInstruction::Local;

            // Destination: [stack base register + offset]
            save.d.addressMode = ir::PTXOperand::Indirect;
            save.d.reg = move.d.reg;
            save.d.type = ir::PTXOperand::u64;
            save.d.offset = bytes;

            bytes += ir::PTXOperand::bytes( save.type );

            // Source: the live register itself
            save.a.addressMode = ir::PTXOperand::Register;
            save.a.type = reg->type;
            save.a.reg = reg->id;

            _kernel->dfg()->insert( block, save, block->instructions().size() - 1 );
        }

        // Keep the largest spill size seen so far.
        _spillBytes = std::max( bytes, _spillBytes );
    }
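The offset bookkeeping in the spill loop can be isolated in a few lines. This is a sketch with hypothetical types (`LiveRegister`, `SpillSlot`, `layoutSpills`), not part of Ocelot. It mirrors the running `bytes` counter and the final `std::max` and, like the excerpt, adds no alignment padding.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct LiveRegister {
    unsigned id;     // register number
    unsigned bytes;  // size of the register's type in bytes
};

struct SpillSlot {
    unsigned reg;     // register stored in this slot
    unsigned offset;  // byte offset from the spill-stack base
};

// Assign each live register a slot at a running byte offset and grow
// spillBytes so that it covers the largest spill area requested so far.
std::vector<SpillSlot> layoutSpills(const std::vector<LiveRegister>& alive,
                                    unsigned& spillBytes) {
    std::vector<SpillSlot> slots;
    unsigned bytes = 0;
    for (const LiveRegister& r : alive) {
        slots.push_back(SpillSlot{r.id, bytes});
        bytes += r.bytes;
    }
    spillBytes = std::max(bytes, spillBytes);
    return slots;
}
```

Calling it once per barrier with the same `spillBytes` variable leaves that variable holding the size of the largest spill area, which is the amount of per-thread storage the stack needs.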


Using the Multicore Backend

• Edit configure.ocelot
  • Controls Ocelot’s initial state
  • Located in application’s startup directory
• executive controls device properties
• trace:
  • Trace generators may be active for devices other than the PTX Emulator
  • Only initialize() and finish() are called
  • event() and postEvent() are never called
  • Enables a uniform interface for profiling kernel launches
• executive:
  • devices:
    • llvm – efficient execution of PTX on multicore CPU
  • optimizationLevel – basic, none, full, memory, debug
  • workerThreadLimit – number of worker threads
• optimizations:
  • subkernelSize – size of subkernels in instructions
  • simplifyCFG – whether to apply CFG simplification pass
  • hoistSpecialValues – whether to load LLVMContext values at launch of kernel

    executive: {
        devices: [ llvm ],
        asynchronousKernelLaunch: true,
        optimizationLevel: none,
        workerThreadLimit: 1,
        warpSize: 1
    },
    optimizations: {
        subkernelSize: 1000,
        simplifyCFG: true,
        hoistSpecialValues: true
    },


NVIDIA GPU: Introduction

• Executes PTX kernels on GPUs via the CUDA Driver API
• Thin layer on top of CUDA Driver API
• Ocelot enables rewriting of PTX kernels
  • Register reallocation
  • Runtime optimizations
  • Instrumentation


Ocelot Source Code: NVIDIA GPU Device Backend

• ocelot/
  • executive/
    • interface/NVIDIAGPUDevice.h
    • interface/NVIDIAExecutableKernel.h


Using the NVIDIA GPU Backend

• Edit configure.ocelot
  • Controls Ocelot’s initial state
  • Located in application’s startup directory
• executive controls device properties
• trace:
  • Trace generators may be active for devices other than the PTX Emulator
  • Only initialize() and finish() are called
  • event() and postEvent() are never called
  • Enables a uniform interface for profiling kernel launches
• executive:
  • devices:
    • nvidia – invokes NVIDIA GPU backend

    executive: {
        devices: [ nvidia ],
    },


Dynamic Instrumentation

• Run-time generation of user-defined, custom instrumentation code for CUDA kernels
  • Harness chip-level instrumentation when possible
• Instrumentation data to drive
  • Off-line workload characterization
  • On-line debugging & program optimization
  • On-line resource management
• Inspired in part by the Pin [1] infrastructure

[1] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. “Pin: Building customized program analysis tools with dynamic instrumentation,” PLDI '05.

PhD Student: Naila Farooqui, joint with K. Schwan and A. Gavrilovska


Instrumentation Support in Ocelot

• High-level C constructs to define instrumentation + (C-to-PTX) JIT
• Integration with system management software and dynamic compiler
• Online resource management based on profiling
• Additional Instrumentor APIs to provide criteria for instrumentation
• Selectively perform instrumentation on kernels
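As a rough picture of what an instruction-count instrumentation computes, the sketch below bumps one counter per basic-block entry and multiplies by the block's static size afterwards. It runs on the host and uses hypothetical names (`BasicBlock`, `Counters`, `onBlockEntry`, `dynamicInstructionCount`). It does not use the Instrumentor API, which these slides do not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BasicBlock {
    unsigned instructions;  // static number of instructions in the block
};

struct Counters {
    std::vector<std::uint64_t> blockExecutions;  // one counter per basic block
};

// The "inserted" instrumentation: a single counter increment at block entry.
inline void onBlockEntry(Counters& c, unsigned blockId) {
    ++c.blockExecutions[blockId];
}

// Replay a trace of executed block IDs and derive the dynamic instruction
// count from the per-block counters.
std::uint64_t dynamicInstructionCount(const std::vector<BasicBlock>& blocks,
                                      const std::vector<unsigned>& trace) {
    Counters c{std::vector<std::uint64_t>(blocks.size(), 0)};
    for (unsigned id : trace) onBlockEntry(c, id);

    std::uint64_t total = 0;
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        total += c.blockExecutions[i] * blocks[i].instructions;
    }
    return total;
}
```

Counting at block granularity keeps the inserted code to one increment per block rather than one per instruction, which is the usual reason for this design.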


Custom Instrumentation

• Transparent profiling and characterization of library implementations

[Figure labels: CUDA, Libraries, nvcc, PTX, Ocelot Run Time; Lynx: Instrumentation APIs, Instrumentor, C-on-Demand JIT, C-PTX Translator, PTX-PTX Transformer; Example Instrumentation Code]


Instrumentation: Instruction count

* Scan (CUDA SDK)


Remote Device Layer

• Remote procedure call layer for Ocelot device calls
• Execute local applications that run kernels remotely
• Multi-GPU applications can become multi-node


Switchable Compute

• Switch devices at runtime
  • Load balancing
  • Instrumentation
  • Fault-and-emulate
  • Remote execution


AMD GPU Backend

• Executes PTX kernels on GPUs via the CAL Driver API
• Rewritten PTX kernels (for optimization, instrumentation, etc.) are also translated for the AMD backend
• Ocelot Device Interface:
  • Module registration
  • Memory management
  • Global/Shared/Constant/Parameter memory allocation
  • Kernel launches
  • Translation from PTX to IL
  • Texture management
  • OpenGL interoperability
  • Streams and Events

Rodrigo Dominguez, Dana Schaa, and David Kaeli. “Caracal: Dynamic Translation of Runtime Environments for GPUs.” In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4.


AMD Evergreen Architecture

• AMD Radeon HD 5870
  • 20 SIMD cores
  • 16 Stream Cores (SC) per SIMD core
  • Each SC is VLIW-5
  • A total of 1600 ALUs
  • Wavefronts of 64 threads
  • Peak is 2.72 TFLOPS (SP) and 544 GFLOPS (DP)
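The ALU count and the peak rates follow from the per-core figures with a little arithmetic. The check below assumes an 850 MHz engine clock and one fused multiply-add (two floating-point operations) per ALU per cycle. Neither number is on the slide, so both are stated assumptions. The 1/5 double-precision ratio reflects one DP fma per five-lane stream core.

```cpp
#include <cassert>

// Figures taken from the slide.
constexpr unsigned kSimdCores = 20;
constexpr unsigned kStreamCoresPerSimd = 16;
constexpr unsigned kLanesPerStreamCore = 5;  // VLIW-5

constexpr unsigned kAlus = kSimdCores * kStreamCoresPerSimd * kLanesPerStreamCore;
static_assert(kAlus == 1600, "20 x 16 x 5 ALUs");

// Assumptions, not from the slide: engine clock and FLOPs per ALU per cycle.
constexpr unsigned kClockMHz = 850;
constexpr unsigned kFlopsPerAluPerCycle = 2;  // one fused multiply-add

// Peak rates in MFLOPS (integer arithmetic keeps the check exact).
constexpr unsigned kPeakSpMflops = kAlus * kFlopsPerAluPerCycle * kClockMHz;
constexpr unsigned kPeakDpMflops = kPeakSpMflops / kLanesPerStreamCore;

static_assert(kPeakSpMflops == 2720000, "2.72 TFLOPS single precision");
static_assert(kPeakDpMflops == 544000, "544 GFLOPS double precision");
```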


AMD Evergreen Architecture

[Figure labels: One SIMD Engine; One Stream Core; General Purpose Registers; Processing Elements; T-Processing Element; Branch Execution Unit; Instruction and Control Flow. Source: AMD OpenCL University Kit]

Each Stream Core includes:

• 4 Processing Elements
  • 4 independent SP or integer operations
  • 2 DP operations
  • 1 DP fma or mult operation
• 1 Special Function Unit
  • 1 SP or integer operation
  • SP or DP transcendental
• Branch Execution Unit
• GPR = 5.24 MB


AMD Evergreen Architecture

• Local Data Share
  • 2 TB/s
  • 32 KB per SIMD
• Global Data Share
  • Shared between all threads in a kernel
  • Low latency global reductions
• L1 (8 KB)
• L2
  • 512 KB
  • 450 GB/s
• Global Memory
  • GDDR5 153 GB/s
  • CompletePath
  • FastPath


Memory Hierarchy

[Figure labels: SIMD Engine (Local Mem + Registers); L1 Cache; Crossbar; L2 Cache; Write Cache; Atomic Path; Global Memory]


Memory Hierarchy

• Benefits from vector operations (int4, float4)
• Atomics are faster on local memory than in global memory (FastPath vs CompletePath)
• Typical tiled data layout for images to hit L1 cache
• Compiler optimizes by:
  • Minimizing ALU code
  • Maximizing number of threads
  • Scheduling instructions to increase VLIW packing


Address Spaces

• Unordered Access Views (raw)
  • 8 different UAVs
  • Byte-addressable (linear)
  • Dword (4 bytes) alignment
  • Arena UAV for sub-dword data
• Constant Buffers
  • Non-linear addressing (x, y, z, w components)
• Local Data Share
  • Byte-addressable (linear)
  • Dword-aligned and dword-sized (pack/unpack overhead)
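The "pack/unpack overhead" comes from emulating sub-dword accesses on a memory that only moves aligned 32-bit words. The host-side sketch below shows the read-modify-write this implies. `storeByte` and `loadByte` are hypothetical helpers, and little-endian byte lanes are assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Store one byte into a memory that can only be read and written as aligned
// 32-bit words: read the word, clear the target byte lane, merge, write back.
void storeByte(std::vector<std::uint32_t>& mem, std::uint32_t byteAddr,
               std::uint8_t value) {
    const std::uint32_t word = byteAddr / 4;
    const std::uint32_t shift = (byteAddr % 4) * 8;  // little-endian lane
    const std::uint32_t mask = 0xFFu << shift;
    mem[word] = (mem[word] & ~mask) | (static_cast<std::uint32_t>(value) << shift);
}

// Load one byte: read the containing word and extract the lane.
std::uint8_t loadByte(const std::vector<std::uint32_t>& mem, std::uint32_t byteAddr) {
    return static_cast<std::uint8_t>(mem[byteAddr / 4] >> ((byteAddr % 4) * 8));
}
```

A single byte store thus costs a load, two logical operations, a shift, and a store, which is why sub-dword data is comparatively expensive in a dword-sized address space.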


Translation from PTX to IL

PTX

• RISC style syntax
• Load-Store instruction set
• Registers are typed and scalar
• Unlimited virtual registers
• Predicate registers
• Control flow based on branches and labels
• Designed for compute (GPGPU)

    .entry vecAdd (.param .u64 A, .param .u64 B, .param .u64 C, .param .s32 N)
    {
        mov.u16 rh1, ctaid.x;
        mov.u16 rh2, ntid.x;
        mul.wide.u16 r1, rh1, rh2;
        cvt.u32.u16 r2, tid.x;
        add.u32 r3, r2, r1;
        ld.param.s32 r4, [N];
        setp.le.s32 p1, r4, r3;
        @p1 bra Label_1;
        ...
    }


Translation from PTX to IL

IL

• Registers are 32-bit and vectors (4 components)
• Registers have no type
• Swizzles and destination modifiers
• Resources are globally scoped
• Structured control flow (if-end, while-end)
• Designed for graphics, not compute (see FSAIL)

    il_cs_2_0
    dcl_raw_uav_id(0)
    dcl_cb cb0[2]
    dcl_cb cb1[4]
    dcl_literal l0, 4, 4, 4, 4
    mov r0.x, vThreadGrpId.x
    mov r1.x, cb0[0].x
    imul r2.x, r0.x, r1.x
    mov r3.x, vTidInGrp.x
    iadd r4.x, r3.x, r2.x
    mov r5.x, cb1[3].x
    ige r6.x, r4.x, r5.x
    if_logicalz r6.x
        ...
    endif
    end


AMD GPU Backend

• Validated over 30 applications from the CUDA SDK
• Support for pre-compiled libraries
• Device selection can be made at runtime
• What is supported?
  • Global memory (cudaMalloc, cudaMemcpy)
  • Shared memory (including extern)
  • Constant memory (no caching)
  • Atomics (global and shared)
  • Barriers and Fences
  • 30+ PTX instructions


Ocelot Source Code: AMD GPU Device Backend

• ocelot/
  • analysis/
    • interface/StructuralAnalysis.h
  • executive/
    • interface/ATIGPUDevice.h
    • interface/ATIExecutableKernel.h
  • transforms/
    • interface/StructuralTransform.h


Using the AMD GPU Backend

• Edit configure.ocelot
  • Controls Ocelot’s initial state
  • Located in application’s startup directory
• executive controls device properties
• trace:
  • Trace generators may be active for devices other than the PTX Emulator
  • Only initialize() and finish() are called
  • event() and postEvent() are never called
  • Enables a uniform interface for profiling kernel launches
• executive:
  • devices:
    • amd – invokes AMD GPU backend

    executive: {
        devices: [ amd ],
    },


Unstructured to Structured Control Flow*

• Branch divergence is a key factor in GPU performance
  • Its impact differs depending on whether the control flow is structured or unstructured
• Not all GPUs support unstructured CFGs directly
  • Using dynamic translation to support AMD GPUs**

* H. Wu, G. Diamos, S. Li, and S. Yalamanchili. Characterization and Transformation of Unstructured Control Flow in GPU Applications. CACHES, 2011.

** R. Dominguez, D. Schaa, and D. Kaeli. Caracal: Dynamic translation of runtime environments for GPUs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pages 5–11. ACM, 2011.


Structured/Unstructured Control Flow

• Structured control flow has a single entry and a single exit
• Unstructured control flow has multiple entries or exits

[Figure: structured control-flow patterns with their entry and exit points: if-then-else, for-loop/while-loop, do-while-loop]


Sources of Unstructured Control Flow (1/2)

• goto statement of C/C++
• Language semantics

    if ((cond1() || cond2()) && (cond3() || cond4()))
    {
        ……
    }

• Not all conditions need to be evaluated
• Sub-graphs in red circles have 2 exits

[Figure: control-flow graph from entry to exit with blocks B1 (bra cond1()), B2 (bra cond2()), B3 (bra cond3()), B4 (bra cond4()), and B5 (……)]
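Short-circuit evaluation is what makes this condition unstructured: depending on the outcomes, control leaves the chain of tests at different points. The small sketch below records which of the four conditions actually run, assuming the grouping `(cond1 || cond2) && (cond3 || cond4)`. The `evaluated` helper is hypothetical.

```cpp
#include <cassert>
#include <string>

// Evaluate (cond1 || cond2) && (cond3 || cond4) with the given outcomes and
// return a log of which conditions were evaluated, e.g. "13".
// `taken` receives the value of the whole condition.
std::string evaluated(bool c1, bool c2, bool c3, bool c4, bool& taken) {
    std::string log;
    auto cond = [&log](int n, bool value) {
        log += static_cast<char>('0' + n);  // record that condN ran
        return value;
    };
    taken = (cond(1, c1) || cond(2, c2)) && (cond(3, c3) || cond(4, c4));
    return log;
}
```

When `cond1` and `cond3` are both true only those two run, whereas a false `cond1` followed by a false `cond2` exits after two tests without ever reaching `cond3`.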


Re-convergence in AMD & Intel GPUs

• AMD IL does not support arbitrary branches
  • It also uses ELSE, LOOP, ENDLOOP, etc.
• Intel GEN5 works in a similar manner

C code:

    if (i < N)
    {
        C[i] = A[i] + B[i];
    }

AMD IL:

    ige r6, r4, r5
    if_logicalz r6
        uav_raw_load_id(0) r11, r10
        uav_raw_load_id(0) r14, r13
        iadd r17, r16, r8
        uav_raw_store_id(0) r17, r15
    endif


Re-converge at immediate post-dominator

[Figure: two timelines for threads T0 to T6, each running from Entry through blocks B1 to B5 to Exit over numbered steps 1 to 12, shown next to the control-flow graph with blocks B1 (bra cond1()), B2 (bra cond2()), B3 (bra cond3()), B4 (bra cond4()), and B5 between entry and exit]
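The immediate post-dominator of a branch can be computed with a standard iterative data-flow pass. The sketch below does this for the short-circuit example. The edge list in `shortCircuitCfg` is an assumption reconstructed from the condition `(cond1 || cond2) && (cond3 || cond4)`, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of set bits in a node set.
static int countBits(std::uint32_t x) {
    int n = 0;
    for (; x != 0; x &= x - 1) ++n;
    return n;
}

// Immediate post-dominator of every node (at most 32 nodes).
// succ[v] lists the successors of v; exitNode is the unique exit.
// The result holds -1 for the exit node.
std::vector<int> immediatePostDominators(const std::vector<std::vector<int>>& succ,
                                         int exitNode) {
    const int n = static_cast<int>(succ.size());
    const std::uint32_t all = (n >= 32) ? ~0u : ((1u << n) - 1u);
    std::vector<std::uint32_t> pdom(n, all);  // pdom[v]: nodes post-dominating v
    pdom[exitNode] = 1u << exitNode;

    // pdom(v) = {v} united with the intersection of pdom(s) over successors s.
    for (bool changed = true; changed;) {
        changed = false;
        for (int v = 0; v < n; ++v) {
            if (v == exitNode) continue;
            std::uint32_t meet = all;
            for (int s : succ[v]) meet &= pdom[s];
            const std::uint32_t next = meet | (1u << v);
            if (next != pdom[v]) { pdom[v] = next; changed = true; }
        }
    }

    // Strict post-dominators of v form a chain; the closest one has the
    // largest post-dominator set of its own.
    std::vector<int> ipdom(n, -1);
    for (int v = 0; v < n; ++v) {
        int bestSize = -1;
        for (int p = 0; p < n; ++p) {
            if (p == v || (pdom[v] & (1u << p)) == 0) continue;
            const int size = countBits(pdom[p]);
            if (size > bestSize) { bestSize = size; ipdom[v] = p; }
        }
    }
    return ipdom;
}

// Assumed CFG. Nodes: 0 entry, 1..5 = B1..B5, 6 exit.
std::vector<std::vector<int>> shortCircuitCfg() {
    return {{1}, {3, 2}, {3, 6}, {5, 4}, {5, 6}, {6}, {}};
}
```

On this graph every block B1 to B5 has the exit as its immediate post-dominator, because B2 and B4 can each jump straight to the exit. Threads that diverge at B1 therefore cannot re-converge before the exit under this policy.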