Overview of Ocelot: architecture


SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

OVERVIEW OF OCELOT: ARCHITECTURE

Overview

• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager

Ocelot: Multiplatform Dynamic Compilation

• Just-in-time code generation and optimization for data-intensive applications
• Environment for i) compiler research, ii) architecture research, and iii) productivity tools

[Figure: language front end and data-parallel IR. Credits: esd.lbl.gov; R. Dominguez & D. Kaeli (NEU)]

NVIDIA’s Compute Unified Device Architecture (CUDA)

• Integrate the concept of a compute kernel called from standard languages
  • Multithreaded host programs
• The compute kernel specifies data-parallel computation as thousands of threads
• An accelerator model of computing
  • Explicit functions for off-loading computation to GPUs
  • Data movement explicitly managed by the programmer

NVIDIA’s Compute Unified Device Architecture (CUDA)

[Figure: host and GPU]

For access to CUDA tutorials: http://developer.nvidia.com/cuda-education-training

Structure of a Compute Kernel

• Arrays of (data-parallel) thread blocks called cooperative thread arrays (CTAs)
• Barrier synchronization
• Mapped to a single instruction stream, multiple data stream (SIMD) processor
• Parallel Thread Execution (PTX) instruction set architecture

NVIDIA Fermi GF 100

• 4 Graphics Processing Clusters (GPCs) containing 4 SMs each
• Each SM has 32 ALUs, 4 SFUs, and 16 load/store (LS) units
• Each ALU has access to 1024 32-bit registers (total of 128 kB per SM)
• Each SM has its own Shared Memory/L1 cache (64 kB total)
• Unified L2 cache (768 kB)
• Six 64-bit Memory Controllers (total 384 bits wide)

[Figure labels: ALU, Streaming multiprocessor (SM)]

Ocelot Structure [1]

• Ocelot is built with nvcc and the LLVM backend
• Structured around a PTX IR to LLVM IR translator
• Compile stock CUDA applications without modification

[Figure labels: CUDA Application, nvcc, PTX Kernel]

[1] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimizing Compiler for Bulk Synchronous Applications in Heterogeneous Systems,” PACT, September 2010.

CUDA to PTX

• PTX modules are stored as string literals in the fat binary
• We ignore the accompanying binary image (GPU native binary)


Dependencies

Software
• C++ compiler (GCC 4.5.x)
• Lex lexer generator (Flex 2.5.35)
• YACC parser generator (Bison 2.4.1)
• SCons (Python 2.7)
• LLVM (3.1)

Libraries
• boost_system (1.46)
• boost_filesystem (1.46)
• boost_serialization (1.46)
• GLEW (1.5; optional, for GL interop)
• GL (for NVIDIA GPU devices)

Library headers
• Boost (1.46)

http://code.google.com/p/gpuocelot/wiki/Installation

Ocelot Source Code

• Freely available via the Google Code project site (New BSD License)
• ocelot/
  • analysis/ -- analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- implements CUDA runtime
  • executive/ -- Device interface and backend implementations
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
  • parser/ -- parser (to PTX)
  • tools/ -- standalone applications using Ocelot
  • trace/ -- trace generation and analysis tools
  • translator/ -- translators from PTX to LLVM and AMD IL
  • transforms/ -- program transformations

http://code.google.com/p/gpuocelot/

    svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only

Building GPU Ocelot

• Obtain source code
    svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only
• Compile with SCons
    sudo ./build.py --install
• Build and execute unit tests
    sudo ./build.py --test=full
• Output appears in .release_build
  • libocelot.so
  • OcelotConfig
  • Tests
• Installation directories: /usr/local/include/ocelot, /usr/local/lib

http://code.google.com/p/gpuocelot/wiki/Installation

Configuring Ocelot

• configure.ocelot
  • Controls Ocelot’s initial state
  • Located in application’s startup directory
  • trace specifies which trace generators are initially attached
  • executive controls device properties
• trace:
  • memoryChecker - ensures
  • raceDetector - enforces synchronized access to .shared
  • debugger - interactive debugger
• executive:
  • devices: list of Ocelot backend devices that are enabled
    • nvidia - NVIDIA GPU backend
    • emulated - Ocelot PTX emulator (trace generators)
    • llvm - efficient execution of PTX on multicore CPU
    • amd - translation to AMD IL for PTX on AMD RADEON GPU

Example configure.ocelot:

    {
        trace: {
            memoryChecker: {
                enabled: true,
                checkInitialization: false
            },
            raceDetector: {
                enabled: false,
                ignoreIrrelevantWrites: true
            },
            debugger: {
                enabled: false,
                kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
                alwaysAttach: true
            },
        },
        executive: {
            devices: [ "emulated" ],
        }
    }

Building and Executing CUDA Programs

    nvcc -c example.cu -arch sm_23
    g++ -o example example.o `OcelotConfig -l`

• `OcelotConfig -l` expands to ‘-locelot’
• libocelot.so replaces libcudart.so


CUDA Runtime API

• Ocelot implements the CUDA Runtime API
  • Transparent hooks into existing CUDA applications
  • Override methods of cuda::CudaDeviceInterface
• Maps CUDA RT onto the Ocelot device interface abstraction
  • cuda::CudaRuntime
• Extended through custom Ocelot API
  • e.g. ocelot::registerPTXModule();

Ocelot CUDA Runtime Overview

• Kernels execute anywhere
  • Key to portability!
• A reimplementation of the CUDA Runtime API
• Compatible with existing applications
  • Link against libocelot.so instead of libcudart

R. Dominguez & D. Kaeli (NEU)

Ocelot CUDA Runtime

• Clean device abstraction
  • All back-ends implement the same interface
• Ocelot API extensions
  • Add/remove trace generators
  • Compile/launch kernels directly in PTX
  • Device memory sharing among host threads
  • Device switching

Ocelot Source Code: CUDA Runtime API

• ocelot/
  • analysis/ -- analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- implements CUDA runtime
    • interface/CudaRuntimeInterface.h
    • interface/CudaRuntime.h
    • interface/CudaRuntimeContext.h
    • interface/FatBinaryContext.h
    • interface/CudaDriverFrontend.h
  • executive/ -- Device interface and backend implementations
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
  • parser/ -- parser (to PTX)
  • tools/ -- standalone applications using Ocelot
  • trace/ -- trace generation and analysis tools
  • translator/ -- translators from PTX to LLVM and AMD IL
  • transforms/ -- program transformations

Ocelot CUDA Runtime API Implementation

• Implement interface defined by cuda::CudaRuntimeInterface
  • ocelot/cuda/interface/CudaRuntime.h
  • ocelot/cuda/implementation/CudaRuntime.cpp
  • class cuda::CudaRuntime
• cuda::CudaRuntime members
  • Host thread contexts
  • Ocelot devices
  • Registered modules, textures, kernels
  • Fat binaries
  • Global mutex
• CUDA Runtime API functions
  • e.g. cudaMemcpy, cudaLaunch, __cudaRegisterModule()
• Additional functions
  • e.g. _lock(), _unlock(), _registerModule()

Ocelot Source Code: Device Interface

• ocelot/
  • executive/ -- Device interface and backend implementations
    • interface/Device.h
    • interface/EmulatorDevice.h
    • interface/NVIDIAGPUDevice.h
    • interface/MulticoreCPUDevice.h
    • interface/ATIGPUDevice.h

Ocelot Device Interface

• class executive::Device
• Succinct interface for device objects
  • Module registration
  • Memory management
  • Kernel configuration and launching
  • Global variable and texture management
  • OpenGL interoperability
  • Streams and Events
  • Trace generators
• Minimal set of APIs for device-oriented programming model
  • 57 functions (versus CUDA Runtime’s 120+)
• Capture device state
  • Memory allocations, global variables, textures, graphics interoperability
• Facilitate creation of backend execution targets
  • Implement Device interface
• Enable multiple API front ends
  • Implement front ends targeting Device interface
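The separation between front ends and backends described above can be illustrated with a small sketch. This is not Ocelot's executive::Device; ToyDevice, ToyEmulatorDevice, and runSequence are hypothetical names, and the interface is reduced to allocation, copies, and a launch. The point is only that front-end code written against the abstract class would run unchanged on any backend that implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical, heavily reduced device interface (NOT Ocelot's executive::Device).
class ToyDevice {
public:
    // A "kernel" here is a host callback invoked once per thread id.
    using Kernel = std::function<void(std::vector<unsigned char>&, unsigned)>;

    virtual ~ToyDevice() = default;
    virtual std::string name() const = 0;
    virtual int allocate(std::size_t bytes) = 0;
    virtual void copyToDevice(int handle, const void* src, std::size_t bytes) = 0;
    virtual void copyToHost(void* dst, int handle, std::size_t bytes) = 0;
    virtual void launch(const Kernel& kernel, int handle, unsigned threads) = 0;
};

// One possible backend: keeps "device memory" in host vectors and runs
// the threads one after another.
class ToyEmulatorDevice : public ToyDevice {
public:
    std::string name() const override { return "emulated"; }

    int allocate(std::size_t bytes) override {
        int handle = _next++;
        _allocations[handle].resize(bytes);
        return handle;
    }

    void copyToDevice(int handle, const void* src, std::size_t bytes) override {
        std::vector<unsigned char>& buffer = _allocations.at(handle);
        assert(bytes <= buffer.size());
        std::memcpy(buffer.data(), src, bytes);
    }

    void copyToHost(void* dst, int handle, std::size_t bytes) override {
        const std::vector<unsigned char>& buffer = _allocations.at(handle);
        assert(bytes <= buffer.size());
        std::memcpy(dst, buffer.data(), bytes);
    }

    void launch(const Kernel& kernel, int handle, unsigned threads) override {
        std::vector<unsigned char>& buffer = _allocations.at(handle);
        for (unsigned tid = 0; tid < threads; ++tid) kernel(buffer, tid);
    }

private:
    int _next = 0;
    std::map<int, std::vector<unsigned char>> _allocations;
};

// Front-end code: depends only on ToyDevice, never on a concrete backend.
// Each thread writes A[tid] = tid + 1.
std::vector<int> runSequence(ToyDevice& device, unsigned n) {
    std::vector<int> host(n, 0);
    int handle = device.allocate(n * sizeof(int));
    device.copyToDevice(handle, host.data(), n * sizeof(int));
    device.launch([](std::vector<unsigned char>& memory, unsigned tid) {
        int value = static_cast<int>(tid) + 1;
        std::memcpy(memory.data() + tid * sizeof(int), &value, sizeof(int));
    }, handle, n);
    device.copyToHost(host.data(), handle, n * sizeof(int));
    return host;
}
```

Adding a second backend in this sketch would mean writing another subclass of ToyDevice; runSequence would not change.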


Ocelot PTX Intermediate Representation (IR)

• Backend compiler framework for PTX
• Full-featured PTX IR
  • Class hierarchy for PTX instructions/directives
  • PTX control flow graph
  • Static single-assignment form
  • Dataflow/dominance analysis
  • Enables PTX optimization
• IR to IR translation
  • From PTX to other IRs
  • LLVM (x86/PowerPC/ARM)
  • CAL (AMD GPUs)

Ocelot Source Code: Intermediate Representation

• ocelot/
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
    • interface/Module.h
    • interface/PTXInstruction.h
    • interface/PTXOperand.h
    • interface/PTXKernel.h
    • interface/ControlFlowGraph.h
    • interface/ILInstruction.h
    • interface/LLVMInstruction.h
  • parser/ -- parser (to PTX)
    • interface/PTXParser.h

Ocelot PTX Internal Representation

• C++ classes representing PTX module
  • ir::PTXModule
  • ir::PTXKernel
  • ir::PTXInstruction
  • ir::PTXOperand
  • ir::GlobalVariable
  • ir::LocalVariable
  • ir::Parameter
• Ocelot PTX Parser target, Emitter source
  • ir::PTXInstruction::valid()
• Translator source
  • PTX to LLVM
  • PTX to AMD IL
• Suitable for analysis and transformation
• Executable representation
  • PTX Emulator

Ocelot PTX IR: Kernels

    .global .f32 globalVariable;

    .entry sequence (
        .param .u64 __cudaparm_sequence_A,
        .param .s32 __cudaparm_sequence_N)
    {
        .reg .u32 %r<11>;
        .reg .u64 %rd<6>;
        .local .u32 %rp0;
        . . .
    $LDWbegin_sequence:
        ld.param.s32 %r6, [__cudaparm_sequence_N];
        setp.le.s32 %p1, %r6, %r5;
        @%p1 bra $Lt_0_1026;
        . . .
    $Lt_0_1026:
        exit;
    $LDWend_sequence:
    } // sequence

[Figure annotations on the listing: ir::Module, ir::Kernel, ir::BasicBlock, ir::Local, ir::Parameter, ir::Global]

Ocelot PTX IR: Instructions

    add.s32 %r7, %r5, 1;
    ld.param.u64 %rd1, [__cudaparm_sequence_A];
    cvt.s64.s32 %rd2, %r5;
    mul.wide.s32 %rd3, %r5, 4;
    add.u64 %rd4, %rd1, %rd3;
    st.global.s32 [%rd4 + 0], %r7;
    @%p1 bra $Lt_0_6146;

Figure annotations:
• ir::BasicBlock contains the instruction sequence; each entry is an ir::PTXInstruction
• ir::PTXInstruction fields: opcode, addressSpace, dataType, d, a, guard predicate
• ir::PTXOperand address modes: address, register, immediate, indirect, label

Control and Data-Flow Graphs

• Data structure for representing kernels
• Basic blocks
  • fall-through and branch edges
  • instruction vector
  • label
• Traversals:
  • pre-order, topological, post-order
  • iterator visits blocks
• Data-flow graph overlays CFG
  • definition-use chains explicit
  • to and from SSA form
• CFG Transformations:
  • split blocks, edges
• DFG Transformations:
  • insert and remove values
  • iterate over def-use
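To make the traversal orders named on the slide above concrete, here is a small sketch over a toy adjacency-list CFG. It does not use Ocelot's ir::ControlFlowGraph; ToyCfg, postOrder, and reversePostOrder are hypothetical names, and the graph is assumed to store only successor indices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy CFG: cfg[b] lists the blocks that block b can
// branch or fall through to.
using ToyCfg = std::vector<std::vector<std::size_t>>;

// Depth-first helper: a block is appended only after every successor
// reachable through it has been appended (post-order).
void visitPostOrder(const ToyCfg& cfg, std::size_t block,
                    std::vector<bool>& visited,
                    std::vector<std::size_t>& order) {
    visited[block] = true;
    for (std::size_t next : cfg[block]) {
        assert(next < cfg.size());
        if (!visited[next]) visitPostOrder(cfg, next, visited, order);
    }
    order.push_back(block);
}

// Post-order over the blocks reachable from entry.
std::vector<std::size_t> postOrder(const ToyCfg& cfg, std::size_t entry) {
    std::vector<bool> visited(cfg.size(), false);
    std::vector<std::size_t> order;
    if (entry < cfg.size()) visitPostOrder(cfg, entry, visited, order);
    return order;
}

// Reverse post-order; for an acyclic CFG this is a topological order.
std::vector<std::size_t> reversePostOrder(const ToyCfg& cfg, std::size_t entry) {
    std::vector<std::size_t> order = postOrder(cfg, entry);
    std::reverse(order.begin(), order.end());
    return order;
}
```

Forward data-flow analyses commonly visit blocks in reverse post-order so that most predecessors are processed before their successors.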

Example: Control-Flow Graphs

    // example: splits basic blocks containing barriers
    //
    for (ir::ControlFlowGraph::iterator bb_it = kernel->cfg()->begin();
        bb_it != kernel->cfg()->end(); ++bb_it) { // iterate over basic blocks

        unsigned int n = 0;
        ir::BasicBlock::InstructionList::iterator inst_it;

        for (inst_it = (bb_it)->instructions.begin();
            inst_it != (bb_it)->instructions.end(); ++inst_it, n++) { // iterate over instructions in *bb_it

            const ir::PTXInstruction *inst =
                static_cast<const ir::PTXInstruction *>(*inst_it);

            if (inst->opcode == ir::PTXInstruction::Bar) {
                if (n + 1 < (unsigned int)(bb_it)->instructions.size()) {
                    std::string label = (bb_it)->label + "_bar";

                    // split block containing bar.sync so that it’s always
                    // the last instruction in a block
                    kernel->cfg()->split_block(bb_it, n+1,
                        ir::BasicBlock::Edge::FallThrough, label);
                }
                break;
            }
        } // end for (inst_it)
    } // end for (bb_it)
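The same barrier-splitting idea can be tried without building Ocelot. The sketch below works on a toy block type rather than ir::BasicBlock; ToyBlock and splitAtBarriers are hypothetical names, and instructions are assumed to be plain strings whose opcode comes first.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy basic block: a label plus instruction strings.
struct ToyBlock {
    std::string label;
    std::vector<std::string> instructions;
};

// Returns a new block list in which any "bar.sync" instruction is the
// last instruction of its block. The tail of a split block becomes a new
// fall-through block labeled "<label>_bar".
std::vector<ToyBlock> splitAtBarriers(const std::vector<ToyBlock>& blocks) {
    std::vector<ToyBlock> result;
    for (const ToyBlock& block : blocks) {
        assert(!block.label.empty());  // derived labels need a base label
        ToyBlock current{block.label, {}};
        for (std::size_t i = 0; i < block.instructions.size(); ++i) {
            const std::string& instruction = block.instructions[i];
            current.instructions.push_back(instruction);
            bool isBarrier = instruction.rfind("bar.sync", 0) == 0;  // prefix test
            bool isLast = (i + 1 == block.instructions.size());
            if (isBarrier && !isLast) {
                std::string nextLabel = current.label + "_bar";
                result.push_back(current);
                current = ToyBlock{nextLabel, {}};
            }
        }
        result.push_back(current);
    }
    return result;
}
```

Unlike an in-place split on a real CFG, this version copies blocks into a new list, which keeps the sketch free of iterator-invalidation concerns.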

Example: Spilling Live Values

    // ocelot/analysis/implementation/RemoveBarrierPass.cpp
    //
    void RemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block,
        const DataflowGraph::Block::RegisterSet& alive )
    {
        unsigned int bytes = 0;

        // move the address of the spill stack into a fresh register
        ir::PTXInstruction move( ir::PTXInstruction::Mov );

        move.type = ir::PTXOperand::u64;
        move.a.identifier = "__ocelot_remove_barrier_pass_stack";
        move.a.addressMode = ir::PTXOperand::Address;
        move.a.type = ir::PTXOperand::u64;

        move.d.reg = _kernel->dfg()->newRegister();
        move.d.addressMode = ir::PTXOperand::Register;
        move.d.type = ir::PTXOperand::u64;

        // insert before the last instruction of the block
        _kernel->dfg()->insert( block, move, block->instructions().size() - 1 );

        ...

        // store each live register to local memory at increasing offsets
        for( DataflowGraph::Block::RegisterSet::const_iterator reg = alive.begin();
            reg != alive.end(); ++reg )
        {
            ir::PTXInstruction save( ir::PTXInstruction::St );

            save.type = reg->type;
            save.addressSpace = ir::PTXInstruction::Local;

            save.d.addressMode = ir::PTXOperand::Indirect;
            save.d.reg = move.d.reg;
            save.d.type = ir::PTXOperand::u64;
            save.d.offset = bytes;

            bytes += ir::PTXOperand::bytes( save.type );

            save.a.addressMode = ir::PTXOperand::Register;
            save.a.type = reg->type;
            save.a.reg = reg->id;

            _kernel->dfg()->insert( block, save, block->instructions().size() - 1 );
        }

        _spillBytes = std::max( bytes, _spillBytes );
    }

IR for AMD and LLVM

LLVM IR
• Implements all of the LLVM instruction set
• Decouples the translator from the LLVM project
• Easier to construct than LLVM’s actual IR

AMD IL
• Supports translation from PTX to AMD interface

Emitters construct parseable string representations of modules

AMD Backend: R. Dominguez & D. Kaeli (NEU)


PTX PassManager

• Orchestrates analysis and transformation passes
• Derived from LLVM model
  • Analysis passes generate meta-data
  • Meta-data consumed by transformations
  • Transformation passes modify the IR

Using the Pass Manager

• Passes added to a manager
• Schedules execution
• Manages analysis meta-data
  • Ensures meta-data available
  • Up to date; not redundantly computed
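The meta-data bookkeeping described above (compute on demand, reuse while valid, recompute after the IR changes) can be sketched with a toy manager. This is not Ocelot's PassManager API; ToyPassManager, ToyPass, and the single InstructionCountAnalysis are hypothetical, and invalidation is simplified to "discard everything when a pass reports a change".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy kernel and meta-data store.
using ToyKernel = std::vector<std::string>;   // one string per instruction
using ToyMetadata = std::map<int, int>;       // analysis id -> result

enum ToyAnalysisId { InstructionCountAnalysis = 1 };

struct ToyPass {
    std::string name;
    int requiredAnalyses;  // bitmask of ToyAnalysisId values
    // Returns true if the pass modified the kernel.
    std::function<bool(ToyKernel&, const ToyMetadata&)> run;
};

class ToyPassManager {
public:
    int analysesComputed = 0;  // how often an analysis actually ran

    void addPass(ToyPass pass) { _passes.push_back(std::move(pass)); }

    void runOnKernel(ToyKernel& kernel) {
        for (const ToyPass& pass : _passes) {
            assert(pass.run && "every pass needs a body");
            // Compute a required analysis only if no valid result is cached.
            if ((pass.requiredAnalyses & InstructionCountAnalysis) != 0 &&
                _metadata.count(InstructionCountAnalysis) == 0) {
                _metadata[InstructionCountAnalysis] = static_cast<int>(kernel.size());
                ++analysesComputed;
            }
            // A pass that changed the kernel invalidates cached meta-data.
            if (pass.run(kernel, _metadata)) _metadata.clear();
        }
    }

private:
    std::vector<ToyPass> _passes;
    ToyMetadata _metadata;
};

// Read-only pass: records the instruction count it was given.
ToyPass makeCountReader(std::vector<int>& seen) {
    return {"CountReader", InstructionCountAnalysis,
            [&seen](ToyKernel&, const ToyMetadata& metadata) {
                seen.push_back(metadata.at(InstructionCountAnalysis));
                return false;
            }};
}

// Transformation pass: deletes "nop" instructions.
ToyPass makeNopRemover() {
    return {"NopRemover", 0,
            [](ToyKernel& kernel, const ToyMetadata&) {
                std::size_t before = kernel.size();
                kernel.erase(std::remove(kernel.begin(), kernel.end(),
                                         std::string("nop")),
                             kernel.end());
                return kernel.size() != before;
            }};
}
```

With two readers scheduled back to back, the count is computed once and shared; a reader scheduled after the remover triggers a recomputation because the cached value was discarded.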

Analysis Passes

• Analysis runs over the PTX IR
  • Generates meta-data
  • Modifies PTX IR
  • Possibly updates or invalidates existing meta-data
• Examples
  • Data-flow graph
  • Dominator and Post-dominator trees
  • Thread frontiers

Analysis Passes – Supported Analysis Structures

• Control Flow Graph
  • ir/interface/ControlFlowGraph.h
• Data Flow Graph
  • analysis/interface/DataflowGraph.h
• Dominator and Post-Dominator Trees
  • analysis/interface/DominatorTree.h
  • analysis/interface/PostDominatorTree.h
• Superblock Analysis
  • analysis/interface/SuperblockAnalysis.h
• Divergence Graph
  • analysis/interface/DivergenceGraph.h
• Thread Frontiers
  • analysis/interface/ThreadFrontiers.h

Transformation Passes

• Modify the PTX IR
• Consume meta-data
• Examples:
  • Dead-code elimination
    • transforms/interface/DeadCodeEliminationPass.h
  • Control-flow structuring
    • transforms/interface/StructuralTransform.h
  • Sync elimination
    • transforms/interface/SyncElimination.h
  • Dynamic instrumentation

Example: Dead Code Elimination Transformation Pass

Dead Code Elimination

• Approach
  • Run once on each kernel
  • Consume data-flow analysis meta-data
  • Delete instructions producing values with no users
• Implementation
  • transforms/interface/DeadCodeEliminationPass.h
  • transforms/implementation/DeadCodeEliminationPass.cpp

Dead Code Elimination (1 of 5)

Set up pass dependencies

    DeadCodeEliminationPass::DeadCodeEliminationPass()
        : KernelPass(Analysis::DataflowGraphAnalysis
              | Analysis::StaticSingleAssignment,
              "DeadCodeEliminationPass")
    {
    }

Dead Code Elimination (2 of 5)

Run pass: get analysis metadata

    void DeadCodeEliminationPass::runOnKernel(ir::IRKernel& k)
    {
        Analysis* dfgAnalysis = getAnalysis(Analysis::DataflowGraphAnalysis);
        assert(dfgAnalysis != 0);

        // cast up
        analysis::DataflowGraph& dfg =
            *static_cast<analysis::DataflowGraph*>(dfgAnalysis);
        assert(dfg.ssa());

        ...

Dead Code Elimination (3 of 5)

Loop until no change

    // queue every block for processing
    BlockSet blocks;
    for (iterator block = dfg.begin(); block != dfg.end(); ++block)
    {
        report(" Queueing up BB_" << block->id());
        blocks.insert(block);
    }

    // process queued blocks until the set is empty
    while (!blocks.empty())
    {
        iterator block = *blocks.begin();
        blocks.erase(blocks.begin());
        eliminateDeadInstructions(dfg, blocks, block);
    }

Dead Code Elimination (4 of 5)

Remove unused live-out values

    AliveKillList aliveOutKillList;
    for (RegisterSet::iterator aliveOut = block->aliveOut().begin();
        aliveOut != block->aliveOut().end(); ++aliveOut)
    {
        if (canRemoveAliveOut(dfg, block, *aliveOut))
        {
            report(" removed " << aliveOut->id);
            aliveOutKillList.push_back(aliveOut);
        }
    }

    for (AliveKillList::iterator killed = aliveOutKillList.begin();
        killed != aliveOutKillList.end(); ++killed)
    {
        block->aliveOut().erase(*killed);
    }

Dead Code Elimination (5 of 5)

Check if an instruction can be removed

    if (ptx.hasSideEffects()) return false;

    for (RegisterPointerVector::iterator reg = instruction->d.begin();
        reg != instruction->d.end(); ++reg)
    {
        // the reg is alive outside the block
        if (block->aliveOut().count(*reg) != 0) return false;

        InstructionVector::iterator next = instruction;
        for (++next; next != block->instructions().end(); ++next)
        {
            for (RegisterPointerVector::iterator source = next->s.begin();
                source != next->s.end(); ++source)
            {
                // found a user in the block
                if (*source->pointer == *reg->pointer) return false;
            }
        }
    }

Dead Code Elimination

• Repeat for
  • phi instructions
  • Other instructions
  • alive-in values
• Ensures meta-data is valid
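The removal test from the preceding slides can be exercised on a toy IR. The sketch below is a simplification, not the DeadCodeEliminationPass source: DceInstruction, DceBlock, canRemove, and eliminateDeadInstructions are hypothetical names, there is only one block, each register is assumed to be defined once (SSA), and phi instructions and alive-in sets are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy instruction in SSA form.
struct DceInstruction {
    std::string opcode;
    int dest;                  // register written, or -1 if none
    std::vector<int> sources;  // registers read
    bool hasSideEffects;       // e.g. stores, barriers
};

struct DceBlock {
    std::vector<DceInstruction> instructions;
    std::set<int> aliveOut;    // registers still needed after the block
};

// An instruction is removable if it has no side effects, its result is
// not live out of the block, and no later instruction reads it.
bool canRemove(const DceBlock& block, std::size_t index) {
    assert(index < block.instructions.size());
    const DceInstruction& instruction = block.instructions[index];
    if (instruction.hasSideEffects) return false;
    if (instruction.dest < 0) return true;  // writes nothing, affects nothing
    if (block.aliveOut.count(instruction.dest) != 0) return false;
    for (std::size_t next = index + 1; next < block.instructions.size(); ++next) {
        const std::vector<int>& sources = block.instructions[next].sources;
        if (std::find(sources.begin(), sources.end(), instruction.dest) != sources.end())
            return false;  // found a user in the block
    }
    return true;
}

// Repeats the scan until nothing changes, because deleting one instruction
// can leave the instructions that fed it without users.
std::size_t eliminateDeadInstructions(DceBlock& block) {
    std::size_t removed = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = block.instructions.size(); i-- > 0;) {
            if (canRemove(block, i)) {
                block.instructions.erase(
                    block.instructions.begin() + static_cast<std::ptrdiff_t>(i));
                ++removed;
                changed = true;
            }
        }
    }
    return removed;
}
```

Scanning from the end of the block lets a chain of dead instructions disappear in a single sweep, since each deletion exposes the instruction before it.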

Running Passes on PTX

• Static optimizer
  • PTXOptimizer
  • Runs passes on PTX assembly files
  • ocelot/tools/PTXOptimizer.cpp
• JIT optimization
  • Runs passes before kernels are launched
  • ocelot/api/implementation/OcelotRuntime.cpp

Questions

GPU Ocelot
• Google Code site: http://code.google.com/p/gpuocelot
• Research project site: http://gpuocelot.gatech.edu
• Mailing list: gpuocelot@googlegroups.com

Contributors
Gregory Diamos, Rodrigo Dominguez, Naila Farooqui, Andrew Kerr, Ashwin Lele, Si Li, Tri Pho, Jin Wang, Haicheng Wu, Sudhakar Yalamanchili

Sponsors
AMD, IBM, Intel, LogicBlox, NSF, NVIDIA